Re: [PATCH v2 3/3] ARM: davinci: da850: add EHRPWM & ECAP DT node

2013-03-21 Thread Sekhar Nori
On 3/21/2013 1:31 PM, Philip, Avinash wrote:
> On Wed, Mar 20, 2013 at 18:17:59, Peter Korsgaard wrote:
>>> "Sekhar" == Sekhar Nori  writes:
>>
>>  Sekhar> On 3/20/2013 12:11 PM, Philip Avinash wrote:
>>  >> Add da850 EHRPWM & ECAP DT node.
>>  >> Also adds OF_DEV_AUXDATA for EHRPWM & ECAP driver to use EHRPWM & ECAP
>>  >> clock.
>>  >> 
>>  >> Signed-off-by: Philip Avinash 
>>  >> ---
>>  >> Changes since v1:
>>  >> - Reusing ti,am33xx as compatible field as both IP's are
>>  >> same with am33xx platform and da850 has no platform specific
>>  >> dependency.
>>
>>  Sekhar> Which is fine, but I think the binding documentation still needs to be
>>  Sekhar> updated to document the ti,da850-ehrpwm binding. Looping Peter (it is
>>  Sekhar> always a good idea to CC folks who reviewed your patch last time
>>  Sekhar> around). Also, please Cc the DT folks and devicetree-discuss list too
>>  Sekhar> for their opinion.
>>
>> Yes, thanks for CC'ing me. I also think da850-* should be
>> documented. See Documentation/devicetree/bindings/gpio/8xxx_gpio.txt for
>> a similar (old) example.
> 
> Peter,
> 
> In this binding file too, only the initial compatible fields are documented.
> Later on, many compatibles were added in the driver but never added back to
> the binding doc.

Probably someone forgot to keep updating the binding doc after a point.

> 
> 
> ---
>   followed by "fsl,mpc8349-gpio" for 83xx, "fsl,mpc8572-gpio" for 85xx and
>   "fsl,mpc8610-gpio" for 86xx.
> ---
> 
> ---
> static struct of_device_id mpc8xxx_gpio_ids[] __initdata = {
> { .compatible = "fsl,mpc8349-gpio", },
> { .compatible = "fsl,mpc8572-gpio", },
> { .compatible = "fsl,mpc8610-gpio", },
> { .compatible = "fsl,mpc5121-gpio", .data = mpc512x_irq_set_type, },
> { .compatible = "fsl,pq3-gpio", },
> { .compatible = "fsl,qoriq-gpio",   },
> {}
> };
> ---
> 
> Grant/Rob,
> With respect to Peter's explanation (below), what is your opinion on adding
> information to the binding doc for da850 support?
> 
> Peter --> if the hardware block is identical the dts should simply list:
> Peter --> compatible = "ti,da850-ecap", "ti,am33xx-ecap"
> Peter --> And the driver only bind to ti,am33xx-ecap (unless there ever needs to
> Peter --> be a da850 specific workaround).
> 
> Or
> should I update both the binding doc & the driver to use a new compatible option,
> "ti,da830-ecap" (as the da830 platform has the initial IP support)?
> Even with this, platforms da830, da850 and am33xx can reuse the "ti,da830-ecap"
> compatible field.

To me, updating the driver to keep adding a .compatible even when it is not
used elsewhere in the code is not required. Adding the new binding in the
.dts file is a must, since you may need to do something specific to da830
at a later time (and the .dtb should be considered ROM'ed, so you won't be
able to change it then). Adding documentation for the binding is also
required to help anyone wanting to know more about the binding after
reading the .dts file.

Thanks,
Sekhar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
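
The fallback scheme Peter and Sekhar describe in the message above (a node
lists a SoC-specific string first and a generic string second, while the
driver only knows the generic one) can be illustrated with a small userspace
sketch. This is not the kernel's matching code: struct compat_id and
first_match() are hypothetical stand-ins, and the real OF matching logic is
more involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct of_device_id. */
struct compat_id {
    const char *compatible;
};

/* Driver table: binds only to the generic am33xx string. */
static const struct compat_id ecap_ids[] = {
    { "ti,am33xx-ecap" },
    { NULL }
};

/*
 * Walk the node's compatible list from most to least specific and
 * return the first string that the driver table also lists, or NULL
 * if the driver knows none of them.
 */
static const char *first_match(const char *const *node_compat,
                               const struct compat_id *table)
{
    for (size_t i = 0; node_compat[i] != NULL; i++)
        for (size_t j = 0; table[j].compatible != NULL; j++)
            if (strcmp(node_compat[i], table[j].compatible) == 0)
                return node_compat[i];
    return NULL;
}
```

With a node listing "ti,da850-ecap", "ti,am33xx-ecap", the sketch binds through
the second string. A da850-specific entry could later be added to the table
without touching the already shipped .dtb, which is the point Sekhar makes.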


Re: [PATCH 1/2] sysfs: fix race between readdir and lseek

2013-03-21 Thread Li Zefan
On 2013/3/21 12:48, Ming Lei wrote:
> On Thu, Mar 21, 2013 at 11:28 AM, Li Zefan  wrote:
>> On 2013/3/21 11:17, Ming Lei wrote:
>>> On Thu, Mar 21, 2013 at 10:41 AM, Li Zefan  wrote:

>>>> In fact the same race exists between readdir() and read()/write()...
>>>
>>> Fortunately, no read()/write() are implemented on sysfs directory, :-)
>>>
>>
>> That's irrelevant...
> 
> As far as sysfs is concerned, the filp->f_ops can't be changed in
> read/write path.
> 

Yes, it can...As I said, it's irrelevant, because it's vfs that changes
file->f_pos.

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
	struct fd f = fdget(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);         <--- read f_pos
		ret = vfs_read(f.file, buf, count, &pos);   <--- return -EISDIR
		file_pos_write(f.file, pos);                <--- write f_pos
		fdput(f);
	}
	return ret;
}
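
A minimal sketch of why the load/store pair in the read() path quoted above
matters, even though the read itself fails on a directory. It replays one bad
interleaving on a single thread; struct mini_file and the helper names are
hypothetical, not kernel code.

```c
/* Hypothetical miniature of struct file: only the offset matters here. */
struct mini_file {
    long long f_pos;
};

/* Step 1 of read(): take a private copy of the offset. */
static long long pos_read(const struct mini_file *f)
{
    return f->f_pos;
}

/* Step 3 of read(): store the private copy back unconditionally. */
static void pos_write(struct mini_file *f, long long pos)
{
    f->f_pos = pos;
}

/*
 * Replay one interleaving: another task updates f_pos between
 * read()'s load and store, and that update is then overwritten.
 * Returns the final f_pos.
 */
static long long lost_update(struct mini_file *f, long long concurrent_pos)
{
    long long pos = pos_read(f);   /* read(): load f_pos                 */
    f->f_pos = concurrent_pos;     /* other task: readdir() or lseek()   */
    /* the failing read (-EISDIR) would leave pos untouched here */
    pos_write(f, pos);             /* read(): store the stale value back */
    return f->f_pos;
}
```

Starting from f_pos == 2 with a concurrent update to 7, the sketch ends with
f_pos == 2 again: the concurrent update is lost.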

>>
>> See my report:
>>
>> https://patchwork.kernel.org/patch/2160771/
> 
> Yes, I know there might be some mess after the commit ef3d0fd2
> (vfs: do (nearly) lockless generic_file_llseek).
> 
> Also looks it has been stated in Documentation/filesystems/Locking:
> 
> ->llseek() locking has moved from llseek to the individual llseek
> implementations.  If your fs is not using generic_file_llseek, you
> need to acquire and release the appropriate locks in your ->llseek().
> For many filesystems, it is probably safe to acquire the inode
> mutex or just to use i_size_read() instead.
> Note: this does not protect the file->f_pos against concurrent modifications
> since this is something the userspace has to take care about.
> 




Re: Tux3 Report: Initial fsck has landed

2013-03-21 Thread Daniel Phillips
Hi Dave,

Thank you for your insightful post. The answer to the riddle is that
the PHTree scheme as described in the link you cited has already
become "last gen" and that, after roughly ten years of searching, I am
cautiously optimistic that I have discovered a satisfactory next gen
indexing scheme with the properties I was seeking. This is what
Hirofumi and I have been busy prototyping and testing for the last few
weeks. More below...

On Thu, Mar 21, 2013 at 6:57 PM, Dave Chinner  wrote:
> On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
>> At the moment we're being pretty quiet because of being in the middle
>> of developing the next-gen directory index. Not such a small task, as
>> you might imagine.
>
> The "next-gen directory index" comment made me curious. I wanted to
> know if there's anything I could learn from what you are doing and
> whether anything of your new algorithms could be applied to, say,
> the XFS directory structure to improve it.
>
> I went looking for design docs and found this:
>
> http://phunq.net/pipermail/tux3/2013-January/001938.html
>
> In a word: Disappointment.

Me too. While I convinced myself that the PHTree scheme would scale
significantly better than HTree while being only modestly slower than
HTree in the smaller range (millions of files) even the new scheme
began hitting significant difficulties in the form of write
multiplication in the larger range (billions of files). Most probably,
you discovered the same thing. The problem is not so much about
randomly thrashing the index, because these days even a cheap desktop
can cache the entire index, but rather getting the index onto disk
with proper atomic update at reasonable intervals. We can't accept a
situation where crashing on the 999,999,999th file create requires the
entire index to be rebuilt, or even a significant portion of it. That
means we need ACID commit at normal intervals all the way through the
heavy create load, and unfortunately, that's where the write
multiplication issue rears its ugly head. It turned out that most of
the PHTree index blocks would end up being written to disk hundreds of
times each, effectively stretching out what should be a 10 minute test
to hours.

To solve this, I eventually came up with a secondary indexing scheme
that would kick in under heavy file create load, to take care of
committing enough state to disk at regular intervals that remounting
after a crash would only lose a few seconds of work. With this, PHTree
would satisfy all the performance goals we set out for it, which can
be summarized as: scale smoothly all the way from one file per
directory to one billion files per directory.

The only really distasteful element remaining was the little matter of
having two different directory indexes, the PHTree and the temporary
secondary index. That seems like one index too many. Then the big aha
landed out of the blue: we can actually throw away the primary BTree
and the secondary index will work fine all on its own. So the
secondary index was suddenly promoted to a "next gen" primary index.
This new index (which does not yet have a name) is based on hashing
and sharding and has nothing whatsoever to do with BTrees. It
currently exists in prototype with enough implementation in place to
get some early benchmarks, but we are not quite ready to provide those
details yet. You are completely welcome to call me a tease or be
sceptical, which is just what I would do coming from the other side,
but just now we're in the thick of the heavy work, and the key as we
see it is to keep on concentrating until the time is right. After all,
this amounts to the end of a ten year search that began around the
time HTree went into service in Ext3. Another couple of weeks hardly
seems worth worrying about.

> Compared to the XFS directory structure, the most striking
> architectural similarity that I see is this:
>
> "the file bteee[sic] effectively is a second directory index
> that imposes a stable ordering on directory blocks".
>
> That was the key architectural innovation in the XFS directory
> structure that allowed it to provide the correct seekdir/telldir/NFS
> readdir semantics and still scale. i.e. virtually mapped directory
> entries. I explained this layout recently here:
>
> http://marc.info/?l=linux-ext4&m=136081996316453&w=2
> http://marc.info/?l=linux-ext4&m=136082221117399&w=2
> http://marc.info/?l=linux-ext4&m=136089526928538&w=2
>
> We could swap the relevant portions of your PHTree design doc with
> my comments (and vice versa) and both sets of references would still
> make perfect sense. :P
>
> Further, the PHTree description of tag based freespace tracking is
> rather close to how XFS uses tags to track free space regions,
> including the fact that XFS can be lazy at updating global free
> space indexes.  The global freespace tree indexing is slightly
> different to the XFS method - it's closer to the original V1 dir
> code in XFS (that didn't scale at all well) 

Re: [CFT] Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 10:18 PM, Al Viro  wrote:
>>
>> This seems too simple, but I don't see why iget_locked() would be any
>> better. It just wastes time hashing the inode that we aren't really
>> interested in hashing. The inode is always filled by the caller
>> anyway, so we might as well just get a fresh pseudo-filesystem inode
>> without any crazy hashing..
>
> Umm...
> static int proc_delete_dentry(const struct dentry * dentry)
> {
> return 1;
> }
>
> static const struct dentry_operations proc_dentry_operations =
> {
> .d_delete   = proc_delete_dentry,
> };
>
> IOW, dcache retention in procfs is inexistent and the damn thing tries
> to cut down on the amount of inode allocation/freeing/filling.

Ok, that's kind of ugly, but shouldn't be a correctness issue. It
should still work - just cycle through inodes quite aggressively due to
no longer re-using them - so I suspect Dave could test it (with the
extra line removed I pointed out just a moment ago).

And I wonder how big of a deal the aggressive dentry deletion is.
Maybe it's even ok to allocate/free the inodes all the time. The whole
"get the inode hash lock and look it up there" can't be all that
wonderful either. It takes the inode->i_lock for each entry it finds
on the hash list, which looks horrible. I suspect our slab allocator
isn't much worse than that, although the RCU freeing of the inodes
could end up being problematic.

Linus


Re: [patch 02/34] arch: Consolidate tsk_is_polling()

2013-03-21 Thread Tony Breeds
On Thu, Mar 21, 2013 at 09:52:57PM -, Thomas Gleixner wrote:
 
> +/*
> + * Idle thread specific functions to determine the need_resched
> + * polling state. We have two versions, one based on TS_POLLING in
> + * thread_info.status and one based on TIF_POLLING_NRFLAG in
> + * thread_info.flags
> + */
> +#ifdef TS_POLLING
> +static inline int tsk_is_polling(struct task_struct *p)
> +{
> + return task_thread_info(p)->status & TS_POLLING;
> +}
> +#elif defined(TIF_POLLING_NRFLAG)
> +static inline int tsk_is_polling(struct task_struct *p)
> +{
> + test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
> +}

On powerpc (at least) this is used before it's declared.  Also I think
you're missing a 'return' in that function.
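
For reference, a minimal userspace sketch of the TIF_POLLING_NRFLAG variant
with the missing 'return' added. The struct task_struct layout, the bit number
and the test_tsk_thread_flag() body below are stand-ins invented for the
example; only the shape of tsk_is_polling() follows the quoted hunk.

```c
/* Bit number chosen for illustration only; the real value is per-arch. */
#define TIF_POLLING_NRFLAG 4

/* Hypothetical stand-in: only a flags word is modelled. */
struct task_struct {
    unsigned long flags;
};

/* Stand-in for the kernel helper: test one bit of the flags word. */
static inline int test_tsk_thread_flag(struct task_struct *p, int flag)
{
    return (int)((p->flags >> flag) & 1UL);
}

/* Same body as the quoted hunk, with the 'return' in place. */
static inline int tsk_is_polling(struct task_struct *p)
{
    return test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
}
```

Without the 'return', a non-void function falls off its end and the caller
reads an indeterminate value, which compilers typically flag with a warning.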

*cough* https://www.kernel.org/pub/tools/crosstool/ *cough*

Yours Tony




Re: [CFT] Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 9:55 PM, Linus Torvalds
 wrote:
>
> So why not just use "new_inode_pseudo()" instead? IOW, something like
> this (totally untested, natch) patch?

Ok, so I think that patch is still probably the right way to go, but
it's broken for a silly reason: because it's not using iget_locked()
any more, it also needs to remove the now unnecessary (and actively
incorrect) call to unlock_new_inode().

So remove that one line too, and maybe it could even work. Still not
*tested*, though.

  Linus


Re: [CFT] Re: VFS deadlock ?

2013-03-21 Thread Al Viro
On Thu, Mar 21, 2013 at 09:55:10PM -0700, Linus Torvalds wrote:
> However, I do wonder if we could take another approach... There's
> really no reason why we should look up an old inode with iget_locked()
> in proc_get_inode() and only fill it in if it is new. We might as well
> just always create a new inode. The "iget_locked()" logic really comes
> from the bad old days when the inode was the primary data structure,
> but it's really the dentry that is the important data structure, and
> the inode might as well follow the dentry, instead of the other way
> around.
> 
> So why not just use "new_inode_pseudo()" instead? IOW, something like
> this (totally untested, natch) patch? This way, if you have a new
> dentry, you are guaranteed to always have a new inode. None of the
> silly inode alias issues..
> 
> This seems too simple, but I don't see why iget_locked() would be any
> better. It just wastes time hashing the inode that we aren't really
> interested in hashing. The inode is always filled by the caller
> anyway, so we might as well just get a fresh pseudo-filesystem inode
> without any crazy hashing..

Umm...
static int proc_delete_dentry(const struct dentry * dentry)
{
return 1;
}

static const struct dentry_operations proc_dentry_operations =
{
.d_delete   = proc_delete_dentry,
};

IOW, dcache retention in procfs is inexistent and the damn thing tries
to cut down on the amount of inode allocation/freeing/filling.

I agree that we could get rid of icache retention there and everything
ought to keep working.  Hell knows...  It applies only to the stuff that
isn't per-process, so it's probably not particularly hot anyway, but it
does need profiling...  OTOH, we probably could mark "stable" dentries
in some way and let proc_delete_dentry() check that flag in proc_dir_entry -
I seriously suspect that really hot non-per-process ones are of the
"never become invalid" variety.


Re: [patch v5 14/15] sched: power aware load balance

2013-03-21 Thread Preeti U Murthy
Hi,

On 03/22/2013 07:00 AM, Alex Shi wrote:
> On 03/21/2013 06:27 PM, Preeti U Murthy wrote:
 did you close all of background system services?
 In theory the rq->avg.runnable_avg_sum should be zero if there is no
 task a bit long, otherwise there are some bugs in kernel.
>> Could you explain why rq->avg.runnable_avg_sum should be zero? What if
>> some kernel thread ran on this run queue and is now finished? Its
>> utilisation would be say x.How would that ever drop to 0,even if nothing
>> ran on it later?
> 
> the value get from decay_load():
>  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
> in decay_load it is possible to be set zero.

Yes, you are right, it is possible for it to be set to 0, but only after a
very long time: to be more precise, nearly 2 seconds. If you look at
decay_load(), the runnable_avg_sum becomes 0 only if the period between the
last update and now has crossed (32*63); otherwise it simply decays.

This means that for nearly 2 seconds, consolidation of loads may not be
possible even after the runqueues have finished executing the tasks running
on them.
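
The "nearly 2 seconds" figure can be checked with a small sketch. It is a
simplified model, not the kernel's decay_load(): it assumes one period is
1024 us and uses a fixed-point constant approximating y = 0.5^(1/32); both
are assumptions of the example, as are the names decay_load_sketch() and
cutoff_us().

```c
#include <stdint.h>

#define LOAD_AVG_PERIOD 32             /* periods per halving, as in the thread */
#define PERIOD_US       1024           /* assumed length of one period          */
#define Y_FIXED         0xfa83b2daULL  /* assumed: about 0.5^(1/32) * 2^32      */

/*
 * Simplified decay: returns roughly val * y^n, and 0 once n exceeds
 * 32 * 63 periods. Integer truncation applies at every step.
 */
static uint64_t decay_load_sketch(uint64_t val, uint64_t n)
{
    if (n > LOAD_AVG_PERIOD * 63)
        return 0;
    val >>= n / LOAD_AVG_PERIOD;          /* one halving per 32 periods */
    for (uint64_t i = 0; i < n % LOAD_AVG_PERIOD; i++)
        val = (val * Y_FIXED) >> 32;      /* one factor of y per period */
    return val;
}

/* Time, in microseconds, covered by the 32 * 63 period cutoff. */
static uint64_t cutoff_us(void)
{
    return (uint64_t)LOAD_AVG_PERIOD * 63 * PERIOD_US;
}
```

Under these assumptions the cutoff is 32 * 63 * 1024 us = 2,064,384 us, or
about 2.06 s. In this sketch, integer truncation drives small sums to zero
well before the cutoff, so it does not by itself reproduce the behaviour
reported in the trace.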

The exact experiment that I performed was running ebizzy with just two
threads. My setup was 2 sockets, 2 cores each, 4 threads per core, so a 16
logical cpu machine. When I begin running ebizzy with the balance policy, the
2 threads of ebizzy are found one on each socket, while I would expect
them to be on the same socket. All other cpus, except the ones running
ebizzy threads, are idle and not running anything on either socket.
I am not running any other processes.

You could run a similar experiment and let me know if you see otherwise.
I am at a loss to understand why else such a spreading of load would
occur, if not for the rq->util not becoming 0 quickly when it is not
running anything. I have used trace_printks to keep track of the runqueue
util of those runqueues not running anything after maybe some initial
load, and it does not become 0 till the end of the run.

Regards
Preeti U Murthy


> 
> and /proc/sched_debug also approve this:
> 
>   .tg_runnable_contrib   : 0
>   .tg->runnable_avg  : 50
>   .avg->runnable_avg_sum : 0
>   .avg->runnable_avg_period  : 47507
> 
> 



linux-next: Tree for Mar 22

2013-03-21 Thread Stephen Rothwell
Hi all,

Changes since 20130321:

New trees: mailbox, imx-mxs

Linus' tree still had its build failure for which I reverted a commit.

The sound-asoc tree lost its build failures.

The tty tree lost one of its build failures but still had the other so I
used the version from next-20130319.

The usb-gadget tree gained conflicts against the usb.current tree.

The imx-mxs tree gained a conflict against the arm-soc tree.

The akpm tree lost a patch that turned up elsewhere.



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" as mentioned in the FAQ on the wiki
(see below).

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log files
in the Next directory.  Between each merge, the tree was built with
a ppc64_defconfig for powerpc and an allmodconfig for x86_64. After the
final fixups (if any), it is also built with powerpc allnoconfig (32 and
64 bit), ppc44x_defconfig and allyesconfig (minus
CONFIG_PROFILE_ALL_BRANCHES - this fails its final link) and i386, sparc,
sparc64 and arm defconfig. These builds also have
CONFIG_ENABLE_WARN_DEPRECATED, CONFIG_ENABLE_MUST_CHECK and
CONFIG_DEBUG_INFO disabled when necessary.

Below is a summary of the state of the merge.

We are up to 222 trees (counting Linus' and 31 trees of patches pending
for Linus' tree), more are welcome (even if they are currently empty).
Thanks to those who have contributed, and to those who haven't, please do.

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

There is a wiki covering stuff to do with linux-next at
http://linux.f-seidel.de/linux-next/pmwiki/ .  Thanks to Frank Seidel.

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

$ git checkout master
$ git reset --hard stable
Merging origin/master (0a7e453 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux)
Merging fixes/master (9d3f252 Revert "KVM: allow host header to be included even for !CONFIG_KVM")
Merging kbuild-current/rc-fixes (6dbe51c Linux 3.9-rc1)
Merging arc-current/for-curr (367f3fc ARC: Fix the typo in event identifier flags used by ptrace)
Merging arm-current/fixes (112ccff Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging)
Merging m68k-current/for-linus (5618395 m68k: Sort out !CONFIG_MMU_SUN3 vs. CONFIG_HAS_DMA)
Merging powerpc-merge/merge (af81d78 powerpc: Rename USER_ESID_BITS* to ESID_BITS*)
Merging sparc/master (10b3866 Merge tag 'for-linus-v3.9-rc4' of git://oss.sgi.com/xfs/xfs)
Merging net/master (ae5fc98 net: fix *_DIAG_MAX constants)
Merging ipsec/master (85dfb74 af_key: initialize satype in key_notify_policy_flush())
Merging sound-current/for-linus (55a63d4 ALSA: hda - Fix DAC assignment for independent HP)
Merging pci-current/for-linus (249bfb8 PCI/PM: Clean up PME state when removing a device)
Merging wireless/master (36ef0b47 rtlwifi: usb: add missing freeing of skbuff)
Merging driver-core.current/driver-core-linus (e5110f4 sysfs: handle failure path correctly for readdir())
Merging tty.current/tty-linus (8b5c913 serial: 8250_pci: Add WCH CH352 quirk to avoid Xscale detection)
Merging usb.current/usb-linus (fc98ab8 USB: ti_usb_3410_5052: fix use-after-free in TIOCMIWAIT)
Merging staging.current/staging-linus (27ca039 staging: zcache: fix typo "64_BIT")
Merging char-misc.current/char-misc-linus (347e089 VMCI: Fix process-to-process DRGAMs.)
Merging input-current/for-linus (4b7d293 Input: mms114 - Fix regulator enable and disable paths)
Merging md-current/for-linus (238f590 md: remove CONFIG_MULTICORE_RAID456 entirely)
Merging audit-current/for-linus (c158a35 audit: no leading space in audit_log_d_path prefix)
Merging crypto-current/master (246bbed Revert "crypto: caam - add IPsec ESN support")
Merging ide/master (bf6b438 ide: gayle: use module_platform_driver_probe())
Merging dwmw2/master (63662139 params: Fix potential memory leak in add_sysfs_param())
Merging sh-current/sh-fixes-for-linus (4403310 SH: Convert out[bwl] macros to inline functions)
Merging irqdomain-current/irqdomain/merge (a0d271c Linux 3.6)
Merging devicetree-current/devicetree/merge (ab28698 of: define struct device in of_platform.h if !OF_DEVICE and !OF_ADDRESS)
Merging spi-current/spi/merge (d3601e5 

Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:
> On 03/21/2013 08:05 PM, Will Huck wrote:
>>
>> One offline question, how to understand this in function balance_pgdat:
>> /*
>>  * Do some background aging of the anon list, to give
>>  * pages a chance to be referenced before reclaiming.
>>  */
>> age_active_anon(zone, &sc);
>
> The anon lrus use a two-handed clock algorithm. New anonymous pages
> start off on the active anon list. Older anonymous pages get moved
> to the inactive anon list.

The file lrus also use the two-handed clock algorithm, correct?

> If they get referenced before they reach the end of the inactive anon
> list, they get moved back to the active list.
>
> If we need to swap something out and find a non-referenced page at the
> end of the inactive anon list, we will swap it out.
>
> In order to make good pageout decisions, pages need to stay on the
> inactive anon list for a longer time, so they have plenty of time to
> get referenced, before the reclaim code looks at them.
>
> To achieve that, we will move some active anon pages to the inactive
> anon list even when we do not want to swap anything out - as long as
> the inactive anon list is below its target size.
>
> Does that make sense?

Makes sense, thanks.
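
Rik's description above (demote from active to inactive while the inactive
list is below target, give referenced pages a second chance, evict
unreferenced pages from the inactive tail) can be sketched as a toy model.
All names and the array-based lists are hypothetical; the kernel uses linked
lists and hardware reference bits, and clearing the flag on demotion is a
simplification made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PAGES 8

/* Hypothetical page: an id and a "referenced" flag. */
struct page {
    int id;
    bool referenced;
};

/* Toy LRU list: index 0 is the tail (oldest entry). */
struct lru {
    struct page pages[MAX_PAGES];
    size_t n;
};

/* Append at the head; silently drops the page if the toy list is full. */
static void lru_add(struct lru *l, struct page p)
{
    if (l->n < MAX_PAGES)
        l->pages[l->n++] = p;
}

/* Remove and return the oldest page; the caller ensures l->n > 0. */
static struct page lru_pop_tail(struct lru *l)
{
    struct page p = l->pages[0];
    for (size_t i = 1; i < l->n; i++)
        l->pages[i - 1] = l->pages[i];
    l->n--;
    return p;
}

/* Background aging: demote old active pages while inactive is below target. */
static void age_active(struct lru *active, struct lru *inactive, size_t target)
{
    while (inactive->n < target && active->n > 0) {
        struct page p = lru_pop_tail(active);
        p.referenced = false;   /* start a fresh observation window */
        lru_add(inactive, p);
    }
}

/* Reclaim: returns the id of the evicted page, or -1 if none was found. */
static int shrink_inactive(struct lru *active, struct lru *inactive)
{
    while (inactive->n > 0) {
        struct page p = lru_pop_tail(inactive);
        if (!p.referenced)
            return p.id;        /* unreferenced at the tail: evict it */
        lru_add(active, p);     /* referenced while inactive: promote */
    }
    return -1;
}
```

The longer a page sits on the inactive list before shrink_inactive() reaches
it, the more chances it has to be marked referenced, which is why the aging
step runs even when nothing needs to be swapped out.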



Re: [patch 02/34] arch: Consolidate tsk_is_polling()

2013-03-21 Thread Tony Breeds
On Thu, Mar 21, 2013 at 09:52:57PM -, Thomas Gleixner wrote:

> Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
> ===
> --- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
> +++ linux-2.6/arch/powerpc/include/asm/thread_info.h
> @@ -182,10 +182,6 @@ static inline bool test_thread_local_fla
>  #define is_32bit_task()  (1)
>  #endif
>  
> -#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
> -
> -#endif   /* !__ASSEMBLY__ */
> -

I think taking out this #endif is wrong.  It is probably wrong in the other
arches as well.
 

Yours Tony




Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Rik,
On 03/22/2013 11:56 AM, Will Huck wrote:
> Hi Rik,
> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>>
>>> One offline question, how to understand this in function balance_pgdat:
>>> /*
>>>  * Do some background aging of the anon list, to give
>>>  * pages a chance to be referenced before reclaiming.
>>>  */
>>> age_active_anon(zone, &sc);
>>
>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>> start off on the active anon list. Older anonymous pages get moved
>> to the inactive anon list.
>
> The file lrus also use the two-handed clock algorithm, correct?

After reinvestigating the code, the answer is no. But why is there this
difference? I think you are the expert on this question, so I look forward
to your explanation. :-)

>> If they get referenced before they reach the end of the inactive anon
>> list, they get moved back to the active list.
>>
>> If we need to swap something out and find a non-referenced page at the
>> end of the inactive anon list, we will swap it out.
>>
>> In order to make good pageout decisions, pages need to stay on the
>> inactive anon list for a longer time, so they have plenty of time to
>> get referenced, before the reclaim code looks at them.
>>
>> To achieve that, we will move some active anon pages to the inactive
>> anon list even when we do not want to swap anything out - as long as
>> the inactive anon list is below its target size.
>>
>> Does that make sense?
>
> Makes sense, thanks.




Re: [PATCH] nohz1: Documentation

2013-03-21 Thread Rob Landley

On 03/21/2013 10:45:07 AM, Arjan van de Ven wrote:
> On 3/20/2013 5:27 PM, Steven Rostedt wrote:
>> I'm not sure I would recommend idle=poll either. It would certainly
>> work, but it goes to the other extreme. You think NO_HZ=n drains a
>> battery? Try idle=poll.
>
> do not ever use idle=poll on anything production.. really bad idea.
>
> if you temporarily cannot cope with the latency, you can use the PMQOS system
> to limit (including going all the way to idle=poll).
> but using idle=poll completely is very nasty for the hardware.
>
> In addition we should document that idle=poll will cost you peak performance,
> possibly quite a bit.

Where should that be documented?

> the same is true for the kernel parameter to some extent; it's there to work around
> broken bioses/hardware/etc; if you have a latency/runtime requirement, it's much better
> to use PMQOS for this from userspace.


I googled and found http://elinux.org/images/f/f9/Elc2008_pm_qos_slides.pdf
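
As a rough illustration of "use PMQOS for this from userspace": the kernel's
PM QoS interface exposes a device node, /dev/cpu_dma_latency, that accepts a
32-bit latency bound in microseconds and honours it for as long as the file
descriptor stays open. The helper below is a sketch only; its name and its
path parameter are inventions for the example (the path is a parameter so
the function can be exercised without the device).

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Request a CPU wakeup-latency ceiling by writing a 32-bit microsecond
 * value to a PM QoS device node. The request is dropped when the
 * descriptor is closed, so the caller must keep it open.
 * Returns the open descriptor, or -1 on failure.
 */
static int request_cpu_latency(const char *path, int32_t max_latency_us)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    if (write(fd, &max_latency_us, sizeof(max_latency_us))
            != (ssize_t)sizeof(max_latency_us)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A latency-sensitive program would call request_cpu_latency("/dev/cpu_dma_latency", 20)
before its critical phase and close() the descriptor afterwards, so the
constraint applies only while it is needed. Opening the node usually requires
elevated privileges.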


Rob


Re: [CFT] Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 9:37 PM, Al Viro  wrote:
>
> FWIW, a relatively crude solution is this:
>
> -   d_add(dentry, inode);
> -   return NULL;
> +   return d_materialise_unique(dentry, inode);
>
> It *is* crude, but it restores the assert, killing the deadlock and lets
> everything work more or less as it used to.  The case where things start
> to look odd is this:
>
> root@kvm-amd64:~# cd /proc/1/net/stat/; ls /proc/2/net/stat; /bin/pwd
> arp_cache  ndisc_cache  rt_cache
> /proc/2/net/stat
>
> IOW, if we were about to create a directory alias, the old dentry gets moved
> in new place.

Ugh, no, this is worse than the bug we're trying to fix.

However, I do wonder if we could take another approach... There's
really no reason why we should look up an old inode with iget_locked()
in proc_get_inode() and only fill it in if it is new. We might as well
just always create a new inode. The "iget_locked()" logic really comes
from the bad old days when the inode was the primary data structure,
but it's really the dentry that is the important data structure, and
the inode might as well follow the dentry, instead of the other way
around.

So why not just use "new_inode_pseudo()" instead? IOW, something like
this (totally untested, natch) patch? This way, if you have a new
dentry, you are guaranteed to always have a new inode. None of the
silly inode alias issues..

This seems too simple, but I don't see why iget_locked() would be any
better. It just wastes time hashing the inode that we aren't really
interested in hashing. The inode is always filled by the caller
anyway, so we might as well just get a fresh pseudo-filesystem inode
without any crazy hashing..
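
The difference between a hashed lookup and always handing out a fresh inode
can be shown with a toy model. Nothing here is kernel code: struct mini_inode,
get_hashed() and get_fresh() are hypothetical, and the table assumes that
distinct inode numbers never collide.

```c
#include <stdlib.h>

#define SLOTS 16

/* Hypothetical miniature inode; not the kernel structure. */
struct mini_inode {
    unsigned long ino;
    int in_use;
};

static struct mini_inode table[SLOTS];

/* Hashed lookup: the same number always yields the same object. */
static struct mini_inode *get_hashed(unsigned long ino)
{
    struct mini_inode *i = &table[ino % SLOTS];

    if (!i->in_use) {           /* first lookup fills the slot */
        i->ino = ino;
        i->in_use = 1;
    }
    return i;                   /* later lookups alias the same object */
}

/* Fresh allocation: every caller gets its own object; caller frees it. */
static struct mini_inode *get_fresh(unsigned long ino)
{
    struct mini_inode *i = calloc(1, sizeof(*i));

    if (i) {
        i->ino = ino;
        i->in_use = 1;
    }
    return i;
}
```

Two lookups through get_hashed(5) return the same pointer, so two dentries
would share one inode. Two calls to get_fresh(5) return distinct objects, so a
new dentry always gets a new inode, at the cost of an allocation per lookup.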

 Linus


patch.diff
Description: Binary data


[CFT] Re: VFS deadlock ?

2013-03-21 Thread Al Viro
On Fri, Mar 22, 2013 at 01:40:37AM +, Al Viro wrote:

> Yeah, I went to do such patch after sending the previous mail and noticed
> that we already did it that way.  Simplicity of error recovery was probably
> more important consideration there - I honestly don't remember the reasoning
> in such details; it had been a decade or so...  So lock_rename() doing
> ->d_inode comparison (with dire comment re not expecting that to be sufficient
> for anything other than this bug in procfs) will probably suffice for fs/namei.c
> part of it; I'm still looking at dcache.c side of things...

FWIW, a relatively crude solution is this:

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 4b3b3ff..778cbac 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -416,8 +416,7 @@ struct dentry *proc_lookup_de(struct proc_dir_entry *de, struct inode *dir,
 			if (!inode)
 				return ERR_PTR(-ENOMEM);
 			d_set_d_op(dentry, &proc_dentry_operations);
-			d_add(dentry, inode);
-			return NULL;
+			return d_materialise_unique(dentry, inode);
 		}
 	}
 	spin_unlock(&proc_subdir_lock);

It *is* crude, but it restores the assert, killing the deadlock and lets
everything work more or less as it used to.  The case where things start
to look odd is this:

root@kvm-amd64:~# cd /proc/1/net/stat/; ls /proc/2/net/stat; /bin/pwd
arp_cache  ndisc_cache  rt_cache
/proc/2/net/stat

IOW, if we were about to create a directory alias, the old dentry gets moved
in new place.  OTOH, I think it's the most robust backportable variant we
can do.  And yes, that should apply at least all the way back to 2.6.25 when
Eric acked a patch from Pavel that really should've been nacked...

Folks, could you test that one and see if any real userland breaks on that?
If everything works, I'd propose that one for -stable.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] powerpc: drop even more unused Kconfig symbols

2013-03-21 Thread Michael Ellerman
On Thu, Mar 21, 2013 at 12:10:06PM +0100, Paul Bolle wrote:
> When I submitted commit 6805ab6daa2b589fe3242d05ddc47a9dbb0c4eb1
> ("powerpc: drop unused Kconfig symbols") I apparently failed to notice
> that my patch also made PREP_RESIDUAL and PPC_A2_DD2 unused. Drop these
> now.
> 
> Signed-off-by: Paul Bolle 
> ---
> 0) Untested.
> 
> 1) I investigated these Kconfig files a bit and discovered that PPC_PREP
> is marked BROKEN since v2.6.15, see commit
> 5be396b00ca0f2f769c55cf69bbd7c77451c925e ("powerpc: Mark PREP and
> embedded as broken for now"). Though it's not my problem, this does
> suggest PReP support can be removed entirely.

It does, and the best code is deleted code.

Care to send a patch?

cheers


BUG at kmem_cache_alloc

2013-03-21 Thread CAI Qian
Starting to see these on the 3.8.4 stable kernel (never saw them on 3.8.2)
on a few systems during LTP runs:

[11297.597242] BUG: unable to handle kernel paging request at fffe 
[11297.598022] IP: [] kmem_cache_alloc+0x68/0x1e0 
[11297.598022] PGD 7b9eb067 PUD 0  
[11297.598022] Oops:  [#2] SMP  
[11297.598022] Modules linked in: cmtp kernelcapi bnep scsi_transport_iscsi 
rfcomm l2tp_ppp l2tp_netlink l2tp_core hidp ipt_ULOG af_key nfc rds pppoe pppox 
ppp_generic slhc af_802154 atm ip6table_filter ip6_tables iptable_filter 
ip_tables btrfs zlib_deflate vfat fat nfs_layout_nfsv41_files nfsv4 auth_rpcgss 
nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache nfnetlink_log nfnetlink bluetooth 
rfkill arc4 md4 nls_utf8 cifs dns_resolver nf_tproxy_core nls_koi8_u nls_cp932 
ts_kmp sctp sg kvm_amd kvm virtio_balloon i2c_piix4 pcspkr xfs libcrc32c 
ata_generic pata_acpi cirrus drm_kms_helper ttm ata_piix virtio_net drm libata 
virtio_blk i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod [last 
unloaded: ipt_REJECT] 
[11297.598022] CPU 1  
[11297.598022] Pid: 14134, comm: ltp-pan Tainted: G  D  3.8.4+ #1 Bochs 
Bochs 
[11297.598022] RIP: 0010:[]  [] kmem_cache_alloc+0x68/0x1e0 
[11297.598022] RSP: 0018:8800447dbdd0  EFLAGS: 00010246 
[11297.598022] RAX:  RBX: 88007c169970 RCX: 
018acdcd 
[11297.598022] RDX: 0006c104 RSI: 80d0 RDI: 
88007d04ac00 
[11297.598022] RBP: 8800447dbe10 R08: 00017620 R09: 
810fe2e2 
[11297.598022] R10:  R11:  R12: 
fffe 
[11297.598022] R13: 80d0 R14: 88007d04ac00 R15: 
88007d04ac00 
[11297.598022] FS:  7f09c29b4740() GS:88007fd0() 
knlGS:f74d86c0 
[11297.598022] CS:  0010 DS:  ES:  CR0: 8005003b 
[11297.598022] CR2: fffe CR3: 37213000 CR4: 
06e0 
[11297.598022] DR0:  DR1:  DR2: 
 
[11297.598022] DR3:  DR6: 0ff0 DR7: 
0400 
[11297.598022] Process ltp-pan (pid: 14134, threadinfo 8800447da000, task 
8800551ab2e0) 
[11297.598022] Stack: 
[11297.598022]  810fe2e2 8108cf0f 01200011 
88007c169970 
[11297.598022]   7f09c29b4a10  
88007c169970 
[11297.598022]  8800447dbe30 810fe2e2  
01200011 
[11297.598022] Call Trace: 
[11297.598022]  [] ? __delayacct_tsk_init+0x22/0x40 
[11297.598022]  [] ? prepare_creds+0xdf/0x190 
[11297.598022]  [] __delayacct_tsk_init+0x22/0x40 
[11297.598022]  [] copy_process.part.25+0x31f/0x13f0 
[11297.598022]  [] do_fork+0xa9/0x350 
[11297.598022]  [] sys_clone+0x16/0x20 
[11297.598022]  [] stub_clone+0x69/0x90 
[11297.598022]  [] ? system_call_fastpath+0x16/0x1b 
[11297.598022] Code: 90 4d 89 fe 4d 8b 06 65 4c 03 04 25 c8 db 00 00 49 8b 50 
08 4d 8b 20 4d 85 e4 0f 84 2b 01 00 00 49 63 46 20 4d 8b 06 41 f6 c0 0f <49> 8b 
1c 04 0f 85 55 01 00 00 48 8d 4a 01 4c 89 e0 65 49 0f c7  
[11297.598022] RIP  [] kmem_cache_alloc+0x68/0x1e0 
[11297.598022]  RSP  
[11297.598022] CR2: fffe 
[11297.727799] ---[ end trace 037bde72f23b34d2 ]---

I never saw this in mainline, only something like the following, which I
wonder could be related (kmem_cache_alloc was also in that trace).

[12124.201919] INFO: task kworker/2:1:166 blocked for more than 120 seconds. 
[12124.242758] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message. 
[12124.289801] kworker/2:1 D 88081fc54440 0   166  2 0x 
[12124.330784]  88081361ba68 0046 880813568000 
88081361bfd8 
[12124.373694]  88081361bfd8 88081361bfd8 8808144fb2e0 
880813568000 
[12124.416896]   880813568000 8808133f8930 
0002 
[12124.458674] Call Trace: 
[12124.473291]  [] schedule+0x29/0x70 
[12124.502143]  [] rwsem_down_failed_common+0xda/0x230 
[12124.539311]  [] rwsem_down_write_failed+0x13/0x20 
[12124.575585]  [] call_rwsem_down_write_failed+0x13/0x20 
[12124.614129]  [] ? down_write+0x32/0x40 
[12124.644703]  [] xlog_cil_push+0x89/0x3c0 [xfs] 
[12124.680046]  [] ? up+0x32/0x50 
[12124.706083]  [] ? flush_work+0x113/0x170 
[12124.738078]  [] xlog_cil_force_lsn+0xf7/0x160 [xfs] 
[12124.776062]  [] ? xfs_trans_free_items+0x88/0xb0 [xfs] 
[12124.814503]  [] _xfs_log_force_lsn+0x5a/0x2e0 [xfs] 
[12124.851512]  [] xfs_trans_commit+0x263/0x270 [xfs] 
[12124.887996]  [] xfs_fs_log_dummy+0x61/0x90 [xfs] 
[12124.924015]  [] ? xfs_log_need_covered+0x93/0xc0 [xfs] 
[12124.963079]  [] xfs_log_worker+0x48/0x50 [xfs] 
[12124.997404]  [] process_one_work+0x174/0x3d0 
[12125.031408]  [] worker_thread+0x10f/0x390 
[12125.062936]  [] ? busy_worker_rebind_fn+0xb0/0xb0 
[12125.098924]  [] kthread+0xc0/0xd0 
[12125.126124]  [] ? kthread_create_on_node+0x120/0x120 
[12125.162995]  [] ret_from_fork+0x7c/0xb0 
[12125.193516]  [] ? 

Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue

2013-03-21 Thread Arve Hjønnevåg
On Thu, Mar 21, 2013 at 8:24 PM, Eric Wong  wrote:
> Arve Hjønnevåg  wrote:
>> On Thu, Mar 21, 2013 at 4:52 AM, Eric Wong  wrote:
>> > Changes since v2:
>> > * epi->state is no longer atomic, we only cmpxchg in ep_poll_callback
>> >   now and rely on implicit barriers in other places for reading.
>> > * intermediate EP_STATE_DEQUEUE removed, this (with xchg) caused too
>> >   much overhead in the ep_send_events loop and could not eliminate
>> >   starvation dangers from improper EPOLLET usage (the original code
>> >   had this problem, too, the window is just a few cycles larger, now).
>> > * minor code cleanups
>
>> > /*
>> >  * Activate ep->ws before deactivating epi->ws to prevent
>>
>> Does anything deactivate ep->ws now?
>
> Oops, I left that out when I killed ep_scan_ready_list.
> But I think we need a different approach to wakeup sources in
> this series...
>
>> > +   /*
>> > +* reset item state for EPOLLONESHOT and EPOLLET
>> > +* no barrier here, rely on ep->mtx release for write barrier
>> > +*/
>>
>> What happens if ep_poll_callback runs before you set epi->state below?
>> It used to queue on ep->ovflist and call __pm_stay_awake on ep->ws,
>> but now it does not appear to do anything.
>>
>> > +   epi->state = EP_STATE_IDLE;
>> > }
>> >
>> > return eventcnt;
>> >  }
>> >
>
> With EPOLLET and improper usage (not hitting EAGAIN), the event now
> has a larger window to be lost (as mentioned in my changelog).
>

What about the case where EPOLLET is not set? The old code did not
drop events in that case.

> As far as correct __pm_stay_awake/__pm_relax handling, perhaps adding
> an atomic counter to struct eventpoll (or each epitem) will work?
>

The wakeup_source should stay in sync with the epoll state. I don't
think any additional state is needed.

> If we go with atomic counter in struct eventpoll, is per-epitem
> wakeup_source still necessary?  We have space in epitem now, but
> maybe one day we might need it.
>

The wakeup_source per epitem is useful for accounting reasons. If
suspend fails, it is useful to know which device caused it.

> Thanks for looking at this patch.
>
> Btw, I'm curious; which applications use EPOLLWAKEUP?
>
> My epoll work is focused on network servers with thousands of clients,
> and I don't think any of them use (or have use for) EPOLLWAKEUP.
> But I will keep EPOLLWAKEUP users in mind when working on epoll :)

EPOLLWAKEUP is only needed on systems that use suspend. I don't know
if it is currently in use, but it is intended to at least replace the
evdev wakelock in the android kernel, but user-space needs to be
updated before we can drop that patch.

-- 
Arve Hjønnevåg


linux-next: manual merge of the imx-mxs tree with the arm-soc tree

2013-03-21 Thread Stephen Rothwell
Hi Shawn,

Today's linux-next merge of the imx-mxs tree got a conflict in
arch/arm/mach-imx/mach-imx6q.c between commit da4a686a2cfb ("ARM:
smp_twd: convert to use CLKSRC_OF init") from the arm-soc tree and commit
371287432334 ("ARM: imx: enable anatop suspend/resume") from the imx-mxs
tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

diff --cc arch/arm/mach-imx/mach-imx6q.c
index b59ddcb,31aee4d..000
--- a/arch/arm/mach-imx/mach-imx6q.c
+++ b/arch/arm/mach-imx/mach-imx6q.c
@@@ -291,8 -257,7 +256,7 @@@ static void __init imx6q_init_irq(void
  static void __init imx6q_timer_init(void)
  {
mx6q_clocks_init();
 -  twd_local_timer_of_register();
 +  clocksource_of_init();
-   imx_print_silicon_rev("i.MX6Q", imx6q_revision());
  }
  
  static const char *imx6q_dt_compat[] __initdata = {




Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Rik van Riel

On 03/21/2013 08:05 PM, Will Huck wrote:


One offline question, how to understand this in function balance_pgdat:
/*
  * Do some background aging of the anon list, to give
  * pages a chance to be referenced before reclaiming.
  */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.

If they get referenced before they reach the end of the inactive anon
list, they get moved back to the active list.

If we need to swap something out and find a non-referenced page at the
end of the inactive anon list, we will swap it out.

In order to make good pageout decisions, pages need to stay on the
inactive anon list for a longer time, so they have plenty of time to
get referenced, before the reclaim code looks at them.

To achieve that, we will move some active anon pages to the inactive
anon list even when we do not want to swap anything out - as long as
the inactive anon list is below its target size.

Does that make sense?
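
The aging scheme described above can be sketched as a toy userspace model. This is only an illustration, not the kernel's vmscan code: struct toy_page, struct toy_list and the toy_* helpers are made-up names, the lists are small arrays, and clearing the referenced flag on deactivation is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PAGES 16

struct toy_page {
    int id;
    bool referenced;                    /* set when the page is touched */
};

struct toy_list {
    struct toy_page pages[MAX_PAGES];   /* pages[0] is the head (newest) */
    int len;
};

static void list_push_head(struct toy_list *l, struct toy_page p)
{
    assert(l->len < MAX_PAGES);
    memmove(&l->pages[1], &l->pages[0], (size_t)l->len * sizeof(p));
    l->pages[0] = p;
    l->len++;
}

static struct toy_page list_pop_tail(struct toy_list *l)
{
    assert(l->len > 0);
    return l->pages[--l->len];
}

/*
 * Background aging: while the inactive list is below its target size,
 * move the oldest active pages onto it, even if nothing is swapped out.
 * Returns the number of pages moved.
 */
static int toy_age_active(struct toy_list *active, struct toy_list *inactive,
                          int inactive_target)
{
    int moved = 0;

    while (inactive->len < inactive_target && active->len > 0) {
        struct toy_page p = list_pop_tail(active);

        p.referenced = false;           /* start a fresh observation window */
        list_push_head(inactive, p);
        moved++;
    }
    return moved;
}

/*
 * Reclaim: scan from the tail of the inactive list. Referenced pages go
 * back to the active list; the first non-referenced page is "swapped out".
 * Returns the id of the evicted page, or -1 if nothing could be evicted.
 */
static int toy_reclaim_one(struct toy_list *active, struct toy_list *inactive)
{
    while (inactive->len > 0) {
        struct toy_page p = list_pop_tail(inactive);

        if (p.referenced) {
            p.referenced = false;
            list_push_head(active, p);  /* second chance */
            continue;
        }
        return p.id;
    }
    return -1;
}
```

In this model, the longer a page sits on the inactive list before toy_reclaim_one() reaches it, the more chances it has to be marked referenced, which is the point of refilling the inactive list ahead of time.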

--
All rights reversed


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Rik van Riel

On 03/21/2013 09:23 PM, Linus Torvalds wrote:

On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso  wrote:


ipc lock contention:
100 users:   8.74% (vanilla)    3.17% (v3 patchset)
400 users:  21.86% (vanilla)    5.23% (v3 patchset)
800 users:  84.35% (vanilla)    7.39% (v3 patchset)


Ok, I'd call that pretty much "solved". Sure, it's still visible, but
for being a benchmark that apparently does little else than pound on
those sysv semaphores, I think we can consider it pretty much fine.
I'm going to assume that anybody who actually then does any real work
(ie a database) is never going to see even close to this bad
contention.

Good job, Rik. I'm assuming we'll be merging this during the 3.10
merge window, and hopefully the merge conflicts will be sorted out
too. Rik, Peter, can you look at each other's patches and see if you
can get that sorted out for Andrew?


Will do.

I will rebase this series on top of what is in linux-next.

--
All rights reversed.


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Rik van Riel

On 03/21/2013 06:01 PM, Andrew Morton wrote:

On Thu, 21 Mar 2013 17:50:05 -0400 Peter Hurley wrote:


On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:

On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel  wrote:


This series makes the sysv semaphore code more scalable,
by reducing the time the semaphore lock is held, and making
the locking more scalable for semaphore arrays with multiple
semaphores.

The first four patches were written by Davidlohr Bueso, and
reduce the hold time of the semaphore lock.

The last three patches change the sysv semaphore code locking
to be more fine grained, providing a performance boost when
multiple semaphores in a semaphore array are being manipulated
simultaneously.


These patches conflict pretty badly with Peter's:


On one point I'm a little confused: my series has been in linux-next for
a while. On what tree is this series based?


It'll be based on mainline.  People often forget to peek into
linux-next when preparing patches.  In the great majority of cases
that's OK.  Occasionally, we lose...


I'll be happy to rebase the series onto linux-next, to make
merging easier for you.

--
All rights reversed.


Re: [PATCH 3/3] rbtree_test: add more rbtree integrity checks

2013-03-21 Thread Michel Lespinasse
On Mon, Mar 18, 2013 at 4:21 PM, Davidlohr Bueso  wrote:
> When checking the rbtree, account for more properties:
>
>- Both children of a red node are black.
>- The tree has at least 2**bh(v)-1 internal nodes.

> -   WARN_ON_ONCE(is_red(rb) &&
> -(!rb_parent(rb) || is_red(rb_parent(rb))));
> +
> +   if (is_red(rb)) {
> +   /*
> +* root must be black and no path contains two
> +* consecutive red nodes.
> +*/
> +   WARN_ON_ONCE(!rb_parent(rb) || is_red(rb_parent(rb)));
> +
> +   /* both children of a red node are black */
> +   WARN_ON_ONCE(is_red(rb->rb_left) || is_red(rb->rb_right));
> +   }

This seems quite redundant with the previous test - if we're going to
visit each child, then at that point we're going to check that it
can't be red if its parent (the current node) is red. So I don't
see that the test adds any coverage.

> WARN_ON_ONCE(count != nr_nodes);
> +   WARN_ON_ONCE(count < (1 << black_path_count(rb_last())) - 1);

I like this last check - it can also be seen as a consequence of the
others, but it's only one line and it nicely sums up what the other
properties are for :)
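
To make the node-count bound concrete, here is a toy userspace sketch. It is not the kernel's rbtree_test code: struct toy_node and the toy_* helpers are invented for illustration. It checks the "no red node under a red parent" rule from the child's side only, computes the black height, and then applies the count >= 2**bh - 1 bound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_node {
    bool red;
    struct toy_node *left, *right;
};

/*
 * Returns the black height of the subtree (black nodes on any path down
 * to a NULL leaf, counting this node if it is black), or -1 if a red node
 * has a red parent or the two sides disagree on black height.
 */
static int toy_black_height(const struct toy_node *n, bool parent_red)
{
    int l, r;

    if (!n)
        return 0;
    if (n->red && parent_red)
        return -1;                      /* two consecutive red nodes */

    l = toy_black_height(n->left, n->red);
    r = toy_black_height(n->right, n->red);
    if (l < 0 || r < 0 || l != r)
        return -1;
    return l + (n->red ? 0 : 1);
}

static int toy_count(const struct toy_node *n)
{
    return n ? 1 + toy_count(n->left) + toy_count(n->right) : 0;
}

static bool toy_rb_valid(const struct toy_node *root)
{
    int bh;

    if (root && root->red)
        return false;                   /* root must be black */

    bh = toy_black_height(root, false);
    if (bh < 0)
        return false;

    assert(bh < 31);                    /* keep the shift below well defined */
    return toy_count(root) >= (1 << bh) - 1;
}
```

Because every node is visited and compared against its parent, a separate "both children of a red node are black" test would flag exactly the same trees, which is the redundancy pointed out above.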

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


[PATCH] KVM: fix powerpc build error for !CONFIG_KVM

2013-03-21 Thread Stephen Rothwell
Fixes these build errors when CONFIG_KVM is not defined:

In file included from arch/powerpc/include/asm/kvm_ppc.h:33:0,
 from arch/powerpc/kernel/setup_64.c:67:
arch/powerpc/include/asm/kvm_book3s.h:65:20: error: field 'pte' has incomplete type
arch/powerpc/include/asm/kvm_book3s.h:69:18: error: field 'vcpu' has incomplete type
arch/powerpc/include/asm/kvm_book3s.h:98:34: error: 'HPTEG_HASH_NUM_PTE' undeclared here (not in a function)
arch/powerpc/include/asm/kvm_book3s.h:99:39: error: 'HPTEG_HASH_NUM_PTE_LONG' undeclared here (not in a function)
arch/powerpc/include/asm/kvm_book3s.h:100:35: error: 'HPTEG_HASH_NUM_VPTE' undeclared here (not in a function)
arch/powerpc/include/asm/kvm_book3s.h:101:40: error: 'HPTEG_HASH_NUM_VPTE_LONG' undeclared here (not in a function)
arch/powerpc/include/asm/kvm_book3s.h:129:4: error: 'struct kvm_run' declared inside parameter list [-Werror]
arch/powerpc/include/asm/kvm_book3s.h:129:4: error: its scope is only this definition or declaration, which is probably not what you want [-Werror]

... and so on ...

This was introduced by commit f445f11eb2cc265dd47da5b2e864df46cd6e5a82
"KVM: allow host header to be included even for !CONFIG_KVM"

Cc: Kevin Hilman 
Cc: Marcelo Tosatti 
Signed-off-by: Stephen Rothwell 
---
 include/linux/kvm_host.h | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

Kevin, does this still fix the error that commit
f445f11eb2cc265dd47da5b2e864df46cd6e5a82 was fixing?

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a942863..90ebec0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1,8 +1,6 @@
 #ifndef __KVM_HOST_H
 #define __KVM_HOST_H
 
-#if IS_ENABLED(CONFIG_KVM)
-
 /*
  * This work is licensed under the terms of the GNU GPL, version 2.  See
  * the COPYING file in the top-level directory.
@@ -751,6 +749,7 @@ static inline int kvm_deassign_device(struct kvm *kvm,
 }
 #endif /* CONFIG_IOMMU_API */
 
+#if IS_ENABLED(CONFIG_KVM)
 static inline void __guest_enter(void)
 {
/*
@@ -770,6 +769,10 @@ static inline void __guest_exit(void)
vtime_account_system(current);
current->flags &= ~PF_VCPU;
 }
+#else
+static inline void __guest_enter(void) { return; }
+static inline void __guest_exit(void) { return; }
+#endif /* IS_ENABLED(CONFIG_KVM) */
 
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void guest_enter(void);
@@ -1057,8 +1060,4 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 }
 
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
-#else
-static inline void __guest_enter(void) { return; }
-static inline void __guest_exit(void) { return; }
-#endif /* IS_ENABLED(CONFIG_KVM) */
 #endif
-- 
1.8.1

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au




Re: PREEMPT_RT vs 'hrtimer: Prevent hrtimer_enqueue_reprogram race'

2013-03-21 Thread Steven Rostedt
On Fri, 2013-03-22 at 03:24 +, Ben Hutchings wrote:

> > Note, I posted a fix on Tuesday:
> > 
> > https://lkml.org/lkml/2013/3/19/369
> 
> Thanks.  I did search GMANE with some obvious terms but I think its
> index is lagging.

It didn't help that my subject had no mention of -rt in it :-(

> I'm rebasing the rt patch series generated with
> 'git format-patch v3.2.39..v3.2.39-rt59-rebase' on top of v3.2.41 (plus
> Debian changes, which introduce some trivial textual conflicts).

Ah, OK, I do it the opposite way. As the stable rt git never rebases, I
always merge the latest stable into my tree. I then create a rebased
tree as well. When I do that, I'm sure I'll end up fixing the patches up
pretty much the same as you have.

Thanks,

-- Steve





Re: [PATCH 2/2] serial: of_serial: Handle fifosize property

2013-03-21 Thread Ley Foon Tan
On Thu, 2013-03-21 at 15:24 +0200, Heikki Krogerus wrote:
> Hi,
> 
> On Thu, Mar 21, 2013 at 07:41:39PM +0800, Ley Foon Tan wrote:
> > On Thu, 2013-03-21 at 12:48 +0200, Heikki Krogerus wrote:
> > > + /* Check for fifo size */
> > > + if (of_property_read_u32(np, "fifosize", &prop) == 0)
> > > + port->fifosize = prop;
> > > +
> > Suggest to use "fifo-size" for the device tree property, to align with
> > other DT properties.
> 
> I was going to, but then I noticed that in some .dtsi files "fifosize"
> is used with uarts. Should I still change it?
I just did a grep for "fifosize" in arch/. It is used by other serial
drivers (not of_serial.c). So, you are safe to change it to "fifo-size".

> 
> > >   port->irq = irq_of_parse_and_map(np, 0);
> > >   port->iotype = UPIO_MEM;
> > >   if (of_property_read_u32(np, "reg-io-width", &prop) == 0) {
> > 
> > I think you need to remove the UPF_FIXED_TYPE from port->flags as well to
> > use the fifo size from device tree. Otherwise, it will get from the
> > static array in 8250.c.
> 
> No, it's the other way around. It is picked from the array
> conditionally, only in case it was not already set. However, if
> UPF_FIXED_TYPE is removed then autoconfig() will override it.
> 
> Thanks,
> 
Okay, I get what you mean now. I think someone updated 8250.c
recently; previously it was always taken from the static array.
Thanks.
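
The "only in case it was not already set" behaviour discussed above can be sketched in plain C. This is a toy model and not the 8250 driver: the toy_* types, the table contents and the FIFO sizes are placeholders chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum toy_port_type {
    TOY_PORT_UNKNOWN,
    TOY_PORT_16550A,
    TOY_PORT_BIG_FIFO,
    TOY_PORT_MAX
};

struct toy_uart_config {
    const char *name;
    unsigned int fifo_size;
};

/* Static per-type defaults, standing in for the driver's built-in table. */
static const struct toy_uart_config toy_uart_table[TOY_PORT_MAX] = {
    [TOY_PORT_UNKNOWN]  = { "unknown", 1 },
    [TOY_PORT_16550A]   = { "16550A", 16 },
    [TOY_PORT_BIG_FIFO] = { "bigfifo", 64 },
};

struct toy_port {
    enum toy_port_type type;
    unsigned int fifosize;      /* 0 means "not set by firmware/DT" */
};

/*
 * Fill in the FIFO size from the static table only when nothing
 * (for example a device tree property) has set it already.
 */
static void toy_config_port(struct toy_port *port)
{
    assert(port != NULL && port->type < TOY_PORT_MAX);

    if (!port->fifosize)
        port->fifosize = toy_uart_table[port->type].fifo_size;
}
```

With this shape, a value read from the device tree survives as long as the port type stays fixed; re-probing the type (the autoconfig() case mentioned above) is what would overwrite it.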





linux-next: manual merge of the usb-gadget tree with the usb.current tree

2013-03-21 Thread Stephen Rothwell
Hi Felipe,

Today's linux-next merge of the usb-gadget tree got a conflict in
drivers/usb/gadget/net2280.c between commit 8119b55aed81 ("USB: gadget:
net2280: remove leftover driver->unbind call in error pathway") from the
usb.current tree and commit 68abc94f8de8 ("usb: gadget: net2280: don't touch
gadget.dev.driver") from the usb-gadget tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

diff --cc drivers/usb/gadget/net2280.c
index 3bd0f99,691cc65..000
--- a/drivers/usb/gadget/net2280.c
+++ b/drivers/usb/gadget/net2280.c
@@@ -1924,7 -1920,7 +1920,6 @@@ static int net2280_start(struct usb_gad
  err_func:
device_remove_file (&dev->pdev->dev, &dev_attr_function);
  err_unbind:
-   dev->gadget.dev.driver = NULL;
 -  driver->unbind (&dev->gadget);
dev->driver = NULL;
return retval;
  }




Re: [PATCH 1/3] rbtree_test: use pr_info for module prefix in messages

2013-03-21 Thread Michel Lespinasse
On Thu, Mar 21, 2013 at 7:51 PM, Davidlohr Bueso  wrote:
> On Tue, 2013-03-19 at 11:54 -0600, Shuah Khan wrote:
>> On Tue, Mar 19, 2013 at 11:14 AM, Davidlohr Bueso
>>  wrote:
>> > On Tue, 2013-03-19 at 10:29 -0600, Shuah Khan wrote:
>> >> On Mon, Mar 18, 2013 at 5:20 PM, Davidlohr Bueso wrote:
>> >> > This provides nicer message output. Since it seems more appropriate
>> >> > for the nature of this module, also use KERN_INFO instead of other
>> >> > levels.
>> >>
>> >> Why are you changing the ALERTs to INFO?
>> >
>> > Because of the nature of the messages. They don't justify having a
>> > KERN_ALERT level (requiring immediate attention), and it seems a lot
>> > more suitable to use INFO instead.
>> >
>>
>> Hmm. I see interval_tree_test using the same alerts. It almost looks
>> like the start and end of a test are meant to be alerts. I am not
>> saying it shouldn't be changed, however looking for a stronger reason
>> than "it seems a lot more suitable to use INFO instead". Are there any
>> use-cases in which KERN_ALERTs cause problems?
>>
>
> No 'issue' particularly, just common sense. In any case I have no
> problem reverting the changes back to KERN_ALERT, no big deal.
>
> Andrew, Michel, do you have any preferences? I'm mostly interested in
> patch 3/3, do you have any objections?

Sorry for the late reply - I have a lot of upstream email to catch up to.

No objection to the change but I also have to say I'm not quite sure
what's the motivation - it'd be easier if you had a 0/3 mail to
explain the issue. In particular, I'm not sure if you've been trying
to use the test compiled in rather than as a module (which is all I've
ever built it as myself :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


linux-next: manual merge of the usb-gadget tree with the usb.current tree

2013-03-21 Thread Stephen Rothwell
Hi Felipe,

Today's linux-next merge of the usb-gadget tree got a conflict in
drivers/usb/gadget/net2272.c between commit eda81bea894e ("usb: gadget:
net2272: finally convert "CONFIG_USB_GADGET_NET2272_DMA"") from the
usb.current tree and commit c36cbfc045bf ("usb: gadget: net2272: remove
unused DMA_ADDR_INVALID") from the usb-gadget tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

diff --cc drivers/usb/gadget/net2272.c
index 32524b63,03e4104..000
--- a/drivers/usb/gadget/net2272.c
+++ b/drivers/usb/gadget/net2272.c
@@@ -58,8 -58,7 +58,7 @@@ static const char * const ep_name[] = 
"ep-a", "ep-b", "ep-c",
  };
  
- #define DMA_ADDR_INVALID  (~(dma_addr_t)0)
 -#ifdef CONFIG_USB_GADGET_NET2272_DMA
 +#ifdef CONFIG_USB_NET2272_DMA
  /*
   * use_dma: the NET2272 can use an external DMA controller.
   * Note that since there is no generic DMA api, some functions,




Re: PREEMPT_RT vs 'hrtimer: Prevent hrtimer_enqueue_reprogram race'

2013-03-21 Thread Ben Hutchings
On Thu, 2013-03-21 at 22:31 -0400, Steven Rostedt wrote:
> On Fri, 2013-03-22 at 01:11 +, Ben Hutchings wrote:
> > Commit b22affe0aef4 'hrtimer: Prevent hrtimer_enqueue_reprogram race'
> > conflicts with the RT patches
> > hrtimer-fixup-hrtimer-callback-changes-for-preempt-r.patch and
> > peter_zijlstra-frob-hrtimer.patch, as they all change
> > hrtimer_enqueue_reprogram().  It seems that the changes in the RT
> > patches now belong in __hrtimer_start_range_ns().
> > 
> > Since I haven't seen any RT releases in a while, here's what I came up
> > with for 3.2-rt:
> 
> Note, I posted a fix on Tuesday:
> 
> https://lkml.org/lkml/2013/3/19/369

Thanks.  I did search GMANE with some obvious terms but I think its
index is lagging.

> I'm waiting for Thomas to give his OK on it before releasing the series.
> He told me he'll have a look at it tomorrow. I've already run the series
> through all my tests, and will post it immediately after I get the OK.
> Or if there's an issue I will have to fix it and rerun my tests.
> 
> 
> > 
> > ---
> > From: Thomas Gleixner 
> > Date: Fri, 3 Jul 2009 08:44:31 -0500
> > Subject: hrtimer: fixup hrtimer callback changes for preempt-rt
[...]
> > @@ -1011,6 +1023,26 @@ int __hrtimer_start_range_ns(struct hrti
> >  */
> > if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases)
> > && hrtimer_enqueue_reprogram(timer, new_base)) {
> > +#ifdef CONFIG_PREEMPT_RT_BASE
> > +   again:
> 
> What kernel are you working with? I don't see anywhere the "again:"
> within a PREEMPT_RT_BASE block.

I'm rebasing the rt patch series generated with
'git format-patch v3.2.39..v3.2.39-rt59-rebase' on top of v3.2.41 (plus
Debian changes, which introduce some trivial textual conflicts).

So these patches were previously:

commit c495d005449523772e27a22fb74814dc3cebff8e
Author: Thomas Gleixner 
Date:   Fri Jul 3 08:44:31 2009 -0500

hrtimer: fixup hrtimer callback changes for preempt-rt

commit 80cc960e628509c72f63a7327a4dc22707a02b81
Author: Peter Zijlstra 
Date:   Fri Aug 12 17:39:54 2011 +0200

hrtimer: Don't call the timer handler from hrtimer_start

[...]
> This is very similar to what I came up with.
[...]
> Yep, looks like we are on the right track  :-)
[...]

Good to hear.

Ben.

-- 
Ben Hutchings
Make three consecutive correct guesses and you will be considered an expert.




Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue

2013-03-21 Thread Eric Wong
Arve Hjønnevåg  wrote:
> On Thu, Mar 21, 2013 at 4:52 AM, Eric Wong  wrote:
> > Changes since v2:
> > * epi->state is no longer atomic, we only cmpxchg in ep_poll_callback
> >   now and rely on implicit barriers in other places for reading.
> > * intermediate EP_STATE_DEQUEUE removed, this (with xchg) caused too
> >   much overhead in the ep_send_events loop and could not eliminate
> >   starvation dangers from improper EPOLLET usage (the original code
> >   had this problem, too, the window is just a few cycles larger, now).
> > * minor code cleanups

> > /*
> >  * Activate ep->ws before deactivating epi->ws to prevent
> 
> Does anything deactivate ep->ws now?

Oops, I left that out when I killed ep_scan_ready_list.
But I think we need a different approach to wakeup sources in
this series...

> > +   /*
> > +* reset item state for EPOLLONESHOT and EPOLLET
> > +* no barrier here, rely on ep->mtx release for write barrier
> > +*/
> 
> What happens if ep_poll_callback runs before you set epi->state below?
> It used to queue on ep->ovflist and call __pm_stay_awake on ep->ws,
> but now it does not appear to do anything.
> 
> > +   epi->state = EP_STATE_IDLE;
> > }
> >
> > return eventcnt;
> >  }
> >

With EPOLLET and improper usage (not hitting EAGAIN), the event now
has a larger window to be lost (as mentioned in my changelog).
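
For reference, "hitting EAGAIN" on the userspace side usually looks something like the sketch below. It is a generic illustration and not code from this series: set_nonblocking() and drain_fd() are hypothetical helper names, and the caller is assumed to have registered the descriptor with EPOLLET and to call drain_fd() each time epoll_wait() reports it readable.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put the descriptor in non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Read until the kernel says there is nothing left (EAGAIN) or EOF.
 * Returns the number of bytes consumed, or -1 on a real error.
 * Stopping earlier is the "improper usage" case: with EPOLLET no new
 * event is generated for data that was already pending.
 */
static ssize_t drain_fd(int fd)
{
    char buf[4096];
    ssize_t total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;               /* EOF */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;               /* fully drained */
        return -1;
    }
}
```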

As far as correct __pm_stay_awake/__pm_relax handling, perhaps adding
an atomic counter to struct eventpoll (or each epitem) will work?

If we go with atomic counter in struct eventpoll, is per-epitem
wakeup_source still necessary?  We have space in epitem now, but
maybe one day we might need it.

Thanks for looking at this patch.

Btw, I'm curious; which applications use EPOLLWAKEUP?

My epoll work is focused on network servers with thousands of clients,
and I don't think any of them use (or have use for) EPOLLWAKEUP.
But I will keep EPOLLWAKEUP users in mind when working on epoll :)


nfsv4.1 client refuses to suspend

2013-03-21 Thread ycnian
Suspending an nfsv4.1 client fails with the following message:

Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
nfsv4.1-svc S 88007889f2e0 0  3191  2 0x0080   
 88007b2f3e28 0046 88007b2f2010 000127c0
 880079b08000 000127c0 88007b2f3fd8 000127c0
 88007b2f3fd8 000127c0 81a14410 880079b08000
Call Trace:
 [] schedule+0x64/0x66
 [] nfs41_callback_svc+0x100/0x129 [nfsv4]
 [] ? wake_up_bit+0x2a/0x2a
 [] ? nfs_callback_up+0x548/0x548 [nfsv4]
 [] kthread+0xb5/0xbd
 [] ? kthread_freezable_should_stop+0x65/0x65
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_freezable_should_stop+0x65/0x65

I read these threads:
[1] nfs/sunrpc: allow freezing of tasks with NFS calls in flight
[2] LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!
and then modified nfs41_callback_svc(). It works on my machine. I don't know
the details of freezing, so I'm not sure whether the modification is reasonable.
This is not a formal patch. Thanks.

Signed-off-by: Yanchuan Nian 
---
 fs/nfs/callback.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index 5088b57..8addb7b 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -138,7 +138,7 @@ nfs41_callback_svc(void *vrqstp)
 				error);
 		} else {
 			spin_unlock_bh(&serv->sv_cb_lock);
-			schedule();
+			freezable_schedule();
 		}
 		finish_wait(&serv->sv_cb_waitq, &wq);
 	}
-- 
1.7.4.4



Re: [PATCH] init: fix name of root device in /proc/mounts

2013-03-21 Thread Rob Landley

On 03/20/2013 04:11:25 PM, William Hubbs wrote:

On Wed, Mar 20, 2013 at 02:03:20AM -0500, Rob Landley wrote:
> On 03/19/2013 07:20:17 PM, William Hubbs wrote:
> > On Tue, Mar 19, 2013 at 04:17:11PM -0700, H. Peter Anvin wrote:
> > > On 03/19/2013 03:28 PM, William Hubbs wrote:
> > > > The issue is that /dev/root appears in /proc/mounts if you do not
> > > > boot with an initramfs, but /dev/root is not a device node. In the
> > > > past, udev created a symbolic link from /dev/root to the
> > > > appropriate block device, but it does not do this any longer. Also,
> > > > devtmpfs does not create this symbolic link.
> > > >
> > > > This is causing bugs with software that depends on the existence
> > > > of /dev/root [2] for example.
> > >
> > > Seems okay to me, although even better would be to use the udev name
> > > of the device in question.
> >
> > I'm not following what you mean.
> >
> > The problem is that "/dev/root" should not be in /proc/mounts,
> > since there is always another entry that points to the root
> > file system.
>
> What gave you that idea?
>
> wget http://landley.net/aboriginal/bin/system-image-i686.tar.bz2
> extract it and ./run-emulator.sh and in there:
>
> (i686:1) /home # cat /proc/mounts
> rootfs / rootfs rw 0 0
> /dev/root / squashfs ro,relatime 0 0
> proc /proc proc rw,relatime 0 0
> sys /sys sysfs rw,relatime 0 0
> dev /dev devtmpfs rw,relatime,size=63072k,nr_inodes=15768,mode=755 0 0
> dev/pts /dev/pts devpts rw,relatime,mode=600 0 0
> /tmp /tmp tmpfs rw,relatime 0 0
> /home /home tmpfs rw,relatime 0 0
>
> Userspace can totally determine what /dev/root points to, I made mdev
> do it in 2006 (udev started doing so shortly thereafter). Busybox git
> commit a7e3d052.:4

There are situations where it doesn't work though -- suppose that root
is btrfs for example.


So you're saying there's a bug in btrfs?


Also, the other message that answered you is correct, the udev
maintainers say we should not be relying on /dev/root at all so to make
it work distro packagers have to add a rule themselves.


These udev maintainers?

  http://lkml.indiana.edu/hypermail/linux/kernel/1210.0/01889.html

This is the udev that got Katamari'd into systemd, not one of the forks  
that ran screaming trying not to get sucked into the latest escapade of  
King of All Cosmos?


Look, if you want to add /dev/root to devtmpfs, that makes a certain  
amount of sense. But your patch seems to have missed do_mounts.c doing:


  if (strncmp(root_device_name, "/dev/", 5) == 0)
          root_device_name += 5;

Which means that if the user does "root=sda1" on the kernel command  
line you're not passing an absolute path to create_dev():


-   create_dev("/dev/root", ROOT_DEV);
-   mount_block_root("/dev/root", root_mountflags);
+   if (saved_root_name[0]) {
+   create_dev(saved_root_name, ROOT_DEV);

And last I checked that means /proc/mounts will have a relative path in  
it...
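
To make the relative-path concern concrete, here is a minimal userspace sketch. The helpers strip_dev_prefix() and is_absolute_path() are hypothetical names for illustration, not kernel functions; the first only mirrors the two lines quoted from do_mounts.c above.

```c
#include <string.h>

/*
 * Hypothetical helper mirroring the two quoted lines from do_mounts.c:
 * a leading "/dev/" is dropped, anything else is passed through as-is.
 */
const char *strip_dev_prefix(const char *root_device_name)
{
	if (strncmp(root_device_name, "/dev/", 5) == 0)
		root_device_name += 5;
	return root_device_name;
}

/* Hypothetical helper: a path is absolute only if it starts with '/'. */
int is_absolute_path(const char *name)
{
	return name[0] == '/';
}
```

Both "root=/dev/sda1" and "root=sda1" reduce to the same device name after stripping, which suggests the kernel accepts either form. The raw string from the command line is only absolute in the first case, so handing it unchanged to a function that expects a path would give a relative name in the second.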


I.E. you're modifying kernel code you're not familiar with to fix a  
non-problem caused by your unfamiliarity with the corresponding  
userspace code. I'm really not seeing the upside here.


Rob


[PATCH] perf evsel: Fix format string in perf_evsel__open_strerror()

2013-03-21 Thread Namhyung Kim
From: Namhyung Kim 

In case of EPERM or EACCES, there was a dubious "%s" before the actual
string, so that a user would see "You may not have permission to
collect %sstats." instead of "You may not ... system-wide stats.".

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/evsel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1adb824610f0..b190d57e926e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1495,7 +1495,7 @@ int perf_evsel__open_strerror(struct perf_evsel *evsel,
switch (err) {
case EPERM:
case EACCES:
-   return scnprintf(msg, size, "%s",
+   return scnprintf(msg, size,
 "You may not have permission to collect %sstats.\n"
 "Consider tweaking /proc/sys/kernel/perf_event_paranoid:\n"
 " -1 - Not paranoid at all\n"
-- 
1.7.11.7
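
For readers following along, a minimal userspace sketch of the bug described in the commit message. It uses the standard snprintf() as a stand-in for scnprintf() (an assumption for illustration), and format_buggy() and format_fixed() are hypothetical names, not perf functions.

```c
#include <stdio.h>

/*
 * Buggy shape: "%s" is the format, so the message is copied verbatim
 * and the "%s" inside it is never substituted.
 */
int format_buggy(char *msg, size_t size, const char *scope)
{
	(void)scope;	/* never consumed by the "%s" format */
	return snprintf(msg, size, "%s",
			"You may not have permission to collect %sstats.\n");
}

/*
 * Fixed shape: the message itself is the format string, so the "%s"
 * in it is replaced by the scope argument.
 */
int format_fixed(char *msg, size_t size, const char *scope)
{
	return snprintf(msg, size,
			"You may not have permission to collect %sstats.\n",
			scope);
}
```

Calling format_fixed() with "system-wide " as the scope yields the intended sentence, while format_buggy() leaves the literal "%sstats." in the output.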



Re: [PATCH 1/3] rbtree_test: use pr_info for module prefix in messages

2013-03-21 Thread Davidlohr Bueso
On Tue, 2013-03-19 at 11:54 -0600, Shuah Khan wrote:
> On Tue, Mar 19, 2013 at 11:14 AM, Davidlohr Bueso
>  wrote:
> > On Tue, 2013-03-19 at 10:29 -0600, Shuah Khan wrote:
> >> On Mon, Mar 18, 2013 at 5:20 PM, Davidlohr Bueso  
> >> wrote:
> >> > This provides nicer message output. Since it seems more appropriate
> >> > for the nature of this module, also use KERN_INFO instead of other
> >> > levels.
> >>
> >> Why are you changing the ALERTs to INFO?
> >
> > Because of the nature of the messages. They don't justify having a
> > KERN_ALERT level (requiring immediate attention), and it seems a lot
> > more suitable to use INFO instead.
> >
> 
> Hmm. I see interval_tree_test using the same alerts. It almost looks
> like the start and end of a test are meant to be alerts. I am not
> saying it shouldn't be changed, however looking for a stronger reason
> than "it seems a lot more suitable to use INFO instead". Are there any
> use-cases in which KERN_ALERTs cause problems?
> 

No 'issue' particularly, just common sense. In any case I have no
problem reverting the changes back to KERN_ALERT, no big deal.

Andrew, Michel, do you have any preferences? I'm mostly interested in
patch 3/3, do you have any objections?

Thanks,
Davidlohr



Re: [BUG] bisected: PandaBoard smsc95xx ethernet driver error from USB timeout

2013-03-21 Thread Frank Rowand
On 03/21/13 07:41, Alan Stern wrote:
> On Wed, 20 Mar 2013, Frank Rowand wrote:
> 
>> Hi All,
>>
>> Not quite sure where the problem is (USB, OMAP, smsc95xx driver, other???),
>> so casting the nets wide...
>>
>> The PandaBoard frequently fails to boot with an eth0 error when mounting
>> the root file system via NFS (ethernet driver fails due to a USB timeout;
>> no ethernet means NFS won't work).  A typical set of error messages is:
>>
>> [3.264373] smsc95xx 1-1.1:1.0: usb_probe_interface
>> [3.269500] smsc95xx 1-1.1:1.0: usb_probe_interface - got id
>> [3.275543] smsc95xx v1.0.4
>> [8.078674] smsc95xx 1-1.1:1.0: eth0: register 'smsc95xx' at usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet, 82:b9:1d:fa:67:0d
>> [8.091003] hub 1-1:1.0: state 7 ports 5 chg  evt 0002
>> [   13.509918] usb 1-1.1: swapper/0 timed out on ep0out len=0/4
>> [   13.515869] smsc95xx 1-1.1:1.0: eth0: Failed to write register index 0x0108
>> [   13.523559] smsc95xx 1-1.1:1.0: eth0: Failed to write ADDRL: -110
>> [   13.529998] IP-Config: Failed to open eth0
>>
>> I have bisected this to:
>>
>>   commit 18aafe64d75d0e27dae206cacf4171e4e485d285
>>   Author: Alan Stern 
>>   Date:   Wed Jul 11 11:23:04 2012 -0400
>>
>>  USB: EHCI: use hrtimer for the I/O watchdog
> 
> I don't understand how that commit could cause a timeout unless there 
> are at least two other bugs present in your system.
> 
>> Note that to compile this version of the kernel, an additional fix must
>> also be applied:
>>
>>   commit ba5952e0711b14d8d4fe172671f8aa6091ace3ee
>>   Author: Ming Lei 
>>   Date:   Fri Jul 13 17:25:24 2012 +0800
>>
>>  USB: ehci-omap: fix compile failure(v1)
>>
>> The symptom can be worked around by retrying the USB access if a timeout
>> occurs.  This is clearly _not_ the fix, just a hack that I used to
>> investigate the problem:
>>
>>   http://article.gmane.org/gmane.linux.rt.user/9773
>>
>> My kernel configuration is:
>>
>>   arch/arm/configs/omap2plus_defconfig
>>
>>   plus to get the ethernet driver I add:
>>
>> CONFIG_USB_EHCI_HCD
>> CONFIG_USB_NET_SMSC95XX
>>
>> I found the problem on 3.6.11, but have not replicated it on 3.9-rcX
>> yet because my config fails to build on 3.9-rc1 and 3.9-rc2.  I'll try
>> to work on that issue tomorrow.
> 
> Let me know how it works out.

My PandaBoard builds fail on 3.9-rcX due to ARM multiplatform issues.
Either there is something I need to change about the way I build it,
or it is broken (that is a side issue).  My simple expedient was to
hack around multiplatform, and just make it build (patch below if
anyone else wants a _temporary_ hack).

The problem appears to not be present in 3.9-rc3.  In older kernel versions,
the worst case to see the problem was 18 boots.  For 3.9-rc3 I booted 42
times without seeing the problem.

The problem occurs at least up through 3.8.  I'll try to reverse bisect
between 3.8 and 3.9-rc3 to see when the problem disappeared (I'm running
short of time, so no promises for a near term result).

-Frank


This patch is a _temporary_ hack, not fit for man or beast.  Avert
your eyes, do not apply to any respectable repository!

---
 arch/arm/Kconfig  |2   1 + 1 - 0 !
 arch/arm/Makefile |2   2 + 0 - 0 !
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: b/arch/arm/Kconfig
===
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1013,7 +1013,7 @@ config ARCH_MULTI_V7
bool "ARMv7 based platforms (Cortex-A, PJ4, Krait)"
default y
select ARCH_MULTI_V6_V7
-   select ARCH_VEXPRESS
+   select ARCH_VEXPRESS if !ARCH_OMAP2PLUS
select CPU_V7
 
 config ARCH_MULTI_V6_V7
Index: b/arch/arm/Makefile
===
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -227,8 +227,10 @@ else
 MACHINE  :=
 endif
 ifeq ($(CONFIG_ARCH_MULTIPLATFORM),y)
+ifneq ($(CONFIG_ARCH_OMAP2PLUS),y)
 MACHINE  :=
 endif
+endif
 
 machdirs := $(patsubst %,arch/arm/mach-%/,$(machine-y))
 platdirs := $(patsubst %,arch/arm/plat-%/,$(plat-y))



Re: [PATCH] drivers: cpufreq: kirkwood: fix coccicheck warnings

2013-03-21 Thread Viresh Kumar
On Fri, Mar 22, 2013 at 5:40 AM, Silviu-Mihai Popescu
 wrote:
> Convert all uses of devm_request_and_ioremap() to the newly introduced
> devm_ioremap_resource() which provides more consistent error handling.
>
> devm_ioremap_resource() provides its own error messages so all explicit
> error messages can be removed from the failure code paths.
>
> Signed-off-by: Silviu-Mihai Popescu 
> ---
>  drivers/cpufreq/kirkwood-cpufreq.c |8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)

Acked-by: Viresh Kumar 


Re: [PATCH 02/11] ARM: remove extra timer-sp control register clearing

2013-03-21 Thread Rob Herring
On 03/21/2013 02:23 PM, Russell King - ARM Linux wrote:
> On Wed, Mar 20, 2013 at 05:54:02PM -0500, Rob Herring wrote:
>> From: Rob Herring 
>>
>> The timer-sp initialization code clears the control register before
>> initializing the timers, so every platform doing this is redundant.
>>
>> For unused timers, we should not care what state they are in.
> 
> NAK.  We do care what state they're in.  What if they have their interrupt
> enable bit set, and IRQ is shared with the clock event timer?
> 
> No, this patch is just wrong.

Okay, I can have the timer init function clear the register even when
the timer is unused.

Rob



Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread HATAYAMA Daisuke
From: "Eric W. Biederman" 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Thu, 21 Mar 2013 17:54:22 -0700

> Vivek Goyal  writes:
> 
>> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>>
>> [..]
>>> So if starting or end address of PT_LOAD header is not aligned, why
>>> not we simply allocate a page. Copy the relevant data from old memory,
>>> fill rest with zero. That way mmap and read view will be same. There
>>> will be no surprises w.r.t reading old kernel memory beyond what's 
>>> specified by the headers.
>>
>> Copying from old memory might spring surprises w.r.t hw poisoned
>> pages. I guess we will have to disable MCE, read page, enable it
>> back or something like that to take care of these issues.
>>
>> In the past we have recommended makedumpfile to be careful, look
>> at struct pages and make sure we are not reading poisoned pages.
>> But vmcore itself is reading old memory and can run into this
>> issue too.
> 
> Vivek you are overthinking this.
> 
> If there are issues with reading partially exported pages we should
> fix them in kexec-tools or in the kernel where the data is exported.
> 
> In the examples given in the patch what we were looking at were cases
> where the BIOS rightly or wrongly was saying kernel this is my memory
> stay off.  But it was all perfectly healthy memory.
> 
> /proc/vmcore is a simple data dumper and prettifier.  Let's keep it that
> way so that we can predict how it will act when we feed it information.
> /proc/vmcore should not be worrying about or covering up sins elsewhere
> in the system.
> 
> At the level of /proc/vmcore we may want to do something about ensuring
> MCE's don't kill us.  But that is an orthogonal problem.

This is the part of old memory that /proc/vmcore must read at its
initialization to generate its metadata, i.e. the ELF header, program
header table and ELF note segments. The other memory chunks are the part
for which makedumpfile should decide whether to read or avoid them.

Thanks.
HATAYAMA, Daisuke



Re: PREEMPT_RT vs 'hrtimer: Prevent hrtimer_enqueue_reprogram race'

2013-03-21 Thread Steven Rostedt
On Fri, 2013-03-22 at 01:11 +, Ben Hutchings wrote:
> Commit b22affe0aef4 'hrtimer: Prevent hrtimer_enqueue_reprogram race'
> conflicts with the RT patches
> hrtimer-fixup-hrtimer-callback-changes-for-preempt-r.patch and
> peter_zijlstra-frob-hrtimer.patch, as they all change
> hrtimer_enqueue_reprogram().  It seems that the changes in the RT
> patches now belong in __hrtimer_start_range_ns().
> 
> Since I haven't seen any RT releases in a while, here's what I came up
> with for 3.2-rt:

Note, I posted a fix on Tuesday:

https://lkml.org/lkml/2013/3/19/369

I'm waiting for Thomas to give his OK on it before releasing the series.
He told me he'll have a look at it tomorrow. I've already run the series
through all my tests, and will post it immediately after I get the OK.
Or if there's an issue I will have to fix it and rerun my tests.


> 
> ---
> From: Thomas Gleixner 
> Date: Fri, 3 Jul 2009 08:44:31 -0500
> Subject: hrtimer: fixup hrtimer callback changes for preempt-rt
> 
> In preempt-rt we can not call the callbacks which take sleeping locks
> from the timer interrupt context.
> 
> Bring back the softirq split for now, until we fixed the signal
> delivery problem for real.
> 
> Signed-off-by: Thomas Gleixner 
> Signed-off-by: Ingo Molnar 
> [bwh: Pull the changes to hrtimer_enqueue_reprogram() up into
>  __hrtimer_start_range_ns(), following changes in
>  commit b22affe0aef4 'hrtimer: Prevent hrtimer_enqueue_reprogram race'
>  backported into 3.2.40]
> Signed-off-by: Ben Hutchings 
> ---
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -111,6 +111,8 @@ struct hrtimer {
>   enum hrtimer_restart            (*function)(struct hrtimer *);
>   struct hrtimer_clock_base       *base;
>   unsigned long                   state;
> + struct list_head                cb_entry;
> + int                             irqsafe;
>  #ifdef CONFIG_TIMER_STATS
>   int                             start_pid;
>   void                            *start_site;
> @@ -147,6 +149,7 @@ struct hrtimer_clock_base {
>   int index;
>   clockid_t   clockid;
>   struct timerqueue_head  active;
> + struct list_head        expired;
>   ktime_t resolution;
>   ktime_t (*get_time)(void);
>   ktime_t softirq_time;
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -589,8 +589,7 @@ static int hrtimer_reprogram(struct hrti
>* When the callback is running, we do not reprogram the clock event
>* device. The timer callback is either running on a different CPU or
>* the callback is executed in the hrtimer_interrupt context. The
> -  * reprogramming is handled either by the softirq, which called the
> -  * callback or at the end of the hrtimer_interrupt.
> +  * reprogramming is handled at the end of the hrtimer_interrupt.
>*/
>   if (hrtimer_callback_running(timer))
>   return 0;
> @@ -625,6 +624,9 @@ static int hrtimer_reprogram(struct hrti
>   return res;
>  }
>  
> +static void __run_hrtimer(struct hrtimer *timer, ktime_t *now);
> +static int hrtimer_rt_defer(struct hrtimer *timer);
> +
>  /*
>   * Initialize the high resolution related parts of cpu_base
>   */
> @@ -730,6 +732,11 @@ static inline int hrtimer_enqueue_reprog
>  }
>  static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base) { }
>  static inline void retrigger_next_event(void *arg) { }
> +static inline int hrtimer_reprogram(struct hrtimer *timer,
> + struct hrtimer_clock_base *base)
> +{
> + return 0;
> +}
>  
>  #endif /* CONFIG_HIGH_RES_TIMERS */
>  
> @@ -861,9 +868,9 @@ void hrtimer_wait_for_timer(const struct
>  {
>   struct hrtimer_clock_base *base = timer->base;
>  
> - if (base && base->cpu_base && !hrtimer_hres_active(base->cpu_base))
> + if (base && base->cpu_base && !timer->irqsafe)
>   wait_event(base->cpu_base->wait,
> - !(timer->state & HRTIMER_STATE_CALLBACK));
> +!(timer->state & HRTIMER_STATE_CALLBACK));
>  }
>  
>  #else
> @@ -913,6 +920,11 @@ static void __remove_hrtimer(struct hrti
>   if (!(timer->state & HRTIMER_STATE_ENQUEUED))
>   goto out;
>  
> + if (unlikely(!list_empty(&timer->cb_entry))) {
> + list_del_init(&timer->cb_entry);
> + goto out;
> + }
> +
>   next_timer = timerqueue_getnext(&base->active);
>   timerqueue_del(&base->active, &timer->node);
>   if (&timer->node == next_timer) {
> @@ -1011,6 +1023,26 @@ int __hrtimer_start_range_ns(struct hrti
>*/
>   if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases)
>   && hrtimer_enqueue_reprogram(timer, new_base)) {
> +#ifdef CONFIG_PREEMPT_RT_BASE
> + again:

What kernel are you working with? I don't see anywhere the "again:"
within a PREEMPT_RT_BASE block.

> + /*
> +  * Move 

Re: [PATCH 03/11] ARM: timer-sp: convert to use CLKSRC_OF init

2013-03-21 Thread Rob Herring
On 03/21/2013 02:35 PM, Russell King - ARM Linux wrote:
> On Wed, Mar 20, 2013 at 05:54:03PM -0500, Rob Herring wrote:
>> +clk0 = of_clk_get(np, 0);
>> +if (IS_ERR(clk0))
>> +clk0 = NULL;
>> +
>> +/* Get the 2nd clock if the timer has 2 timer clocks */
>> +if (of_count_phandle_with_args(np, "clocks", "#clock-cells") == 3) {
>> +clk1 = of_clk_get(np, 1);
>> +if (IS_ERR(clk1)) {
>> +pr_err("sp804: %s clock not found: %d\n", np->name,
>> +(int)PTR_ERR(clk1));
>> +return;
>> +}
>> +} else
>> +clk1 = clk0;
>> +
>> +irq = irq_of_parse_and_map(np, 0);
>> +if (irq <= 0)
>> +return;
>> +
>> +of_property_read_u32(np, "arm,sp804-has-irq", &irq_num);
>> +if (irq_num == 2)
>> +tmr2_evt = true;
>> +
>> +__sp804_clockevents_init(base + (tmr2_evt ? TIMER_2_BASE : 0),
>> + irq, tmr2_evt ? clk1 : clk0, name);
>> +__sp804_clocksource_and_sched_clock_init(base + (tmr2_evt ? 0 : TIMER_2_BASE),
>> + name, tmr2_evt ? clk0 : clk1, 1);
> 
> This just looks totally screwed to me.
> 
> A SP804 cell has two timers, and has one clock input and two clock
> enable inputs.  The clock input is common to both timers.  The timers
> only count on the rising edge of the clock input when the enable
> input is high.  (the APB PCLK also matters too...)
> 
> Now, the clock enable inputs are controlled by the SP810 system
> controller to achieve different clock rates for each.  So, we *can*
> view an SP804 as having two clock inputs.

Exactly. Effectively, the TIMCLKENx are just dividers of the clock input.

> However, the two clock inputs do not depend on whether one or the
> other has an IRQ or not.  Timer 1 is always clocked by TIMCLK &
> TIMCLKEN1.  Timer 2 is always clocked by TIMCLK & TIMCLKEN2.
> 
> Using the logic above, the clocks depend on how the IRQs are wired
> up... really?  Can you see from my description above why that is
> screwed?  What bearing does the IRQ have on the wiring of the
> clock inputs?

No. I'm simply swapping which timer is used for clksrc vs. clkevt based
on the irq connection DT describes. If only timer 2's irq being hooked
up, then timer 2 is the clkevt. Otherwise I always use timer 1 for the
clkevt because I either have a combined interrupt or timer 1 interrupt
hooked up.

Perhaps re-writing it like this would be more clear:

if (irq_num == 2) {
	__sp804_clockevents_init(base + TIMER_2_BASE, irq, clk1, name);
	__sp804_clocksource_and_sched_clock_init(base, name, clk0, 1);
} else {
	__sp804_clockevents_init(base, irq, clk0, name);
	__sp804_clocksource_and_sched_clock_init(base + TIMER_2_BASE,
						 name, clk1, 1);
}
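
As a standalone illustration of the selection logic, here is a minimal userspace sketch. The struct sp804_roles and the function sp804_pick_roles() are hypothetical names, clocks are represented by their index only, and the 0x20 value for TIMER_2_BASE is an assumption about the register layout.

```c
#include <stdbool.h>

#define TIMER_2_BASE 0x20	/* assumed offset of the second timer in the cell */

/* Hypothetical summary of which timer and clock get which role. */
struct sp804_roles {
	unsigned long clkevt_base;	/* timer used as clock event device */
	int clkevt_clk;			/* index of the clock feeding it */
	unsigned long clksrc_base;	/* timer used as clocksource */
	int clksrc_clk;			/* index of the clock feeding it */
};

/*
 * Only the roles are swapped by the IRQ wiring: the timer at 'base'
 * is always paired with clock 0 and the timer at 'base + TIMER_2_BASE'
 * is always paired with clock 1.
 */
struct sp804_roles sp804_pick_roles(unsigned long base, unsigned int irq_num)
{
	struct sp804_roles r;
	bool tmr2_evt = (irq_num == 2);

	r.clkevt_base = base + (tmr2_evt ? TIMER_2_BASE : 0);
	r.clkevt_clk  = tmr2_evt ? 1 : 0;
	r.clksrc_base = base + (tmr2_evt ? 0 : TIMER_2_BASE);
	r.clksrc_clk  = tmr2_evt ? 0 : 1;
	return r;
}
```

In both branches the clock index follows the timer rather than the interrupt, which is the property under discussion in this thread.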


Rob


Re: [PATCH V2] cpufreq: ARM big LITTLE: Add generic cpufreq driver and its DT glue

2013-03-21 Thread Viresh Kumar
On 22 March 2013 05:20, Rafael J. Wysocki  wrote:
> Please post a complete update patch if you want me to take it.  I'd also
> like it to be ACKed by someone involved in the big-LITTLE work on the arch
> side.

Okay.


Re: [PATCH V3 2/4] cpufreq: governor: Implement per policy instances of governors

2013-03-21 Thread Viresh Kumar
On 22 March 2013 05:14, Rafael J. Wysocki  wrote:
> On Wednesday, March 20, 2013 10:59:13 AM Viresh Kumar wrote:

>> I have queued all patches i had for 3.10 here:
>>
>> http://git.linaro.org/gitweb?p=people/vireshk/linux.git;a=shortlog;h=refs/heads/for-3.10
>
> OK, applied these to linux-pm.git/bleeding-edge.

Thanks.

> At the moment bleeding-edge and linux-next diverged slightly on cpufreq, but
> I hope the bleeding-edge material won't cause build problems to occur, so I'll
> be able to move it to linux-next shortly.

There shouldn't be any build problems, not because I have done all the build
testing properly BUT because my tree is under continuous surveillance by
Fengguang's bot. Any problem with my branches is reported very early :)

>> commit f02fca9a2478088c4f7dadf82d998ae007a56285
>> Author: Viresh Kumar 
>> Date:   Wed Mar 20 10:50:33 2013 +0530
>>
>> fixup! cpufreq: governor: Implement per policy instances of governors
>
> I'd actually prefer you to post complete updated patches instead of these
> fixups.  They are real PITA for me and probably for everybody else trying
> to follow the cpufreq development recently.

Hmm... I always thought fixups are way easier to review (and I still believe
that's true) as they just contain what got changed, so people don't have to
review the whole patch again. BUT people who are looking for complete patches
to apply would be annoyed by this, and hence I always show them the path of my
repo where they can find them. So, what I may do is post fixups and then resend
the patches, so that reviewers know what changed and others can have complete
patches too.

--
viresh


Re: [PATCH v2 4/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page

2013-03-21 Thread Xiao Guangrong
On 03/21/2013 09:14 PM, Gleb Natapov wrote:
> On Wed, Mar 20, 2013 at 04:30:24PM +0800, Xiao Guangrong wrote:
>> Move deletion of the shadow page from the hash list from kvm_mmu_commit_zap_page
>> to kvm_mmu_prepare_zap_page, so that we can free the shadow page out of mmu-lock.
>>
>> Also, delete the invalid shadow page from the hash list since this page can
>> not be reused anymore. This makes resetting the mmu-cache easier - we do not
>> need to care about all hash entries after resetting the mmu-cache.
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c |8 ++--
>>  1 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index dc37512..5578c91 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1472,7 +1472,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>  {
>>  ASSERT(is_empty_shadow_page(sp->spt));
>> -hlist_del(&sp->hash_link);
>> +
>>  list_del(&sp->link);
>>  free_page((unsigned long)sp->spt);
>>  if (!sp->role.direct)
>> @@ -1660,7 +1660,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>  
>>  #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
>>  for_each_gfn_sp(_kvm, _sp, _gfn)\
>> -if ((_sp)->role.direct || (_sp)->role.invalid) {} else
>> +if ((_sp)->role.direct ||   \
>> +  ((_sp)->role.invalid && WARN_ON(1))) {} else
>>  
>>  /* @sp->gfn should be write-protected at the call site */
>>  static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>> @@ -2079,6 +2080,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>  unaccount_shadowed(kvm, sp->gfn);
>>  if (sp->unsync)
>>  kvm_unlink_unsync_page(kvm, sp);
>> +
>> +hlist_del_init(&sp->hash_link);
>> +
> Now we delete roots from hash, but leave it on active_mmu_pages list. Is
> this OK?

It is okay, I think. The hash list is only used to find a gfn's shadow page. An
invalid shadow page does not contain any useful guest content and will be freed
soon after vcpu reload.

IIRC, we did it when we used rcu to free shadow pages.



[PATCH 2/2] rtc: rtc-88pm80x: add CONFIG_PM_SLEEP to suspend/resume functions

2013-03-21 Thread Jingoo Han
Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following
build warning when CONFIG_PM_SLEEP is not selected. This is because
sleep PM callbacks defined by SIMPLE_DEV_PM_OPS are only used when
the CONFIG_PM_SLEEP is enabled.

drivers/rtc/rtc-88pm80x.c:238:12: warning: 'pm80x_rtc_suspend' defined but not used [-Wunused-function]
drivers/rtc/rtc-88pm80x.c:243:12: warning: 'pm80x_rtc_resume' defined but not used [-Wunused-function]

Signed-off-by: Jingoo Han 
---
 drivers/rtc/rtc-88pm80x.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/rtc/rtc-88pm80x.c b/drivers/rtc/rtc-88pm80x.c
index 76f9505..f3742f3 100644
--- a/drivers/rtc/rtc-88pm80x.c
+++ b/drivers/rtc/rtc-88pm80x.c
@@ -234,7 +234,7 @@ static const struct rtc_class_ops pm80x_rtc_ops = {
.alarm_irq_enable = pm80x_rtc_alarm_irq_enable,
 };
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 static int pm80x_rtc_suspend(struct device *dev)
 {
return pm80x_dev_suspend(dev);
-- 
1.7.2.5
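
To show why the callbacks end up unused, here is a minimal userspace sketch. Everything in it is a simplified stand-in and an assumption for illustration: MOCK_PM_SLEEP plays the role of CONFIG_PM_SLEEP, MOCK_SIMPLE_DEV_PM_OPS and SET_SLEEP_OPS imitate the shape of the kernel macros, and struct dev_pm_ops is reduced to two members.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel structure (assumption). */
struct device;
struct dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
};

#define MOCK_PM_SLEEP 1	/* stand-in for CONFIG_PM_SLEEP */

/* The sleep callbacks are only referenced when sleep support is on. */
#ifdef MOCK_PM_SLEEP
#define SET_SLEEP_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, .resume = resume_fn,
#else
#define SET_SLEEP_OPS(suspend_fn, resume_fn)
#endif

#define MOCK_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct dev_pm_ops name = { SET_SLEEP_OPS(suspend_fn, resume_fn) }

/*
 * Guarding the callbacks with the same symbol the macro tests keeps
 * them from being defined when nothing would reference them.
 */
#ifdef MOCK_PM_SLEEP
static int demo_suspend(struct device *dev) { (void)dev; return 0; }
static int demo_resume(struct device *dev) { (void)dev; return 0; }
#endif

static MOCK_SIMPLE_DEV_PM_OPS(demo_pm_ops, demo_suspend, demo_resume);
```

If the functions were guarded by a broader symbol than the one the macro checks, a configuration with the broad symbol set and the sleep symbol unset would define the static functions while the macro drops every reference to them, which is the situation the warnings above describe.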




[PATCH 1/2] rtc: rtc-ds1374: add CONFIG_PM_SLEEP to suspend/resume functions

2013-03-21 Thread Jingoo Han
Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following
build warning when CONFIG_PM_SLEEP is not selected. This is because
sleep PM callbacks defined by SIMPLE_DEV_PM_OPS are only used when
the CONFIG_PM_SLEEP is enabled.

drivers/rtc/rtc-ds1374.c:413:12: warning: 'ds1374_suspend' defined but not used [-Wunused-function]
drivers/rtc/rtc-ds1374.c:422:12: warning: 'ds1374_resume' defined but not used [-Wunused-function]

Signed-off-by: Jingoo Han 
---
 drivers/rtc/rtc-ds1374.c |   10 +++---
 1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/rtc/rtc-ds1374.c b/drivers/rtc/rtc-ds1374.c
index fef7686..67cd1e3 100644
--- a/drivers/rtc/rtc-ds1374.c
+++ b/drivers/rtc/rtc-ds1374.c
@@ -409,7 +409,7 @@ static int ds1374_remove(struct i2c_client *client)
return 0;
 }
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 static int ds1374_suspend(struct device *dev)
 {
struct i2c_client *client = to_i2c_client(dev);
@@ -427,19 +427,15 @@ static int ds1374_resume(struct device *dev)
disable_irq_wake(client->irq);
return 0;
 }
+#endif
 
 static SIMPLE_DEV_PM_OPS(ds1374_pm, ds1374_suspend, ds1374_resume);
 
-#define DS1374_PM (&ds1374_pm)
-#else
-#define DS1374_PM NULL
-#endif
-
 static struct i2c_driver ds1374_driver = {
.driver = {
.name = "rtc-ds1374",
.owner = THIS_MODULE,
-   .pm = DS1374_PM,
+   .pm = &ds1374_pm,
},
.probe = ds1374_probe,
.remove = ds1374_remove,
-- 
1.7.2.5




Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

2013-03-21 Thread Xiao Guangrong
On 03/22/2013 06:21 AM, Marcelo Tosatti wrote:
> On Wed, Mar 20, 2013 at 04:30:20PM +0800, Xiao Guangrong wrote:
>> Changelog:
>> V2:
>>   - do not reset n_requested_mmu_pages and n_max_mmu_pages
>>   - batch free root shadow pages to reduce vcpu notification and mmu-lock
>> contention
>>   - remove the first patch that introduce kvm->arch.mmu_cache since we only
>> 'memset zero' on hashtable rather than all mmu cache members in this
>> version
>>   - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all
>>
>> * Issue
>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>> walk and zap all shadow pages one by one, also it need to zap all guest
>> page's rmap and all shadow page's parent spte list. Particularly, things
>> become worse if guest uses more memory or vcpus. It is not good for
>> scalability.
> 
> Xiao, 
> 
> The bulk removal of shadow pages from mmu cache is nerving - it creates
> two codepaths to delete a data structure: the usual, single entry one
> and the bulk one.
> 
> There are two main usecases for kvm_mmu_zap_all(): to invalidate the
> current mmu tree (from kvm_set_memory) and to tear down all pages
> (VM shutdown).
> 
> The first usecase can use your idea of an invalid generation number
> on shadow pages. That is, increment the VM generation number, nuke the root
> pages and that's it. 
> 
> The modifications should be contained to kvm_mmu_get_page() mostly,
> correct? (would also have to keep counters to increase SLAB freeing 
> ratio, relative to number of outdated shadow pages).

Yes.

> 
> And then have codepaths that nuke shadow pages break from the spinlock,

I think this is not needed any more. We can let mmu_notify use the generation
number to invalidate all shadow pages; then we only need to free them after
all vcpus are down and mmu_notify is unregistered. At that point there is no
lock contention and we can free them directly.
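
A minimal userspace sketch of the generation-number idea discussed above. The types and names (struct vm_sketch, struct sp_sketch, zap_all_fast_sketch) are hypothetical stand-ins, not the real KVM structures: bumping one counter makes every existing page stale, and lookups simply ignore stale pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real KVM types. */
struct vm_sketch {
	unsigned long valid_gen;	/* current generation of this VM */
};

struct sp_sketch {
	unsigned long gen;	/* generation the page was created in */
	unsigned long gfn;	/* guest frame number it shadows */
};

/* A page is usable only if it belongs to the current generation. */
bool sp_is_valid_sketch(const struct vm_sketch *vm, const struct sp_sketch *sp)
{
	return sp->gen == vm->valid_gen;
}

/* "Zap all" in O(1): every existing page becomes stale at once. */
void zap_all_fast_sketch(struct vm_sketch *vm)
{
	vm->valid_gen++;
}

/* Lookup that skips stale pages, as a get_page-style path would. */
struct sp_sketch *find_page_sketch(const struct vm_sketch *vm,
				   struct sp_sketch *pages, size_t n,
				   unsigned long gfn)
{
	for (size_t i = 0; i < n; i++)
		if (pages[i].gfn == gfn && sp_is_valid_sketch(vm, &pages[i]))
			return &pages[i];
	return NULL;
}
```

Stale pages still occupy memory until something frees them, which is why the thread also talks about when the actual freeing can happen.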

> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).

BTW, to be honest, I do not think spin_needbreak is a good way - it does
not fix the hot-lock contention; it just burns more cpu time to avoid
possible soft lock-ups.

In particular, zap-all-shadow-pages can make other vcpus fault and contend
for mmu-lock; then zap-all-shadow-pages releases mmu-lock and waits while the
other vcpus create page tables again. zap-all-shadow-pages needs a long time
to finish; in the worst case it never completes under intensive vcpu and
memory usage.

I still think the right way to fix this kind of thing is to optimize
mmu-lock.

> That would also solve the current issues without using more memory 
> for pte_list_desc and without the delicate "Reset MMU cache" step.
> 
> What do you think?

I agree with your point, Marcelo! I will redesign it. Thank you!



[PATCH] init/Kconfig: make EXPERT as config instead of menuconfig

2013-03-21 Thread zhangwei(Jovi)
There is no EXPERT menu guard, and no config item is included in an
EXPERT menu, so change it to a config, not a menuconfig.

This makes things clearer for users running 'make menuconfig'
or 'make nconfig'.

Signed-off-by: zhangwei(Jovi) 
Cc: Randy Dunlap 
Cc: Michal Marek 
---
 init/Kconfig |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/init/Kconfig b/init/Kconfig
index 5341d72..0495453 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1177,7 +1177,7 @@ config SYSCTL
 config ANON_INODES
bool

-menuconfig EXPERT
+config EXPERT
bool "Configure standard kernel features (expert users)"
# Unhide debug options, to make the on-by-default options visible
select DEBUG_KERNEL
-- 
1.7.9.7




Re: [PATCH] wfcqueue: functions for local append and enqueue

2013-03-21 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> With level-triggered epoll, append/enqueue operations to the
> local/locked queues increase performance by avoiding unnecessary atomic
> operations and barriers.  These are necessary to avoid performance
> regressions when looping through ep_send_events and appending many
> items to a queue.

Sounds like a good idea,

> 
> Signed-off-by: Eric Wong 
> Cc: Mathieu Desnoyers 
> Cc: Lai Jiangshan 
> Cc: Paul E. McKenney 
> Cc: Stephen Hemminger 
> Cc: Davide Libenzi 
> ---
>   Benchmark for this coming with updated epoll patches.
> 
>  include/linux/wfcqueue.h | 43 +++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
> index 9464a0c..7eb2aaa 100644
> --- a/include/linux/wfcqueue.h
> +++ b/include/linux/wfcqueue.h
> @@ -205,6 +205,49 @@ static inline bool wfcq_enqueue(struct wfcq_head *head,
>  }
>  
>  /*
> + * __wfcq_append_local: append one local queue to another local queue
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + *
> + * Returns false if the queue was empty prior to adding the node.
> + * Returns true otherwise.
> + */
> +static inline bool __wfcq_append_local(struct wfcq_head *head,

Following the rest of the header, we could use:

___wfcq_append() for this function,

> + struct wfcq_tail *tail,
> + struct wfcq_node *new_head,
> + struct wfcq_node *new_tail)
> +{
> + struct wfcq_node *old_tail;
> +
> + old_tail = tail->p;
> + tail->p = new_tail;
> + old_tail->next = new_head;
> +
> + /*
> +  * Return false if queue was empty prior to adding the node,
> +  * else return true.
> +  */
> + return old_tail != &head->node;
> +}
> +
> +/*
> + * wfcq_enqueue_local: enqueue a node into a local wait-free queue
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + *
> + * Returns false if the queue was empty prior to adding the node.
> + * Returns true otherwise.
> + */
> +static inline bool wfcq_enqueue_local(struct wfcq_head *head,

and:

__wfcq_enqueue()

we should also update the "Synchronization table" at the beginning of
the file accordingly.

Thoughts ?

Thanks,

Mathieu

> + struct wfcq_tail *tail,
> + struct wfcq_node *new_tail)
> +{
> + return __wfcq_append_local(head, tail, new_tail, new_tail);
> +}
> +
> +/*
>   * ___wfcq_busy_wait: busy-wait.
>   */
>  static inline void ___wfcq_busy_wait(void)
> -- 
> 1.8.2.rc3.2.g89ce8d6
> 
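
A simplified userspace sketch of the local append/enqueue logic in the quoted patch. The struct layouts and the *_sketch names are assumptions made for illustration; the real definitions live in include/linux/wfcqueue.h. No barriers or atomics are used, so the caller is assumed to provide mutual exclusion, as the patch comments state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the wfcqueue types (assumed layout). */
struct wfcq_node { struct wfcq_node *next; };
struct wfcq_head { struct wfcq_node node; };	/* dummy first node */
struct wfcq_tail { struct wfcq_node *p; };	/* points at the last node */

void wfcq_init_sketch(struct wfcq_head *head, struct wfcq_tail *tail)
{
	head->node.next = NULL;
	tail->p = &head->node;	/* empty queue: tail points at the dummy */
}

/* Append the chain [new_head, new_tail]; no barriers are issued. */
bool append_local_sketch(struct wfcq_head *head, struct wfcq_tail *tail,
			 struct wfcq_node *new_head,
			 struct wfcq_node *new_tail)
{
	struct wfcq_node *old_tail = tail->p;

	tail->p = new_tail;
	old_tail->next = new_head;

	/* false if the queue was empty before the append, true otherwise */
	return old_tail != &head->node;
}

bool enqueue_local_sketch(struct wfcq_head *head, struct wfcq_tail *tail,
			  struct wfcq_node *node)
{
	/*
	 * Node initialisation is done here only to keep the sketch
	 * self-contained; it is not part of the quoted patch.
	 */
	node->next = NULL;
	return append_local_sketch(head, tail, node, node);
}
```

Enqueueing a single node is just an append of a one-element chain, which is why the patch implements one in terms of the other.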

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Tux3 Report: Initial fsck has landed

2013-03-21 Thread Dave Chinner
On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
> On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley  wrote:
> > I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then
> > nothing. Is this project dead or not?
> 
> Not. We haven't done much about updating tux3.org lately, however you
> will find plenty of activity here:
> 
>  https://github.com/OGAWAHirofumi/tux3/tree/master/user
> 
> You will also find fairly comprehensive updates on where we are and
> where this is going, here:
> 
>  http://phunq.net/pipermail/tux3/
> 
> At the moment we're being pretty quiet because of being in the middle
> of developing the next-gen directory index. Not such a small task, as
> you might imagine.

Hi Daniel,

The "next-gen directory index" comment made me curious. I wanted to
know if there's anything I could learn from what you are doing and
whether anything of your new algorithms could be applied to, say,
the XFS directory structure to improve it.

I went looking for design docs and found this:

http://phunq.net/pipermail/tux3/2013-January/001938.html

In a word: Disappointment.

Compared to the XFS directory structure, the most striking
architectural similarity that I see is this:

"the file bteee[sic] effectively is a second directory index
that imposes a stable ordering on directory blocks".

That was the key architectural innovation in the XFS directory
structure that allowed it to provide the correct seekdir/telldir/NFS
readdir semantics and still scale. i.e. virtually mapped directory
entries. I explained this layout recently here:

http://marc.info/?l=linux-ext4&m=136081996316453&w=2
http://marc.info/?l=linux-ext4&m=136082221117399&w=2
http://marc.info/?l=linux-ext4&m=136089526928538&w=2

We could swap the relevant portions of your PHTree design doc with
my comments (and vice versa) and both sets of references would still
make perfect sense. :P

Further, the PHTree description of tag based freespace tracking is
rather close to how XFS uses tags to track free space regions,
including the fact that XFS can be lazy at updating global free
space indexes.  The global freespace tree indexing is slightly
different to the XFS method - it's closer to the original V1 dir
code in XFS (that didn't scale at all well) than the current code.
However, that's really a fine detail compared to all the major
structural and algorithmic similarities.

Hence it appears to me that at a fundamental level PHTree is just a
re-implementation of the XFS directory architecture. It's definitely
a *major* step forward from HTree, but it can hardly be considered
revolutionary or "next-gen". It's not even state of the art. Hence:
disappointment.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH] hwrng: exynos - add CONFIG_PM_RUNTIME to suspend/resume functions

2013-03-21 Thread Jingoo Han
Add CONFIG_PM_RUNTIME to the #ifdef guarding the suspend/resume functions
to fix the build error below. The UNIVERSAL_DEV_PM_OPS macro references
these functions when either CONFIG_PM_SLEEP or CONFIG_PM_RUNTIME is enabled.

drivers/char/hw_random/exynos-rng.c:167:8: error: 'exynos_rng_runtime_suspend' undeclared here (not in a function)
drivers/char/hw_random/exynos-rng.c:167:8: error: 'exynos_rng_runtime_resume' undeclared here (not in a function)

Signed-off-by: Jingoo Han 
Reported-by: David Rientjes 
Cc: Herbert Xu 
---
 drivers/char/hw_random/exynos-rng.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/char/hw_random/exynos-rng.c b/drivers/char/hw_random/exynos-rng.c
index b7e48a2..402ccfb 100644
--- a/drivers/char/hw_random/exynos-rng.c
+++ b/drivers/char/hw_random/exynos-rng.c
@@ -144,7 +144,7 @@ static int exynos_rng_remove(struct platform_device *pdev)
return 0;
 }
 
-#ifdef CONFIG_PM_SLEEP
+#if defined(CONFIG_PM_SLEEP) || defined(CONFIG_PM_RUNTIME)
 static int exynos_rng_runtime_suspend(struct device *dev)
 {
struct platform_device *pdev = to_platform_device(dev);
-- 
1.7.2.5
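
A small userspace sketch of why the guard has to cover both options. All DEMO_* names are hypothetical; they only mimic how a UNIVERSAL_DEV_PM_OPS-style macro references the same callback from both a sleep slot and a runtime slot, so the callback must exist whenever either switch is on.

```c
#include <stddef.h>

/* Hypothetical config switches; only "runtime" is enabled here. */
#define DEMO_PM_RUNTIME 1
/* #define DEMO_PM_SLEEP 1 */

struct demo_pm_ops {
	int (*suspend)(void);
	int (*runtime_suspend)(void);
};

#ifdef DEMO_PM_SLEEP
#define DEMO_SET_SLEEP_OPS(fn)		.suspend = fn,
#else
#define DEMO_SET_SLEEP_OPS(fn)
#endif

#ifdef DEMO_PM_RUNTIME
#define DEMO_SET_RUNTIME_OPS(fn)	.runtime_suspend = fn,
#else
#define DEMO_SET_RUNTIME_OPS(fn)
#endif

/* References fn if either switch is on, like the universal macro. */
#define DEMO_UNIVERSAL_PM_OPS(name, fn)		\
	const struct demo_pm_ops name = {	\
		DEMO_SET_SLEEP_OPS(fn)		\
		DEMO_SET_RUNTIME_OPS(fn)	\
	}

/*
 * Guarding this with DEMO_PM_SLEEP alone would leave demo_suspend
 * undeclared in the runtime-only configuration selected above.
 */
#if defined(DEMO_PM_SLEEP) || defined(DEMO_PM_RUNTIME)
int demo_suspend(void)
{
	return 0;
}
#endif

DEMO_UNIVERSAL_PM_OPS(demo_pm, demo_suspend);
```

Changing the guard back to a plain `#ifdef DEMO_PM_SLEEP` in this sketch reproduces the same kind of "undeclared here" error as in the report.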




[PATCH v3 2/2] ACPI,acpi_memhotplug: Remove acpi_memory_info->failed bit

2013-03-21 Thread Yasuaki Ishimatsu

acpi_memory_info has an enabled bit and a failed bit for controlling memory
hotplug, but we don't need to keep both bits.

The patch removes acpi_memory_info->failed bit.

Signed-off-by: yasuaki ishimatsu 
---

v3 : Continue memory hot remove in the (!info->enabled) case.
v2 : Changed the base kernel from linux-3.9-rc2 to linux-pm.git/bleeding-edge.

---
 drivers/acpi/acpi_memhotplug.c |   15 ++-
 1 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index ea78988..5e6301e 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -73,7 +73,6 @@ struct acpi_memory_info {
 	unsigned short caching;	/* memory cache attribute */
 	unsigned short write_protect;	/* memory read/write attribute */
 	unsigned int enabled:1;
-	unsigned int failed:1;
 };
 
 struct acpi_memory_device {
@@ -201,10 +200,8 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		 * returns -EEXIST. If add_memory() returns the other error, it
 		 * means that this memory block is not used by the kernel.
 		 */
-		if (result && result != -EEXIST) {
-			info->failed = 1;
+		if (result && result != -EEXIST)
 			continue;
-		}
 
 		info->enabled = 1;
 
@@ -238,16 +235,8 @@ static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 	nid = acpi_get_node(mem_device->device->handle);
 
 	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
-		if (info->failed)
-			/* The kernel does not use this memory block */
-			continue;
-
 		if (!info->enabled)
-			/*
-			 * The kernel uses this memory block, but it may be not
-			 * managed by us.
-			 */
-			return -EBUSY;
+			continue;
 
 		if (nid < 0)
 			nid = memory_add_physaddr_to_nid(info->start_addr);
-- 
1.7.5.1
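
A userspace sketch of the resulting control flow, assuming simplified hypothetical types (struct mem_info_sketch and the *_sketch functions are not the real ACPI code). It shows that once the failed bit is gone, "not enabled" alone is enough to decide that a block is skipped on removal.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified per-block record. */
struct mem_info_sketch {
	int add_result;		/* simulated add_memory() return value */
	unsigned int enabled:1;
	unsigned int removed:1;
};

/* Returns the number of blocks that ended up enabled. */
int enable_sketch(struct mem_info_sketch *info, size_t n)
{
	int num_enabled = 0;

	for (size_t i = 0; i < n; i++) {
		int result = info[i].add_result;

		/* Any error other than -EEXIST: block stays !enabled. */
		if (result && result != -EEXIST)
			continue;

		info[i].enabled = 1;
		num_enabled++;
	}
	return num_enabled;
}

/* Returns the number of blocks that were removed. */
int remove_sketch(struct mem_info_sketch *info, size_t n)
{
	int num_removed = 0;

	for (size_t i = 0; i < n; i++) {
		/* Without a failed bit, !enabled simply means "skip". */
		if (!info[i].enabled)
			continue;

		info[i].removed = 1;
		num_removed++;
	}
	return num_removed;
}
```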



Re: VFS deadlock ?

2013-03-21 Thread Al Viro
On Thu, Mar 21, 2013 at 06:33:35PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 6:22 PM, Al Viro  wrote:
> >
> > In theory, we can make vfs_rmdir() and vfs_unlink() check the presence of
> > the corresponding method before locking the victim; that would suffice to
> > kludge around that mess on procfs.  Along with ->d_inode comparison in
> > lock_rename() it *might* suffice.
> 
> Hmm, yes. Maybe we can do that as a stopgap, backport that, and leave
> any bigger changes for the development tree. That would make the issue
> less urgent, never mind all the other worries about backporting
> complicated patches for subtle issues.
> 
> I realize you aren't entirely thrilled about it, but we actually
> already seem to do that check in both vfs_rmdir() and vfs_unlink()
> before getting the child i_mutex.  I wonder if that is because we've
> already seen lockdep splats for this case...

Yeah, I went to do such a patch after sending the previous mail and noticed
that we already did it that way.  Simplicity of error recovery was probably
the more important consideration there - I honestly don't remember the reasoning
in such detail; it had been a decade or so...  So lock_rename() doing
->d_inode comparison (with dire comment re not expecting that to be sufficient
for anything other than this bug in procfs) will probably suffice for fs/namei.c
part of it; I'm still looking at dcache.c side of things...
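
A rough userspace sketch of the ->d_inode comparison mentioned above, under heavy simplification. The types are hypothetical, a counter stands in for i_mutex, and ordering two distinct inodes by address is only a placeholder for the ordering rules the real lock_rename() uses. The point is that comparing inodes rather than dentries catches two distinct dentries that alias one inode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; a counter replaces the real mutex. */
struct inode_sketch {
	int lock_depth;	/* >1 would be a self-deadlock on a real mutex */
};

struct dentry_sketch {
	struct inode_sketch *d_inode;
};

void lock_inode_sketch(struct inode_sketch *inode)
{
	inode->lock_depth++;
}

/* Returns the number of locks taken (1 if both share one inode). */
int lock_rename_sketch(struct dentry_sketch *p1, struct dentry_sketch *p2)
{
	struct inode_sketch *a = p1->d_inode;
	struct inode_sketch *b = p2->d_inode;

	/* Same inode behind both dentries: lock it exactly once. */
	if (a == b) {
		lock_inode_sketch(a);
		return 1;
	}

	/* Placeholder ordering by address, for the sketch only. */
	if ((uintptr_t)a > (uintptr_t)b) {
		struct inode_sketch *tmp = a;

		a = b;
		b = tmp;
	}
	lock_inode_sketch(a);
	lock_inode_sketch(b);
	return 2;
}
```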


Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue

2013-03-21 Thread Arve Hjønnevåg
On Thu, Mar 21, 2013 at 4:52 AM, Eric Wong  wrote:
> This is still not a proper commit, I've lightly tested this.
>
> Replace the spinlock-protected linked list ready list with wfcqueue.
>
> This improves performance under heavy, multi-threaded workloads with
> multiple threads calling epoll_wait.
>
> Using my multi-threaded EPOLLONESHOT microbenchmark, performance is
> nearly doubled:
>
> $ eponeshotmt -t 4 -w 4 -f 10 -c 100
>
> Before:
> real    0m 9.58s
> user    0m 1.22s
> sys     0m 37.08s
>
> After:
> real    0m 5.00s
> user    0m 1.07s
> sys     0m 18.92s
>
> ref: http://yhbt.net/eponeshotmt.c
>
> Unfortunately, there are still regressions for the common,
> single threaded, Level Trigger use case.
>
> Things changed/removed:
>
> * ep->ovflist - is no longer needed, the main ready list continues
>   to be appended to while we iterate through the transaction list.
>
> * ep_scan_ready_list - not enough generic code between users
>   anymore to warrant this.  ep_poll_readyevents_proc (used for
>   poll) is read-only, using __wfcq_for_each, while
>   ep_send_events (used for epoll_wait) dequeues and needs
>   __wfcq_for_each_safe.
>
> * ep->lock renamed to ep->wqlock; this only protects waitqueues now
>   (we use trylock to further avoid redundant wakeups)
>
> * EPOLL_CTL_DEL/close() on a ready file will not immediately release
>   epitem memory, epoll_wait() must be called since there's no way to
>   delete a ready item from wfcqueue in O(1) time.  In practice this
>   should not be a problem, any valid app using epoll must call
>   epoll_wait occasionally.  Unfreed epitems still count against
>   max_user_watches to protect against local DoS.  This should be the
>   only possibly-noticeable change (in case there's an app that blindly
>   adds/deletes things from the rbtree but never calls epoll_wait)
>
> Changes since v1:
> * fixed memory leak with pre-existing zombies in ep_free
> * updated to use the latest wfcqueue.h APIs
> * switched to using __wfcq_splice and a global transaction list
>   (this is like the old txlist in ep_scan_ready_list)
> * redundant wakeups avoided in ep_notify_waiters:
>   - only attempt a wakeup when an item is enqueued the first time
>   - use spin_trylock_irqsave when attempting notification, since a
> failed lock means either another task is already waking, or
> ep_poll is already running and will check anyways upon releasing
> wqlock, anyways.
> * explicitly cache-aligned rdltail in SMP
> * added ep_item_state for atomically reading epi->state with barrier
>   (avoids WARN_ON in ep_send_events)
> * reverted epi->nwait removal, it was not necessary
>   sizeof(epitem) is still <= 128 bytes on 64-bit machines
>
> Changes since v2:
> * epi->state is no longer atomic, we only cmpxchg in ep_poll_callback
>   now and rely on implicit barriers in other places for reading.
> * intermediate EP_STATE_DEQUEUE removed, this (with xchg) caused too
>   much overhead in the ep_send_events loop and could not eliminate
>   starvation dangers from improper EPOLLET usage (the original code
>   had this problem, too, the window is just a few cycles larger, now).
> * minor code cleanups
>
> Lightly-tested-by: Eric Wong 
> Cc: Davide Libenzi 
> Cc: Al Viro 
> Cc: Andrew Morton 
> Cc: Mathieu Desnoyers 
> ---
>
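
A minimal sketch of the "enqueue only once" idea from the changelog above, using C11 atomics in userspace. The state names and helpers are hypothetical and are not taken from the patch. Only the caller that wins the compare-and-exchange would go on to enqueue the item and attempt a wakeup.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state values for the sketch. */
enum { ST_IDLE_SKETCH, ST_READY_SKETCH };

struct item_sketch {
	atomic_int state;
};

/*
 * Returns true only for the caller that flips IDLE -> READY.
 * Later callers see READY and skip the enqueue and the wakeup.
 */
bool mark_ready_once_sketch(struct item_sketch *it)
{
	int expected = ST_IDLE_SKETCH;

	return atomic_compare_exchange_strong(&it->state, &expected,
					      ST_READY_SKETCH);
}

/* The consumer resets the state after dequeueing the item. */
void mark_idle_sketch(struct item_sketch *it)
{
	atomic_store(&it->state, ST_IDLE_SKETCH);
}
```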
...
> @@ -967,8 +951,6 @@ static struct epitem *ep_find(struct eventpoll *ep, 
> struct file *file, int fd)
>   */
>  static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, 
> void *key)
>  {
> -   int pwake = 0;
> -   unsigned long flags;
> struct epitem *epi = ep_item_from_wait(wait);
> struct eventpoll *ep = epi->ep;
>
> @@ -983,7 +965,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned 
> mode, int sync, void *k
> list_del_init(&wait->task_list);
> }
>
> -   spin_lock_irqsave(&ep->lock, flags);
> +   /* pairs with smp_mb in ep_modify */
> +   smp_rmb();
>
> /*
>  * If the event mask does not contain any poll(2) event, we consider 
> the
> @@ -992,7 +975,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned 
> mode, int sync, void *k
>  * until the next EPOLL_CTL_MOD will be issued.
>  */
> if (!(epi->event.events & ~EP_PRIVATE_BITS))
> -   goto out_unlock;
> +   return 1;
>
> /*
>  * Check the events coming with the callback. At this stage, not
> @@ -1001,52 +984,14 @@ static int ep_poll_callback(wait_queue_t *wait, 
> unsigned mode, int sync, void *k
>  * test for "key" != NULL before the event match test.
>  */
> if (key && !((unsigned long) key & epi->event.events))
> -   goto out_unlock;
> -
> -   /*
> -* If we are transferring events to userspace, we can hold no locks
> -* (because we're accessing user memory, and because of linux 
> f_op->poll()
> -* semantics). All the events that happen during that period of time 
> are
> -* 

Re: Status of union-mount?

2013-03-21 Thread David Howells
Sedat Dilek  wrote:

> Hmmm, sorry for asking, but when do you plan to offer a "working"
> union-mount (u-m)?

It's a maze of twisty locking problems - some of which also apply to things
like overlayfs:-(

> What's the status of the user-space tools or are they no more needed?

You need to be able to tell mount(2) that you want a union.  This is currently
done with a mount flag, but it might be portable to something in the mount
option string.

> AFAICS the original authors patched e2fsprogs etc. (see Valerie's old
> homepage [1]).

Yeah... I guess fsck programs need to be able to handle whiteout and fallthru
directory entries.

> >> Where does the development happen - in [1]?
> >
> > On a git tree on my PC - which is occasionally mirrored in [1] when I've got
> > it working.
> >
> 
> Development on your local workstation does not look like you do an
> open development.

Excuse me.  But it's quite hard to develop this on a remote git tree.
Further, I prefer not to push partially working stuff to my git tree, lest
someone pull it, try playing with it and have their fs eaten.

If someone wants it, I can mail the partially working stuff to them, but not
many people ask.

> So, it's currently only you doing the work on u-m?

Almost entirely, yes.

David


Re: [PATCH 2/2] ACPI,acpi_memhotplug: Remove acpi_memory_info->failed bit

2013-03-21 Thread Yasuaki Ishimatsu

Hi Toshi,

2013/03/22 9:29, Toshi Kani wrote:

On Thu, 2013-03-21 at 13:39 +0900, Yasuaki Ishimatsu wrote:

acpi_memory_info has enabled bit and failed bit for controlling memory
hotplug. But we don't need to keep both bits.

The patch removes acpi_memory_info->failed bit.

Signed-off-by: yasuaki ishimatsu 
---

v2 : Changed a based kernel from linux-3.9-rc2 to linux-pm.git/bleeding-edge.

---
   drivers/acpi/acpi_memhotplug.c |   13 +
   1 files changed, 1 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index ea78988..597cd65 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -73,7 +73,6 @@ struct acpi_memory_info {
unsigned short caching; /* memory cache attribute */
unsigned short write_protect;   /* memory read/write attribute */
unsigned int enabled:1;
-   unsigned int failed:1;
   };

   struct acpi_memory_device {
@@ -201,10 +200,8 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
 * returns -EEXIST. If add_memory() returns the other error, it
 * means that this memory block is not used by the kernel.
 */
-   if (result && result != -EEXIST) {
-   info->failed = 1;
+   if (result && result != -EEXIST)
continue;
-   }

info->enabled = 1;

@@ -238,15 +235,7 @@ static int acpi_memory_remove_memory(struct 
acpi_memory_device *mem_device)
nid = acpi_get_node(mem_device->device->handle);

list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
-   if (info->failed)
-   /* The kernel does not use this memory block */
-   continue;
-
if (!info->enabled)
-   /*
-* The kernel uses this memory block, but it may be not
-* managed by us.
-*/
return -EBUSY;


Shouldn't this case (!info->enabled) continue since it is the same as
info->failed before?  -EBUSY was previously used for the -EEXIST case,
which is no longer a failure case with this patchset.


You are right. It is my mistake. We need to continue to hot remove memory.
I'll update soon.

Thanks,
Yasuaki Ishimatsu



Thanks,
-Toshi








Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 6:22 PM, Al Viro  wrote:
>
> In theory, we can make vfs_rmdir() and vfs_unlink() check the presence of
> the corresponding method before locking the victim; that would suffice to
> kludge around that mess on procfs.  Along with ->d_inode comparison in
> lock_rename() it *might* suffice.

Hmm, yes. Maybe we can do that as a stopgap, backport that, and leave
any bigger changes for the development tree. That would make the issue
less urgent, never mind all the other worries about backporting
complicated patches for subtle issues.

I realize you aren't entirely thrilled about it, but we actually
already seem to do that check in both vfs_rmdir() and vfs_unlink()
before getting the child i_mutex.  I wonder if that is because we've
already seen lockdep splats for this case...
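
A userspace sketch of the ordering described above: check that the directory provides the method first, and only then take the victim's lock. The types and names are hypothetical, and a counter stands in for the child i_mutex.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the directory's operations and the victim. */
struct dir_ops_sketch {
	int (*rmdir)(void);
};

struct victim_sketch {
	int locks_taken;	/* counts acquisitions of the victim's lock */
};

int demo_rmdir(void)
{
	return 0;
}

int rmdir_sketch(const struct dir_ops_sketch *ops,
		 struct victim_sketch *victim)
{
	int err;

	/* No method: fail before the victim's lock is ever touched. */
	if (!ops->rmdir)
		return -EPERM;

	victim->locks_taken++;	/* stands in for locking the victim */
	err = ops->rmdir();
	/* the unlock would happen here in real code */
	return err;
}
```

With this ordering, a filesystem that does not implement the operation never causes the victim's lock to be taken at all.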

 Linus


Re: [patch v5 14/15] sched: power aware load balance

2013-03-21 Thread Alex Shi
On 03/21/2013 06:27 PM, Preeti U Murthy wrote:
>> > did you close all of background system services?
>> > In theory the rq->avg.runnable_avg_sum should be zero if there is no
>> > task a bit long, otherwise there are some bugs in kernel.
> Could you explain why rq->avg.runnable_avg_sum should be zero? What if
> some kernel thread ran on this run queue and is now finished? Its
> utilisation would be, say, x. How would that ever drop to 0, even if nothing
> ran on it later?

The value comes from decay_load():
 sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
and decay_load() can set it to zero.

/proc/sched_debug also confirms this:

  .tg_runnable_contrib   : 0
  .tg->runnable_avg  : 50
  .avg->runnable_avg_sum : 0
  .avg->runnable_avg_period  : 47507


-- 
Thanks Alex


[tip:x86/urgent] x86, microcode_intel_early: Mark apply_microcode_early() as cpuinit

2013-03-21 Thread tip-bot for H. Peter Anvin
Commit-ID:  f564c24103f87dc740c1c293c975565ac46b12ef
Gitweb: http://git.kernel.org/tip/f564c24103f87dc740c1c293c975565ac46b12ef
Author: H. Peter Anvin 
AuthorDate: Thu, 21 Mar 2013 17:32:36 -0700
Committer:  H. Peter Anvin 
CommitDate: Thu, 21 Mar 2013 17:32:36 -0700

x86, microcode_intel_early: Mark apply_microcode_early() as cpuinit

Add missing __cpuinit annotation to apply_microcode_early().

Reported-by: Shaun Ruffell 
Cc: Fenghua Yu 
Link: http://lkml.kernel.org/r/20130320170310.ga23...@digium.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/kernel/microcode_intel_early.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index 5992ee8..d893e8e 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -659,8 +659,8 @@ static inline void __cpuinit print_ucode(struct ucode_cpu_info *uci)
 }
 #endif
 
-static int apply_microcode_early(struct mc_saved_data *mc_saved_data,
-struct ucode_cpu_info *uci)
+static int __cpuinit apply_microcode_early(struct mc_saved_data *mc_saved_data,
+  struct ucode_cpu_info *uci)
 {
struct microcode_intel *mc_intel;
unsigned int val[2];


Re: [PATCH 2/3] PCI: Handle device quirks when accessing sysfs resource entries

2013-03-21 Thread Greg KH
On Thu, Mar 21, 2013 at 06:51:31PM -0600, Robert Hancock wrote:
> On 03/20/2013 10:35 PM, Myron Stowe wrote:
> >Sysfs includes entries to memory regions that back a PCI device's BARs.
> >The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
> >providing direct access to the device's registers.  File permissions
> >prevent random users from accessing the device's registers through these
> >files, but don't stop a privileged app that chooses to ignore the purpose
> >of these files from doing so.
> >
> >There are devices with abnormally strict restrictions with respect to
> >accessing their registers; aspects that are typically handled by the
> >device's driver.  When these access restrictions are not followed - as
> >when a userspace app such as "udevadm info --attribute-walk
> >--path=/sys/..." parses through reading all the device's sysfs entries - it
> >can cause such devices to fail.
> >
> >This patch introduces a quirking mechanism that can be used to detect
> >accesses that do not meet the device's restrictions, letting a device
> >specific method intervene and decide how to progress.
> >
> >Reported-by: Xiangliang Yu 
> >Signed-off-by: Myron Stowe 
> 
> I honestly don't think there's much point in even attempting this
> strategy. This list of devices in the quirk can't possibly be
> complete. It would likely be easier to enumerate a white-list of
> devices that can deal with their IO ports being read willy-nilly
> than a blacklist of those that don't, as there's likely countless
> devices that fall into this category. Even if they don't choke as
> badly as these ones do, it's quite likely that bad behavior will
> result.
> 
> I think there's a few things that need to be done:
> 
> -Fix the bug in udevadm that caused it to trawl through these files
> willy-nilly,

There's no "bug" in udevadm, the user explicitly asked for it to read
all of those files.  Just like grep or bash could be used to ask to read
those files.

If the kernel is going to provide files to userspace, the kernel can't
suddenly get upset if userspace actually reads those files.

Fix the kernel here please.

> -Fix the kernel so that access through these files complies with the
> kernel's mechanisms for claiming IO/memory regions to prevent access
> conflicts (i.e. opening these files should claim the resource region
> they refer to, and should fail with EBUSY or something if another
> process or a kernel driver is using it).

Yes, this is a good solution.
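
A userspace sketch of the "claim on open, fail with EBUSY on conflict" behaviour proposed above. The table, its size, and the function names are hypothetical; the real kernel has its own resource-claiming mechanisms, which this does not attempt to reproduce.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLAIMS_SKETCH 8

/* Hypothetical record of one claimed [start, end] range. */
struct claim_sketch {
	unsigned long start, end;
	bool used;
};

struct claim_table_sketch {
	struct claim_sketch slots[MAX_CLAIMS_SKETCH];
};

/*
 * Claim [start, end] exclusively. Returns a slot index on success,
 * -EBUSY on overlap with an existing claim, -ENOSPC if the table is full.
 */
int claim_region_sketch(struct claim_table_sketch *t,
			unsigned long start, unsigned long end)
{
	int free_slot = -1;

	for (int i = 0; i < MAX_CLAIMS_SKETCH; i++) {
		struct claim_sketch *c = &t->slots[i];

		if (!c->used) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		/* Two closed ranges overlap if each starts before the other ends. */
		if (start <= c->end && c->start <= end)
			return -EBUSY;
	}
	if (free_slot < 0)
		return -ENOSPC;

	t->slots[free_slot] = (struct claim_sketch){ start, end, true };
	return free_slot;
}

void release_region_sketch(struct claim_table_sketch *t, int slot)
{
	t->slots[slot].used = false;
}
```

In this model a driver would hold a claim for as long as it is bound, so opening the corresponding sysfs file would fail instead of touching the device behind the driver's back.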

> -Reconsider whether supporting read/write on the resource files for
> IO port regions like these makes any sense. Obviously mmap isn't
> very practical for IO port access on x86 but you could even do
> something like an ioctl for this purpose. Not very many pieces of
> software would need to access these files so it's likely OK if the
> API is a bit ugly. That would prevent something like grepping
> through sysfs from generating port accesses to random devices.

Also a good solution.

greg k-h


Re: [PATCH 0/2] vfs: Report a mount r/o if the superblock is

2013-03-21 Thread Shea Levy
Any feedback on this?

On Mar 14, 2013, at 12:09 PM, Shea Levy  wrote:

> By calling mount(2) with MS_REMOUNT | MS_BIND on a non-bind readonly
> mountpoint, it is possible to have a readonly mount without MNT_READONLY
> in its mnt_flags. Currently, /proc//mountinfo and statfs will
> report such a mount as r/w, even though for all intents and purposes it
> is still readonly.
> 
> This patchset fixes show_mountinfo and statfs to report such mounts as
> readonly.
> 
> Signed-off-by: Shea Levy 
> 
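
A minimal sketch of the reporting rule the cover letter above describes, assuming the intent is "read-only if either the mount or its superblock is read-only". The flag names and values (TOY_MNT_READONLY, TOY_SB_RDONLY) and the helper are hypothetical and are not the kernel's definitions or the patch's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration only; not the kernel's. */
#define TOY_MNT_READONLY 0x1u   /* per-mount read-only flag */
#define TOY_SB_RDONLY    0x1u   /* per-superblock read-only flag */

/*
 * A mount is effectively read-only if either the mount itself or the
 * superblock behind it is read-only.
 */
static bool toy_mount_is_readonly(unsigned mnt_flags, unsigned sb_flags)
{
	return (mnt_flags & TOY_MNT_READONLY) || (sb_flags & TOY_SB_RDONLY);
}
```

The case from the cover letter is the one where the mount flag is clear but the superblock flag is set; the helper reports that as read-only.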


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso  wrote:
>
> ipc lock contention:
> 100 users:  8.74%  (vanilla)    3.17% (v3 patchset)
> 400 users:  21.86% (vanilla)    5.23% (v3 patchset)
> 800 users:  84.35% (vanilla)    7.39% (v3 patchset)

Ok, I'd call that pretty much "solved". Sure, it's still visible, but
for being a benchmark that apparently does little else than pound on
those sysv semaphores, I think we can consider it pretty much fine.
I'm going to assume that anybody who actually then does any real work
(ie a database) is never going to see even close to this bad
contention.

Good job, Rik. I'm assuming we'll be merging this during the 3.10
merge window, and hopefully the merge conflicts will be sorted out
too. Rik, Peter, can you look at each other's patches and see if you
can get that sorted out for Andrew?

Linus


Re: VFS deadlock ?

2013-03-21 Thread Al Viro
On Thu, Mar 21, 2013 at 05:22:59PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 5:12 PM, Al Viro  wrote:
> >
> > What we should do, IMO, is to turn /proc//net into a honest symlink -
> > to ../nets//net.  Hell, might even make it a magical symlink
> > instead...
> 
> Ok, having seen the error of my ways, I'm starting to agree with you..
>  How painful would that be? Especially since we'd need to backport
> it..

Not sure; right now I'm looking through the guts of what procfs had become.
Unfortunately, there are fairly subtle interactions with other shit -
tomoyo, etc.  Sigh...

BTW, the variant with d_ancestor() modification is also not enough -
/proc/1/net and /proc/2/net have different inodes, so for the pair
(/proc/net/1, /proc/2/net/stat) d_ancestor() won't trigger
even with this change.  And we have /proc/net/1 < /proc/net/1/stat,
since the latter is a subdirectory of the former.  With /proc/net/{1,2}/stat
having the same inode...

In theory, we can make vfs_rmdir() and vfs_unlink() check the presence of
the corresponding method before locking the victim; that would suffice to
kludge around that mess on procfs.  Along with ->d_inode comparison in
lock_rename() it *might* suffice.  OTOH, there are places in fs/dcache.c
where we rely on the lack of such aliases; they might or might not trigger
in case of procfs.

We are talking about the violation of fundamental assert used in
correctness analysis all over the place, unfortunately.  The right fix
is to restore it; I'll try to come up with something that could be
reasonably easily backported - the kludge above is a fallback in case if
no real fix turns out to be easy to backport.  Assuming that this kludge
is sufficient, that is...  For 3.9 and later we *definitely* want to
restore that assertion.

PS: Once more, with feeling, to everyone even thinking of pulling something
like that again:
Hardlinks to directories do not work.  Don't do that, or we'll be
sorry, and then so will you.
A Very Peeved BOFH


Re: [PATCH] memcg: fix memcg_cache_name() to use cgroup_name()

2013-03-21 Thread Li Zefan
On 2013/3/21 18:22, Michal Hocko wrote:
> On Thu 21-03-13 10:08:49, Michal Hocko wrote:
>> On Thu 21-03-13 09:22:21, Li Zefan wrote:
>>> As cgroup supports rename, it's unsafe to dereference dentry->d_name
>>> without proper vfs locks. Fix this by using cgroup_name().
>>>
>>> Signed-off-by: Li Zefan 
>>> ---
>>>
>>> This patch depends on "cgroup: fix cgroup_path() vs rename() race",
>>> which has been queued for 3.10.
>>>
>>> ---
>>>  mm/memcontrol.c | 15 +++
>>>  1 file changed, 7 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 53b8201..72be5c9 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3217,17 +3217,16 @@ void mem_cgroup_destroy_cache(struct kmem_cache 
>>> *cachep)
>>>  static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache 
>>> *s)
>>>  {
>>> char *name;
>>> -   struct dentry *dentry;
>>> +
>>> +   name = (char *)__get_free_page(GFP_TEMPORARY);
>>
>> Ouch. Can we use a static temporary buffer instead?
> 
>> This is called from workqueue context so we do not have to be afraid
>> of the deep call chain.
> 
> Bahh, I was thinking about two things at the same time and that is how
> it ends... I meant a temporary buffer on the stack. But a separate
> allocation sounds even easier.
> 

Actually I don't care much which way we take. Using an on-stack buffer (if
stack usage is not a concern) or a local static buffer (the caller already
holds memcg_cache_mutex) is simplest.

But why is it bad to allocate a page for temporary use?

>> It is also not a hot path AFAICS.
>>
>> Even GFP_ATOMIC for kasprintf would be an improvement IMO.
> 
> What about the following (not even compile tested because I do not have
> cgroup_name in my tree yet):

No, it won't compile. ;)

> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f608546..ede0382 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3370,13 +3370,18 @@ static char *memcg_cache_name(struct mem_cgroup 
> *memcg, struct kmem_cache *s)
>   struct dentry *dentry;
>  
>   rcu_read_lock();
> - dentry = rcu_dereference(memcg->css.cgroup->dentry);
> + name = kasprintf(GFP_ATOMIC, "%s(%d:%s)", s->name,
> +  memcg_cache_id(memcg), cgroup_name(memcg->css.cgroup));
>   rcu_read_unlock();
>  
> - BUG_ON(dentry == NULL);
> -
> - name = kasprintf(GFP_KERNEL, "%s(%d:%s)", s->name,
> -  memcg_cache_id(memcg), dentry->d_name.name);
> + if (!name) {
> + name = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + rcu_read_lock();
> + name = snprintf(name, PAGE_SIZE, "%s(%d:%s)", s->name,
> + memcg_cache_id(memcg),
> + cgroup_name(memcg->css.cgroup));
> + rcu_read_unlock();
> + }
>  
>   return name;
>  }
> 
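
A minimal user-space sketch of the "%s(%d:%s)" naming scheme discussed above. It is not the kernel code: snprintf() plus malloc() stand in for kasprintf(), and the helper name toy_cache_name() and its arguments are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "<cache>(<id>:<group>)" in a freshly allocated buffer.
 * The caller frees the result; NULL is returned on failure.
 */
static char *toy_cache_name(const char *cache, int id, const char *group)
{
	/* First pass: ask snprintf how many bytes the formatted name needs. */
	int len = snprintf(NULL, 0, "%s(%d:%s)", cache, id, group);
	char *name;

	if (len < 0)
		return NULL;

	name = malloc((size_t)len + 1);
	if (!name)
		return NULL;

	/* Second pass: format into the exactly sized buffer. */
	snprintf(name, (size_t)len + 1, "%s(%d:%s)", cache, id, group);
	return name;
}
```

Sizing the buffer from the format string avoids both a whole-page allocation and a fixed-size stack buffer, which is the trade-off being debated in the thread.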



Re: [PATCH 7/7] ipc,sem: fine grained locking for semtimedop

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Introduce finer grained locking for semtimedop, to handle the
> common case of a program wanting to manipulate one semaphore
> from an array with multiple semaphores.
> 
> If the call is a semop manipulating just one semaphore in
> an array with multiple semaphores, only take the lock for
> that semaphore itself.
> 
> If the call needs to manipulate multiple semaphores, or
> another caller is in a transaction that manipulates multiple
> semaphores, the sem_array lock is taken, as well as all the
> locks for the individual semaphores.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
> threads  vanilla  Davidlohr's  Davidlohr's +   Davidlohr's +
>                   patches      rwlock patches  v3 patches
> 10       610652   726325       1783589         2142206
> 20       341570   365699       1520453         1977878
> 30       288102   307037       1498167         2037995
> 40       290714   305955       1612665         2256484
> 50       288620   312890       1733453         2650292
> 60       289987   306043       1649360         2388008
> 70       291298   306347       1723167         2717486
> 80       290948   305662       1729545         2763582
> 90       290996   306680       1736021         2757524
> 100      292243   306700       1773700         3059159
> 
> Signed-off-by: Rik van Riel 
> Suggested-by: Linus Torvalds 

Acked-by: Davidlohr Bueso 
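
A simplified user-space sketch of the locking split described in the quoted changelog, assuming C11 atomics as toy spinlocks. All names (toy_sem_array, toy_semop, NSEMS) are hypothetical. Unlike the real patch, it does not track whether a multi-semaphore transaction is in progress; it relies on the multi-semaphore path holding every per-semaphore lock.

```c
#include <assert.h>
#include <stdatomic.h>

#define NSEMS 4   /* hypothetical number of semaphores in the array */

struct toy_sem_array {
	atomic_flag array_lock;        /* serializes multi-semaphore operations */
	atomic_flag sem_lock[NSEMS];   /* one lock per semaphore */
	int value[NSEMS];
};

static void toy_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;   /* spin until the previous holder clears the flag */
}

static void toy_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void toy_sem_init(struct toy_sem_array *sma)
{
	atomic_flag_clear(&sma->array_lock);
	for (int i = 0; i < NSEMS; i++) {
		atomic_flag_clear(&sma->sem_lock[i]);
		sma->value[i] = 0;
	}
}

/* Take the locks needed for an operation on nops semaphores; returns how many were taken. */
static int toy_sem_lock(struct toy_sem_array *sma, const int *idx, int nops)
{
	if (nops == 1) {
		/* Common case: only the lock of the one semaphore involved. */
		toy_spin_lock(&sma->sem_lock[idx[0]]);
		return 1;
	}
	/* Complex case: the array lock first, then every per-semaphore lock in index order. */
	toy_spin_lock(&sma->array_lock);
	for (int i = 0; i < NSEMS; i++)
		toy_spin_lock(&sma->sem_lock[i]);
	return NSEMS + 1;
}

static void toy_sem_unlock(struct toy_sem_array *sma, const int *idx, int nops)
{
	if (nops == 1) {
		toy_spin_unlock(&sma->sem_lock[idx[0]]);
		return;
	}
	for (int i = NSEMS - 1; i >= 0; i--)
		toy_spin_unlock(&sma->sem_lock[i]);
	toy_spin_unlock(&sma->array_lock);
}

/* Apply delta[i] to semaphore idx[i] for each of the nops entries, under the right locks. */
static void toy_semop(struct toy_sem_array *sma, const int *idx,
		      const int *delta, int nops)
{
	toy_sem_lock(sma, idx, nops);
	for (int i = 0; i < nops; i++)
		sma->value[idx[i]] += delta[i];
	toy_sem_unlock(sma, idx, nops);
}
```

Single-semaphore callers on different semaphores never touch a shared lock here, which is where the scalability gain in the table would come from; multi-semaphore callers pay for taking every lock.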



Re: [PATCH 6/7] ipc,sem: have only one list in struct sem_queue

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Having only one list in struct sem_queue, and only queueing simple
> semaphore operations on the list for the semaphore involved, allows
> us to introduce finer grained locking for semtimedop.
> 
> Signed-off-by: Rik van Riel 

Acked-by: Davidlohr Bueso 




Re: [PATCH 5/7] ipc,sem: open code and rename sem_lock

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Rename sem_lock to sem_obtain_lock, so we can introduce a sem_lock
> function later that only locks the sem_array and does nothing else.
> 
> Open code the locking from ipc_lock in sem_obtain_lock, so we can
> introduce finer grained locking for the sem_array in the next patch.
> 
> Signed-off-by: Rik van Riel 

Acked-by: Davidlohr Bueso 




Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Bueso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
> threads  vanilla  Davidlohr's  Davidlohr's +   Davidlohr's +
>                   patches      rwlock patches  v3 patches
> 10       610652   726325       1783589         2142206
> 20       341570   365699       1520453         1977878
> 30       288102   307037       1498167         2037995
> 40       290714   305955       1612665         2256484
> 50       288620   312890       1733453         2650292
> 60       289987   306043       1649360         2388008
> 70       291298   306347       1723167         2717486
> 80       290948   305662       1729545         2763582
> 90       290996   306680       1736021         2757524
> 100      292243   306700       1773700         3059159
> 

After testing these patches with my Oracle Swingbench DSS workload, I
can say that there are significant improvements. The ipc lock contention
was reduced drastically, especially with higher numbers of benchmark
users. As a result, the overall %sys time went down as well.
Furthermore, throughput (in transactions per second) was increased.

TPS:
100 users:  1257.21 (vanilla)    2805.06 (v3 patchset)
400 users:  1437.57 (vanilla)    2664.67 (v3 patchset)
800 users:  1236.89 (vanilla)    2750.73 (v3 patchset)

ipc lock contention:
100 users:  8.74%  (vanilla)    3.17% (v3 patchset)
400 users:  21.86% (vanilla)    5.23% (v3 patchset)
800 users:  84.35% (vanilla)    7.39% (v3 patchset)

As seen with perf, the ipc lock isn't even the main source of contention
anymore. Also, regardless of the number of benchmark users, the lock is
taken mostly from semctl_main().

100 users:
3.17%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
        |
        --- _raw_spin_lock
           |
           |--50.53%-- sem_lock
           |          |
           |          |--82.60%-- semctl_main
           |           --17.40%-- sys_semtimedop

400 users:
5.23%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
        |
        --- _raw_spin_lock
           |
           |--75.81%-- sem_lock
           |          |
           |          |--94.09%-- semctl_main
           |           --5.91%-- sys_semtimedop

800 users:
7.39%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
        |
        --- _raw_spin_lock
           |
           |--81.71%-- sem_lock
           |          |
           |          |--64.98%-- semctl_main
           |           --35.02%-- sys_semtimedop


Thanks,
Davidlohr




PREEMPT_RT vs 'hrtimer: Prevent hrtimer_enqueue_reprogram race'

2013-03-21 Thread Ben Hutchings
Commit b22affe0aef4 'hrtimer: Prevent hrtimer_enqueue_reprogram race'
conflicts with the RT patches
hrtimer-fixup-hrtimer-callback-changes-for-preempt-r.patch and
peter_zijlstra-frob-hrtimer.patch, as they all change
hrtimer_enqueue_reprogram().  It seems that the changes in the RT
patches now belong in __hrtimer_start_range_ns().

Since I haven't seen any RT releases in a while, here's what I came up
with for 3.2-rt:

---
From: Thomas Gleixner 
Date: Fri, 3 Jul 2009 08:44:31 -0500
Subject: hrtimer: fixup hrtimer callback changes for preempt-rt

In preempt-rt we can not call the callbacks which take sleeping locks
from the timer interrupt context.

Bring back the softirq split for now, until we fixed the signal
delivery problem for real.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Ingo Molnar 
[bwh: Pull the changes to hrtimer_enqueue_reprogram() up into
 __hrtimer_start_range_ns(), following changes in
 commit b22affe0aef4 'hrtimer: Prevent hrtimer_enqueue_reprogram race'
 backported into 3.2.40]
Signed-off-by: Ben Hutchings 
---
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -111,6 +111,8 @@ struct hrtimer {
enum hrtimer_restart(*function)(struct hrtimer *);
struct hrtimer_clock_base   *base;
unsigned long   state;
+   struct list_headcb_entry;
+   int irqsafe;
 #ifdef CONFIG_TIMER_STATS
int start_pid;
void*start_site;
@@ -147,6 +149,7 @@ struct hrtimer_clock_base {
int index;
clockid_t   clockid;
struct timerqueue_head  active;
+   struct list_headexpired;
ktime_t resolution;
ktime_t (*get_time)(void);
ktime_t softirq_time;
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -589,8 +589,7 @@ static int hrtimer_reprogram(struct hrti
 * When the callback is running, we do not reprogram the clock event
 * device. The timer callback is either running on a different CPU or
 * the callback is executed in the hrtimer_interrupt context. The
-* reprogramming is handled either by the softirq, which called the
-* callback or at the end of the hrtimer_interrupt.
+* reprogramming is handled at the end of the hrtimer_interrupt.
 */
if (hrtimer_callback_running(timer))
return 0;
@@ -625,6 +624,9 @@ static int hrtimer_reprogram(struct hrti
return res;
 }
 
+static void __run_hrtimer(struct hrtimer *timer, ktime_t *now);
+static int hrtimer_rt_defer(struct hrtimer *timer);
+
 /*
  * Initialize the high resolution related parts of cpu_base
  */
@@ -730,6 +732,11 @@ static inline int hrtimer_enqueue_reprog
 }
 static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base) { }
 static inline void retrigger_next_event(void *arg) { }
+static inline int hrtimer_reprogram(struct hrtimer *timer,
+   struct hrtimer_clock_base *base)
+{
+   return 0;
+}
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
 
@@ -861,9 +868,9 @@ void hrtimer_wait_for_timer(const struct
 {
struct hrtimer_clock_base *base = timer->base;
 
-   if (base && base->cpu_base && !hrtimer_hres_active(base->cpu_base))
+   if (base && base->cpu_base && !timer->irqsafe)
wait_event(base->cpu_base->wait,
-   !(timer->state & HRTIMER_STATE_CALLBACK));
+  !(timer->state & HRTIMER_STATE_CALLBACK));
 }
 
 #else
@@ -913,6 +920,11 @@ static void __remove_hrtimer(struct hrti
if (!(timer->state & HRTIMER_STATE_ENQUEUED))
goto out;
 
+   if (unlikely(!list_empty(&timer->cb_entry))) {
+   list_del_init(&timer->cb_entry);
+   goto out;
+   }
+
next_timer = timerqueue_getnext(&base->active);
timerqueue_del(&base->active, &timer->node);
if (&timer->node == next_timer) {
@@ -1011,6 +1023,26 @@ int __hrtimer_start_range_ns(struct hrti
 */
if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases)
&& hrtimer_enqueue_reprogram(timer, new_base)) {
+#ifdef CONFIG_PREEMPT_RT_BASE
+   again:
+   /*
+* Move softirq based timers away from the rbtree in
+* case it expired already. Otherwise we would have a
+* stale base->first entry until the softirq runs.
+*/
+   if (!hrtimer_rt_defer(timer)) {
+   ktime_t now = ktime_get();
+
+   __run_hrtimer(timer, &now);
+   /*
+* __run_hrtimer might have requeued timer and
+* it could be base->first again.
+*/
+   if (&timer->node == new_base->active.next &&
+   

Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors

2013-03-21 Thread Konrad Rzeszutek Wilk
On Fri, Mar 08, 2013 at 06:07:08PM +0100, Roger Pau Monné wrote:
> On 05/03/13 22:46, Konrad Rzeszutek Wilk wrote:
> > On Tue, Mar 05, 2013 at 06:07:57PM +0100, Roger Pau Monné wrote:
> >> On 04/03/13 21:41, Konrad Rzeszutek Wilk wrote:
> >>> On Thu, Feb 28, 2013 at 11:28:55AM +0100, Roger Pau Monne wrote:
>  Indirect descriptors introduce a new block operation
>  (BLKIF_OP_INDIRECT) that passes grant references instead of segments
>  in the request. These grant references are filled with arrays of
>  blkif_request_segment_aligned, this way we can send more segments in a
>  request.
> 
>  The proposed implementation sets the maximum number of indirect grefs
>  (frames filled with blkif_request_segment_aligned) to 256 in the
>  backend and 64 in the frontend. The value in the frontend has been
>  chosen experimentally, and the backend value has been set to a sane
>  value that allows expanding the maximum number of indirect descriptors
>  in the frontend if needed.
> >>>
> >>> So we are still using a similar format of the form:
> >>>
> >>> , etc.
> >>>
> >>> Why not utilize a layout that fits with the bio sg? That way
> >>> we might not even have to do the bio_alloc call and instead can
> >>> set up a bio (and bio-list) with the appropriate offsets/list?
> 
> I think we can already do this without changing the structure of the
> segments, we could just allocate a bio big enough to hold all the
> segments and queue them up (provided that the underlying storage device
> supports bios of this size).
> 
> bio = bio_alloc(GFP_KERNEL, nseg);
> if (unlikely(bio == NULL))
>   goto fail_put_bio;
> biolist[nbio++] = bio;
> bio->bi_bdev= preq.bdev;
> bio->bi_private = pending_req;
> bio->bi_end_io  = end_block_io_op;
> bio->bi_sector  = preq.sector_number;
> 
> for (i = 0; i < nseg; i++) {
>   rc = bio_add_page(bio, pages[i], seg[i].nsec << 9,
>   seg[i].buf & ~PAGE_MASK);
>   if (rc == 0)
>   goto fail_put_bio;
> }
> 
> This seems to work with Linux blkfront/blkback, and I guess biolist in
> blkback only has one bio all the time.

> 
> >>> Meaning that the format of the indirect descriptors is:
> >>>
> >>> 
> 
> Don't we need a length parameter? Also, next_index will be current+1,
> because we already send the segments sorted (using for_each_sg) in blkfront.
> 
> >>>
> >>> We already know what the first_sec and last_sect are - they
> >>> are basically: sector_number +  nr_segments * (whatever the sector size 
> >>> is) + offset
> >>
> >> This will of course be suitable for Linux, but what about other OSes, I
> >> know they support the traditional first_sec, last_sect (because it's
> >> already implemented), but I don't know how much work will it be for them
> >> to adopt this. If we have to do such a change I will have to check first
> >> that other frontend/backend can handle this easily also, I wouldn't like
> >> to simplify this for Linux by making it more difficult to implement in
> >> other OSes...
> > 
> > I would think that most OSes use the same framework. The ones that
> > are of notable interest are the Windows and BSD. Let's CC James here
> 
> Maybe I'm missing something here, but I don't see a really big benefit
> of using this new structure for segments instead of the current one.

The DIF/DIX requires that the bio layout going in blkfront and then
emerging on the other side in the SAS/SCSI/SATA drivers must be the same.

That means when you have a bio-vec, for example, where there are
five pages linked - the first four have 512 bytes of data (say in the middle
of the page - so 2048 -> 2560 are occupied, the rest is not). The total
is 2048 bytes, and the last page contains 32 bytes (four CRC checksums, each
8 bytes).

If we coalesce any of the five pages into one, then the backend has to
reconstruct these five pages when it takes the request out of the ring.

My thought was that with the fsect, lsect as they exist now, we will be
tempted to just coalesce four sectors in a page and just make lsect = fsect + 4.

That however is _not_ what we are doing now - I think. We look to recreate
the layout exactly as the READ/WRITE requests are set to xen-blkfront.


[PATCH][RESEND] gpio-sch: Allow for more than 8 lines in the resume well

2013-03-21 Thread Darren Hart
The E6xx (TunnelCreek) CPUs have 9 GPIO lines in the resume well. Update
the resume functions to allow for more than 8 GPIO lines, using the core
functions as a template.

Cc:  # 3.4.x
Cc:  # 3.8.x
Cc: Grant Likely 
Cc: Linus Walleij 
Signed-off-by: Darren Hart 
---
 drivers/gpio/gpio-sch.c | 37 +++--
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/gpio/gpio-sch.c b/drivers/gpio/gpio-sch.c
index edae963..7e7b52b 100644
--- a/drivers/gpio/gpio-sch.c
+++ b/drivers/gpio/gpio-sch.c
@@ -125,13 +125,17 @@ static int sch_gpio_resume_direction_in(struct gpio_chip 
*gc,
unsigned gpio_num)
 {
u8 curr_dirs;
+   unsigned short offset, bit;
 
spin_lock(&gpio_lock);
 
-   curr_dirs = inb(gpio_ba + RGIO);
+   offset = RGIO + gpio_num / 8;
+   bit = gpio_num % 8;
+
+   curr_dirs = inb(gpio_ba + offset);
 
-   if (!(curr_dirs & (1 << gpio_num)))
-   outb(curr_dirs | (1 << gpio_num) , gpio_ba + RGIO);
+   if (!(curr_dirs & (1 << bit)))
+   outb(curr_dirs | (1 << bit), gpio_ba + offset);
 
spin_unlock(&gpio_lock);
return 0;
@@ -139,22 +143,31 @@ static int sch_gpio_resume_direction_in(struct gpio_chip 
*gc,
 
 static int sch_gpio_resume_get(struct gpio_chip *gc, unsigned gpio_num)
 {
-   return !!(inb(gpio_ba + RGLV) & (1 << gpio_num));
+   unsigned short offset, bit;
+
+   offset = RGLV + gpio_num / 8;
+   bit = gpio_num % 8;
+
+   return !!(inb(gpio_ba + offset) & (1 << bit));
 }
 
 static void sch_gpio_resume_set(struct gpio_chip *gc,
unsigned gpio_num, int val)
 {
u8 curr_vals;
+   unsigned short offset, bit;
 
spin_lock(&gpio_lock);
 
-   curr_vals = inb(gpio_ba + RGLV);
+   offset = RGLV + gpio_num / 8;
+   bit = gpio_num % 8;
+
+   curr_vals = inb(gpio_ba + offset);
 
if (val)
-   outb(curr_vals | (1 << gpio_num), gpio_ba + RGLV);
+   outb(curr_vals | (1 << bit), gpio_ba + offset);
else
-   outb((curr_vals & ~(1 << gpio_num)), gpio_ba + RGLV);
+   outb((curr_vals & ~(1 << bit)), gpio_ba + offset);
 
spin_unlock(&gpio_lock);
 }
@@ -163,14 +176,18 @@ static int sch_gpio_resume_direction_out(struct gpio_chip 
*gc,
unsigned gpio_num, int val)
 {
u8 curr_dirs;
+   unsigned short offset, bit;
 
sch_gpio_resume_set(gc, gpio_num, val);
 
+   offset = RGIO + gpio_num / 8;
+   bit = gpio_num % 8;
+
spin_lock(&gpio_lock);
 
-   curr_dirs = inb(gpio_ba + RGIO);
-   if (curr_dirs & (1 << gpio_num))
-   outb(curr_dirs & ~(1 << gpio_num), gpio_ba + RGIO);
+   curr_dirs = inb(gpio_ba + offset);
+   if (curr_dirs & (1 << bit))
+   outb(curr_dirs & ~(1 << bit), gpio_ba + offset);
 
spin_unlock(&gpio_lock);
return 0;
-- 
1.8.1.2
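
A minimal user-space sketch of the byte-offset/bit split the patch above applies to the resume-well registers. The register file, BANK_BASE, and the toy_bank_* helpers are hypothetical stand-ins for the I/O port accesses; they are not the driver's RGIO/RGLV values or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the I/O port space: a small byte-wide register file. */
#define REG_FILE_SIZE 16
static uint8_t reg_file[REG_FILE_SIZE];

/* Hypothetical base offset of one register bank (not the real RGIO/RGLV value). */
#define BANK_BASE 4

/* Set or clear GPIO line gpio_num in the bank starting at BANK_BASE. */
static void toy_bank_set(unsigned gpio_num, int val)
{
	unsigned offset = BANK_BASE + gpio_num / 8;   /* which byte holds the line */
	unsigned bit = gpio_num % 8;                  /* which bit inside that byte */

	assert(offset < REG_FILE_SIZE);
	if (val)
		reg_file[offset] |= (uint8_t)(1u << bit);
	else
		reg_file[offset] &= (uint8_t)~(1u << bit);
}

/* Read back GPIO line gpio_num as 0 or 1. */
static int toy_bank_get(unsigned gpio_num)
{
	unsigned offset = BANK_BASE + gpio_num / 8;
	unsigned bit = gpio_num % 8;

	assert(offset < REG_FILE_SIZE);
	return !!(reg_file[offset] & (1u << bit));
}
```

With a single 8-bit register, line 8 would need the shift `1 << 8`, which does not fit in a u8; splitting the line number moves it to bit 0 of the next byte instead.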



Re: [PATCH V2 01/02] MIPS: Build uasm-generated code only once to avoid CPU Hotplug problem

2013-03-21 Thread Huacai Chen
On Thu, Mar 21, 2013 at 11:53 PM, David Daney  wrote:
> On 03/20/2013 04:14 PM, David Daney wrote:
>>
>> On 03/17/2013 05:49 AM, Huacai Chen wrote:
>>>
>>> This and the next patch resolve memory corruption problems during CPU
>>> hotplug. Without these patches, memory corruption can be triggered easily
>>> as below:
>>>
> [...]
>
>>
>> We were seeing the same crashes, this patch set seems to fix the problem.
>>
>> Acked-by: David Daney 
>
>
> On second thought...
>
>
>
>>
>>> ---
>>>   arch/mips/include/asm/cpu-features.h   |3 +++
>>>   .../asm/mach-loongson/cpu-feature-overrides.h  |1 +
>>>   arch/mips/mm/page.c|   10 ++
>>>   arch/mips/mm/tlbex.c   |   10 --
>>>   4 files changed, 22 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/mips/include/asm/cpu-features.h
>>> b/arch/mips/include/asm/cpu-features.h
>>> index 1a57e8b..e5ec8fc 100644
>>> --- a/arch/mips/include/asm/cpu-features.h
>>> +++ b/arch/mips/include/asm/cpu-features.h
>>> @@ -113,6 +113,9 @@
>>>   #ifndef cpu_has_pindexed_dcache
>>>   #define cpu_has_pindexed_dcache (cpu_data[0].dcache.flags &
>>> MIPS_CACHE_PINDEX)
>>>   #endif
>>> +#ifndef cpu_has_local_ebase
>>> +#define cpu_has_local_ebase1
>
>
>
> This really should default to 0 and only be set for (??who knows what??).
The original code before this patch assumes that all MIPS CPUs have a local
ebase. To minimize the modification, we default it to 1 (but I don't know
which CPUs actually have a local ebase).

>
> David Daney
>
>
>
>>> +#endif
>>>
>>>   /*
>>>* I-Cache snoops remote store. This only matters on SMP.  Some
>>> multiprocessors
>>> diff --git
>>> a/arch/mips/include/asm/mach-loongson/cpu-feature-overrides.h
>>> b/arch/mips/include/asm/mach-loongson/cpu-feature-overrides.h
>>> index 75fd8c0..c0f3ef4 100644
>>> --- a/arch/mips/include/asm/mach-loongson/cpu-feature-overrides.h
>>> +++ b/arch/mips/include/asm/mach-loongson/cpu-feature-overrides.h
>>> @@ -57,5 +57,6 @@
>>>   #define cpu_has_vint0
>>>   #define cpu_has_vtag_icache0
>>>   #define cpu_has_watch1
>>> +#define cpu_has_local_ebase0
>>>
>>>   #endif /* __ASM_MACH_LOONGSON_CPU_FEATURE_OVERRIDES_H */
>>> diff --git a/arch/mips/mm/page.c b/arch/mips/mm/page.c
>>> index a29fba5..4eb8dcf 100644
>>> --- a/arch/mips/mm/page.c
>>> +++ b/arch/mips/mm/page.c
>>> @@ -247,6 +247,11 @@ void __cpuinit build_clear_page(void)
>>>   struct uasm_label *l = labels;
>>>   struct uasm_reloc *r = relocs;
>>>   int i;
>>> +static atomic_t run_once = ATOMIC_INIT(0);
>>> +
>>> +if (atomic_xchg(&run_once, 1)) {
>>> +return;
>>> +}
>>>
>>>   memset(labels, 0, sizeof(labels));
>>>   memset(relocs, 0, sizeof(relocs));
>>> @@ -389,6 +394,11 @@ void __cpuinit build_copy_page(void)
>>>   struct uasm_label *l = labels;
>>>   struct uasm_reloc *r = relocs;
>>>   int i;
>>> +static atomic_t run_once = ATOMIC_INIT(0);
>>> +
>>> +if (atomic_xchg(&run_once, 1)) {
>>> +return;
>>> +}
>>>
>>>   memset(labels, 0, sizeof(labels));
>>>   memset(relocs, 0, sizeof(relocs));
>>> diff --git a/arch/mips/mm/tlbex.c b/arch/mips/mm/tlbex.c
>>> index 820e661..6bc28b4 100644
>>> --- a/arch/mips/mm/tlbex.c
>>> +++ b/arch/mips/mm/tlbex.c
>>> @@ -2162,8 +2162,11 @@ void __cpuinit build_tlb_refill_handler(void)
>>>   case CPU_TX3922:
>>>   case CPU_TX3927:
>>>   #ifndef CONFIG_MIPS_PGD_C0_CONTEXT
>>> -build_r3000_tlb_refill_handler();
>>> +if (cpu_has_local_ebase)
>>> +build_r3000_tlb_refill_handler();
>>>   if (!run_once) {
>>> +if (!cpu_has_local_ebase)
>>> +build_r3000_tlb_refill_handler();
>>>   build_r3000_tlb_load_handler();
>>>   build_r3000_tlb_store_handler();
>>>   build_r3000_tlb_modify_handler();
>>> @@ -2192,9 +2195,12 @@ void __cpuinit build_tlb_refill_handler(void)
>>>   build_r4000_tlb_load_handler();
>>>   build_r4000_tlb_store_handler();
>>>   build_r4000_tlb_modify_handler();
>>> +if (!cpu_has_local_ebase)
>>> +build_r4000_tlb_refill_handler();
>>>   run_once++;
>>>   }
>>> -build_r4000_tlb_refill_handler();
>>> +if (cpu_has_local_ebase)
>>> +build_r4000_tlb_refill_handler();
>>>   }
>>>   }
>>>
>>>
>>
>
>
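
A minimal user-space sketch of the run-once guard used in the quoted patch, assuming C11 atomic_exchange() as a stand-in for the kernel's atomic_xchg(). The function toy_build_once() and the counter are hypothetical; the counter merely stands in for the code-generation step.

```c
#include <assert.h>
#include <stdatomic.h>

static int build_count;   /* how many times the expensive step actually ran */

/* Hypothetical analogue of a handler builder that must only run once. */
static void toy_build_once(void)
{
	static atomic_int run_once = 0;

	/*
	 * atomic_exchange() stores 1 and returns the previous value, so only
	 * the first caller sees 0 and proceeds; later callers return early.
	 */
	if (atomic_exchange(&run_once, 1))
		return;

	build_count++;   /* stands in for generating the handler code */
}
```

Because the exchange is a single atomic step, two CPUs coming up at the same time cannot both observe 0, which a plain "if (!run_once) run_once = 1;" would not guarantee.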


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Eric W. Biederman
Vivek Goyal  writes:

> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>
> [..]
>> So if starting or end address of PT_LOAD header is not aligned, why
>> not we simply allocate a page. Copy the relevant data from old memory,
>> fill rest with zero. That way mmap and read view will be same. There
>> will be no surprises w.r.t reading old kernel memory beyond what's 
>> specified by the headers.
>
> Copying from old memory might spring surprises w.r.t hw poisoned
> pages. I guess we will have to disable MCE, read page, enable it
> back or something like that to take care of these issues.
>
> In the past we have recommended makedumpfile to be careful, look
> at struct pages and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.

Vivek you are overthinking this.

If there are issues with reading partially exported pages we should
fix them in kexec-tools or in the kernel where the data is exported.

In the examples given in the patch what we were looking at were cases
where the BIOS rightly or wrongly was saying kernel this is my memory
stay off.  But it was all perfectly healthy memory.

/proc/vmcore is a simple data dumper and prettifier.  Let's keep it that
way so that we can predict how it will act when we feed it information.
/proc/vmcore should not be worrying about or covering up sins elsewhere
in the system.

At the level of /proc/vmcore we may want to do something about ensuring
MCE's don't kill us.  But that is an orthogonal problem.

Eric


Re: [PATCH 2/3] PCI: Handle device quirks when accessing sysfs resource entries

2013-03-21 Thread Robert Hancock

On 03/20/2013 10:35 PM, Myron Stowe wrote:

Sysfs includes entries to memory regions that back a PCI device's BARs.
The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
providing direct access to the device's registers.  File permissions
prevent random users from accessing the device's registers through these
files, but don't stop a privileged app that chooses to ignore the purpose
of these files from doing so.

There are devices with abnormally strict restrictions with respect to
accessing their registers; aspects that are typically handled by the
device's driver.  When these access restrictions are not followed - as
when a userspace app such as "udevadm info --attribute-walk
--path=/sys/..." parses through, reading all the device's sysfs entries - it
can cause such devices to fail.

This patch introduces a quirking mechanism that can be used to detect
accesses that do no meet the device's restrictions, letting a device
specific method intervene and decide how to progress.

Reported-by: Xiangliang Yu 
Signed-off-by: Myron Stowe 


I honestly don't think there's much point in even attempting this
strategy. The list of devices in the quirk can't possibly be complete.
It would likely be easier to enumerate a whitelist of devices that can
deal with their IO ports being read willy-nilly than a blacklist of
those that can't, as there are likely countless devices that fall into
the latter category. Even if they don't choke as badly as these ones do,
it's quite likely that bad behavior will result.


I think there are a few things that need to be done:

-Fix the bug in udevadm that caused it to trawl through these files 
willy-nilly,


-Fix the kernel so that access through these files complies with the 
kernel's mechanisms for claiming IO/memory regions to prevent access 
conflicts (i.e. opening these files should claim the resource region 
they refer to, and should fail with EBUSY or something if another 
process or a kernel driver is using it).


-Reconsider whether supporting read/write on the resource files for IO 
port regions like these makes any sense. Obviously mmap isn't very 
practical for IO port access on x86 but you could even do something like 
an ioctl for this purpose. Not very many pieces of software would need 
to access these files so it's likely OK if the API is a bit ugly. That 
would prevent something like grepping through sysfs from generating port 
accesses to random devices.
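
The second item above (opening a resource file claims the region and
fails with EBUSY if someone else holds it) can be sketched in plain
user-space C. This is only an illustration of the claim/overlap rule,
not kernel code: the table, its size and the sketch_* names are made up
for the example, and the kernel's real resource tree is more involved.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_MAX_CLAIMS 16   /* arbitrary table size for the sketch */

/* One claimed range of port addresses; end is inclusive. */
struct sketch_claim {
    unsigned long start;
    unsigned long end;
    int in_use;
};

static struct sketch_claim claims[SKETCH_MAX_CLAIMS];

/*
 * Claim [start, end]. Returns 0 on success, -EBUSY if the range
 * overlaps an existing claim, -ENOMEM if the table is full and
 * -EINVAL for an inverted range.
 */
int sketch_claim_region(unsigned long start, unsigned long end)
{
    int free_slot = -1;

    if (start > end)
        return -EINVAL;

    for (int i = 0; i < SKETCH_MAX_CLAIMS; i++) {
        if (!claims[i].in_use) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* Two inclusive ranges overlap when each starts before the other ends. */
        if (start <= claims[i].end && claims[i].start <= end)
            return -EBUSY;
    }

    if (free_slot < 0)
        return -ENOMEM;

    claims[free_slot] = (struct sketch_claim){ start, end, 1 };
    return 0;
}

/* Drop a claim that was made with exactly the same bounds. */
void sketch_release_region(unsigned long start, unsigned long end)
{
    for (int i = 0; i < SKETCH_MAX_CLAIMS; i++) {
        if (claims[i].in_use &&
            claims[i].start == start && claims[i].end == end)
            claims[i].in_use = 0;
    }
}
```

With a rule like this, a driver that claimed its ports at probe time
would make a later open() of the matching sysfs file fail instead of
letting a stray read reach the device.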



Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
page-size boundary requirement
Date: Thu, 21 Mar 2013 11:27:51 -0400

> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
> 
> [..]
>> So if starting or end address of PT_LOAD header is not aligned, why
>> not we simply allocate a page. Copy the relevant data from old memory,
>> fill rest with zero. That way mmap and read view will be same. There
>> will be no surprises w.r.t reading old kernel memory beyond what's 
>> specified by the headers.
> 
> Copying from old memory might spring surprises w.r.t hw poisoned
> pages. I guess we will have to disable MCE, read page, enable it
> back or something like that to take care of these issues.
> 
> In the past we have recommended makedumpfile to be careful, look
> at struct pages and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.

Yes, that has been already implemented in makedumpfile.

Not only copying but also mmapping poisoned pages might be problematic,
because creating page table entries for the poisoned pages can trigger
hardware cache prefetch. Or does MCE disable the prefetch? I'm not sure,
but I'll investigate this. makedumpfile might also have to be careful
about calling mmap.
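
For reference, the "allocate a page, copy the relevant data, fill the
rest with zero" idea quoted above could look roughly like the sketch
below. It is plain user-space C under stated assumptions: old_mem is a
flat array standing in for old-kernel memory indexed by address, the
page size is fixed at 4096, and sketch_fill_boundary_page is a made-up
name. It says nothing about the poisoned-page problem.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed page size for the sketch */

/*
 * Fill `page` (SKETCH_PAGE_SIZE bytes) for the page starting at
 * `page_start`. Only the bytes that fall inside the segment
 * [seg_start, seg_end) are copied from `old_mem`; everything else in
 * the page is zero, so nothing outside the segment leaks through.
 */
void sketch_fill_boundary_page(unsigned char *page,
                               const unsigned char *old_mem,
                               unsigned long page_start,
                               unsigned long seg_start,
                               unsigned long seg_end)
{
    unsigned long page_end = page_start + SKETCH_PAGE_SIZE;
    unsigned long lo = seg_start > page_start ? seg_start : page_start;
    unsigned long hi = seg_end < page_end ? seg_end : page_end;

    memset(page, 0, SKETCH_PAGE_SIZE);
    if (lo < hi)
        memcpy(page + (lo - page_start), old_mem + lo, hi - lo);
}
```

Because the zero fill happens before the copy, read() and mmap() would
see the same bytes for such a page.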

Thanks.
HATAYAMA, Daisuke



Re: linux-next: build failure after merge of the tty tree

2013-03-21 Thread Stephen Rothwell
Hi Greg,

On Thu, 21 Mar 2013 16:47:24 -0700 Greg KH  wrote:
>
> On Fri, Mar 22, 2013 at 10:28:08AM +1100, Stephen Rothwell wrote:
> > 
> > Except, of course, commit 27b351c is not in your tty tree :-(
> 
> Which was causing me lots of confusion :)
> 
> I've merged it in there now, and reverted it, so all should be good.

Thanks.

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au



Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark indicating the end of ELF note buffer

2013-03-21 Thread Eric W. Biederman
HATAYAMA Daisuke  writes:

> From: Vivek Goyal 
> Subject: Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark 
> indicating the end of ELF note buffer
> Date: Thu, 21 Mar 2013 10:36:56 -0400
>
>> And in our case we don't know the size of ELF note. Kernel is not
>> exporting the size. So kexec-tools is putting an upper limit of 1024
>> and putting that value in p_memsz and p_filesz fields.
>> 
>> Given the fact that we are reserving elf notes at boot. That means
>> we know the size of ELF notes. It should make sense to export it
>> to user space and let kexec-tools put right values.
>> 
>
> Anyway, I think of this issue as beyond the scope of what I'm working
> on here...

Agreed.  It is independent and can be fixed independently.

Eric


[PATCH] arch: remove KCORE_ELF again

2013-03-21 Thread Paul Bolle
The Kconfig symbol KCORE_ELF was removed in v2.6.0, but reappeared in two
architectures. It is useless. Remove it again.

Signed-off-by: Paul Bolle 
---
0) Untested.

1) Sent as one patch. Feel free to tell me to split it up in two
patches.

 arch/tile/Kconfig   |  5 -
 arch/xtensa/Kconfig | 15 ---
 2 files changed, 20 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 2f8ccff..18e2a81 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -411,11 +411,6 @@ endmenu
 
 menu "Executable file formats"
 
-# only elf supported
-config KCORE_ELF
-   def_bool y
-   depends on PROC_FS
-
 source "fs/Kconfig.binfmt"
 
 endmenu
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index b09de49..d10b159 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -252,21 +252,6 @@ endmenu
 
 menu "Executable file formats"
 
-# only elf supported
-config KCORE_ELF
-   def_bool y
-depends on PROC_FS
-help
-  If you enabled support for /proc file system then the file
-  /proc/kcore will contain the kernel core image in ELF format. This
-  can be used in gdb:
-
-  $ cd /usr/src/linux ; gdb vmlinux /proc/kcore
-
-  This is especially useful if you have compiled the kernel with the
-  "-g" option to preserve debugging information. It is mainly used
- for examining kernel data structures on the live kernel.
-
 source "fs/Kconfig.binfmt"
 
 endmenu
-- 
1.7.11.7



Re: [PATCH 2/2] ACPI,acpi_memhotplug: Remove acpi_memory_info->failed bit

2013-03-21 Thread Toshi Kani
On Thu, 2013-03-21 at 13:39 +0900, Yasuaki Ishimatsu wrote:
> acpi_memory_info has enabled bit and failed bit for controlling memory
> hotplug. But we don't need to keep both bits.
> 
> The patch removes acpi_memory_info->failed bit.
> 
> Signed-off-by: yasuaki ishimatsu 
> ---
> 
> v2 : Changed a based kernel from linux-3.9-rc2 to linux-pm.git/bleeding-edge.
> 
> ---
>   drivers/acpi/acpi_memhotplug.c |   13 +
>   1 files changed, 1 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index ea78988..597cd65 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -73,7 +73,6 @@ struct acpi_memory_info {
>   unsigned short caching; /* memory cache attribute */
>   unsigned short write_protect;   /* memory read/write attribute */
>   unsigned int enabled:1;
> - unsigned int failed:1;
>   };
>   
>   struct acpi_memory_device {
> @@ -201,10 +200,8 @@ static int acpi_memory_enable_device(struct 
> acpi_memory_device *mem_device)
>* returns -EEXIST. If add_memory() returns the other error, it
>* means that this memory block is not used by the kernel.
>*/
> - if (result && result != -EEXIST) {
> - info->failed = 1;
> + if (result && result != -EEXIST)
>   continue;
> - }
>   
>   info->enabled = 1;
>   
> @@ -238,15 +235,7 @@ static int acpi_memory_remove_memory(struct 
> acpi_memory_device *mem_device)
>   nid = acpi_get_node(mem_device->device->handle);
>   
>   list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
> - if (info->failed)
> - /* The kernel does not use this memory block */
> - continue;
> -
>   if (!info->enabled)
> - /*
> -  * The kernel uses this memory block, but it may be not
> -  * managed by us.
> -  */
>   return -EBUSY;

Shouldn't this case (!info->enabled) continue since it is the same as
info->failed before?  -EBUSY was previously used for the -EEXIST case,
which is no longer a failure case with this patchset.

Thanks,
-Toshi




Re: [PATCH] kirkwood: fix coccicheck warnings

2013-03-21 Thread Rafael J. Wysocki
On Friday, March 22, 2013 02:18:38 AM Silviu Popescu wrote:
> On Fri, Mar 22, 2013 at 1:48 AM, Rafael J. Wysocki  wrote:
> > On Monday, March 11, 2013 09:35:19 AM Silviu-Mihai Popescu wrote:
> >> Convert all uses of devm_request_and_ioremap() to the newly introduced
> >> devm_ioremap_resource() which provides more consistent error handling.
> >>
> >> devm_ioremap_resource() provides its own error messages so all explicit
> >> error messages can be removed from the failure code paths.
> >>
> >> Signed-off-by: Silviu-Mihai Popescu 
> >
> > If I'm supposed to take these changes, please split them into separate 
> > patches
> > for cpufreq, cpuidle and thermal (which would be for Rui BTW).
> 
> There seems to be an existing patch for thermal[1], as Rui pointed out 
> himself.
> I have split the changes in two patches, as you have requested, and resent 
> them.
> 
> [1] http://marc.info/?l=linux-pm&m=136238017027514&w=2

Thanks a lot!

Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: [PATCH 1/4] ARM: context tracking: add exception support

2013-03-21 Thread Kevin Hilman
Hi Russell,

Russell King - ARM Linux  writes:

> On Wed, Mar 20, 2013 at 05:01:58PM -0700, Kevin Hilman wrote:
>> Add ARM support for the context tracking subsystem by instrumenting
>> exception entry/exit points.
>> 
>> Special thanks to Mats Liljegren for testing, collaboration and adding
>> support for exceptions/faults that were missing in early test versions.
>
> Not sure all of these are a good idea or are correct...

Thanks for the review.

>> @@ -405,7 +406,9 @@ asmlinkage void __exception do_undefinstr(struct pt_regs 
>> *regs)
>>  unsigned int instr;
>>  siginfo_t info;
>>  void __user *pc;
>> +enum ctx_state prev_state;
>>  
>> +prev_state = exception_enter();
>>  pc = (void __user *)instruction_pointer(regs);
>>  
>>  if (processor_mode(regs) == SVC_MODE) {
>> @@ -433,8 +436,10 @@ asmlinkage void __exception do_undefinstr(struct 
>> pt_regs *regs)
>>  goto die_sig;
>>  }
>>  
>> -if (call_undef_hook(regs, instr) == 0)
>> +if (call_undef_hook(regs, instr) == 0) {
>> +exception_exit(prev_state);
>>  return;
>> +}
>>  
>>  die_sig:
>>  #ifdef CONFIG_DEBUG_USER
>> @@ -451,12 +456,17 @@ die_sig:
>>  info.si_addr  = pc;
>>  
>>  arm_notify_die("Oops - undefined instruction", regs, &info, 0, 6);
>> +exception_exit(prev_state);
>
> So, FP emulation and VFP support happens via a separate path.  Does this
> also need to be instrumented?

Yes, those will need to be instrumented too. I haven't looked at the
floating point stuff (and am obviously not testing any FP userspace
currently.)  Thanks for the reminder, I'll look at that next (feel free
to point me in the right direction if you have suggestions about where
to best instrument those.  I've not yet looked closely at how either are
handled.)

>>  }
>>  
>>  asmlinkage void do_unexp_fiq (struct pt_regs *regs)
>>  {
>> +enum ctx_state prev_state;
>> +
>> +prev_state = exception_enter();
>>  printk("Hmm.  Unexpected FIQ received, but trying to continue\n");
>>  printk("You may have a hardware problem...\n");
>> +exception_exit(prev_state);
>
> Not a good idea.  If we get here chances are things are really broken.
>
>>  }
>>  
>>  /*
>> @@ -467,6 +477,9 @@ asmlinkage void do_unexp_fiq (struct pt_regs *regs)
>>   */
>>  asmlinkage void bad_mode(struct pt_regs *regs, int reason)
>>  {
>> +enum ctx_state prev_state;
>> +
>> +prev_state = exception_enter();
>>  console_verbose();
>>  
>>  printk(KERN_CRIT "Bad mode in %s handler detected\n", handler[reason]);
>
> Same here.  If we get here, we're probably about to die a horrid death.

Yeah, I was wondering if I should bother with these "about to die"
scenarios.  I'll drop them since they may cause more problems than
they're worth.

>> @@ -746,7 +759,9 @@ baddataabort(int code, unsigned long instr, struct 
>> pt_regs *regs)
>>  {
>>  unsigned long addr = instruction_pointer(regs);
>>  siginfo_t info;
>> +enum ctx_state prev_state;
>>  
>> +prev_state = exception_enter();
>>  #ifdef CONFIG_DEBUG_USER
>>  if (user_debug & UDBG_BADABORT) {
>>  printk(KERN_ERR "[%d] %s: bad data abort: code %d instr 
>> 0x%08lx\n",
>> @@ -762,6 +777,7 @@ baddataabort(int code, unsigned long instr, struct 
>> pt_regs *regs)
>>  info.si_addr  = (void __user *)addr;
>>  
>>  arm_notify_die("unknown data abort code", regs, &info, instr, 0);
>> +exception_exit(prev_state);
>>  }
>>  
>>  void __readwrite_bug(const char *fn)
>> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
>> index 5dbf13f..759b70d 100644
>> --- a/arch/arm/mm/fault.c
>> +++ b/arch/arm/mm/fault.c
>> @@ -19,6 +19,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -424,9 +425,15 @@ do_translation_fault(unsigned long addr, unsigned int 
>> fsr,
>>  pgd_t *pgd, *pgd_k;
>>  pud_t *pud, *pud_k;
>>  pmd_t *pmd, *pmd_k;
>> -
>> -if (addr < TASK_SIZE)
>> -return do_page_fault(addr, fsr, regs);
>> +enum ctx_state prev_state;
>> +
>> +prev_state = exception_enter();
>> +if (addr < TASK_SIZE) {
>> +int ret;
>> +ret = do_page_fault(addr, fsr, regs);
>> +exception_exit(prev_state);
>> +return ret;
>> +}
>>  
>>  if (user_mode(regs))
>>  goto bad_area;
>> @@ -472,10 +479,12 @@ do_translation_fault(unsigned long addr, unsigned int 
>> fsr,
>>  goto bad_area;
>>  
>>  copy_pmd(pmd, pmd_k);
>> +exception_exit(prev_state);
>>  return 0;
>>  
>>  bad_area:
>>  do_bad_area(addr, fsr, regs);
>> +exception_exit(prev_state);
>>  return 0;
>>  }
>>  #else   /* CONFIG_MMU */
>> @@ -494,7 +503,12 @@ do_translation_fault(unsigned long addr, unsigned int 
>> fsr,
>>  static int
>>  do_sect_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>>  {
>> +enum ctx_state prev_state;
>> +
>> +   

Re: [PATCH] cpufreq/intel_pstate: Add function to check that all MSR's are valid

2013-03-21 Thread Rafael J. Wysocki
On Thursday, March 21, 2013 01:08:03 AM Rafael J. Wysocki wrote:
> On Wednesday, March 20, 2013 11:28:49 AM Dirk Brandewie wrote:
> > On 03/20/2013 11:28 AM, Rafael J. Wysocki wrote:
> > > On Wednesday, March 20, 2013 09:17:24 AM dirk.brande...@gmail.com wrote:
> > >> From: Dirk Brandewie 
> > >>
> > >> Some VMs seem to try to implement some MSRs but not all the registers
> > >> the driver needs.  Check to make sure all the MSR that we need are
> > >> available. If any of the required MSRs are not available refuse to
> > >> load.
> > >>
> > >> Signed-off-by: Dirk Brandewie 
> > >
> > > Is this needed for v3.9?  Any pointers to bug reports etc.?
> > >
> > 
> > Sorry I saw right after I sent the mail that the bug report was missing
> > 
> >  https://bugzilla.redhat.com/show_bug.cgi?id=922923
> >  Reported-by: Josh Stone 
> > 
> > Would you like me to spin the patch?
> 
> No, thanks, this information is sufficient.

Applied to linux-pm.git/bleeding-edge and will be moved to linux-next after
build testing.

Thanks,
Rafael


> > >> ---
> > >>   drivers/cpufreq/intel_pstate.c |   26 ++
> > >>   1 files changed, 26 insertions(+), 0 deletions(-)
> > >>
> > >> diff --git a/drivers/cpufreq/intel_pstate.c 
> > >> b/drivers/cpufreq/intel_pstate.c
> > >> index f6dd1e7..cd9c5f4 100644
> > >> --- a/drivers/cpufreq/intel_pstate.c
> > >> +++ b/drivers/cpufreq/intel_pstate.c
> > >> @@ -752,6 +752,29 @@ static struct cpufreq_driver intel_pstate_driver = {
> > >>
> > >>   static int __initdata no_load;
> > >>
> > >> +static int intel_pstate_msrs_not_valid(void)
> > >> +{
> > >> +/* Check that all the msr's we are using are valid. */
> > >> +u64 aperf, mperf, tmp;
> > >> +
> > >> +rdmsrl(MSR_IA32_APERF, aperf);
> > >> +rdmsrl(MSR_IA32_MPERF, mperf);
> > >> +
> > >> +if (!intel_pstate_min_pstate() ||
> > >> +!intel_pstate_max_pstate() ||
> > >> +!intel_pstate_turbo_pstate())
> > >> +return -ENODEV;
> > >> +
> > >> +rdmsrl(MSR_IA32_APERF, tmp);
> > >> +if (!(tmp - aperf))
> > >> +return -ENODEV;
> > >> +
> > >> +rdmsrl(MSR_IA32_MPERF, tmp);
> > >> +if (!(tmp - mperf))
> > >> +return -ENODEV;
> > >> +
> > >> +return 0;
> > >> +}
> > >>   static int __init intel_pstate_init(void)
> > >>   {
> > >>  int cpu, rc = 0;
> > >> @@ -764,6 +787,9 @@ static int __init intel_pstate_init(void)
> > >>  if (!id)
> > >>  return -ENODEV;
> > >>
> > >> +if (intel_pstate_msrs_not_valid())
> > >> +return -ENODEV;
> > >> +
> > >>  pr_info("Intel P-state driver initializing.\n");
> > >>
> > >>  all_cpu_data = vmalloc(sizeof(void *) * num_possible_cpus());
> > >>
> > 
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark indicating the end of ELF note buffer

2013-03-21 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark indicating 
the end of ELF note buffer
Date: Thu, 21 Mar 2013 10:36:56 -0400

> On Wed, Mar 20, 2013 at 08:54:25PM -0700, Eric W. Biederman wrote:
> 
> [..]
>> > Also, it's possible to get size of a whole part of ELF note segments
>> > from p_memsz or p_filesz, and gdb and binutils are reading the note
>> > segments until reaching the size.
>> 
>> Agreed.  Except in our weird case where we generate the notes on the
>> fly, and generate the NOTE segment header much earlier.
> 
> And in our case we don't know the size of ELF note. Kernel is not
> exporting the size. So kexec-tools is putting an upper limit of 1024
> and putting that value in p_memsz and p_filesz fields.
> 
> Given the fact that we are reserving elf notes at boot. That means
> we know the size of ELF notes. It should make sense to export it
> to user space and let kexec-tools put right values.
> 
> In fact looks like /sys/kernel/vmcoreinfo is exporting two values. Address
> and size. (This is kind of a violation of the sysfs policy of one value per
> file). But for per cpu notes, we are exporting only address and not
> size.

IIRC, Greg Norman pointed out this violation in the vmcoreinfo file when
he found it some months ago.

> 
> /sys/devices/system/cpu/cpu/crash_notes
> 
> May be we should export another file
> 
> /sys/devices/system/cpu/cpu/crash_notes_size
> 
> and let kexec-tools parse it.

Anyway, I think of this issue as beyond the scope of what I'm working
on here...
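
To make the size problem concrete: with p_memsz fixed at an upper limit
of 1024, a reader has to find out on its own where the notes really
end. A minimal user-space sketch follows, assuming the unused tail of
the buffer is zeroed so that a header with n_namesz == 0 can be taken
as the end mark. The struct mirrors the three 32-bit words of an ELF
note header; the sketch_* names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as an ELF note header: three 32-bit words. */
struct sketch_note_hdr {
    uint32_t n_namesz;
    uint32_t n_descsz;
    uint32_t n_type;
};

/* Name and descriptor are each padded to a 4-byte boundary. */
static size_t sketch_align4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/*
 * Return how many bytes of buf[0..max_len) are occupied by notes.
 * Walking stops at the first header with an empty name (assumed end
 * mark) or at a note that would run past max_len.
 */
size_t sketch_used_note_bytes(const unsigned char *buf, size_t max_len)
{
    size_t off = 0;

    while (max_len - off >= sizeof(struct sketch_note_hdr)) {
        struct sketch_note_hdr nh;
        size_t sz;

        memcpy(&nh, buf + off, sizeof(nh));
        if (nh.n_namesz == 0)
            break;

        sz = sizeof(nh) + sketch_align4(nh.n_namesz)
                        + sketch_align4(nh.n_descsz);
        if (sz > max_len - off)
            break;
        off += sz;
    }
    return off;
}
```

If the kernel exported the real size (for example through a
crash_notes_size file as suggested above), kexec-tools could put that
value into p_memsz and p_filesz and no such scan would be needed.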

Thanks.
HATAYAMA, Daisuke



Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to get position of program header table

2013-03-21 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to 
get position of program header table
Date: Thu, 21 Mar 2013 10:12:02 -0400

> On Thu, Mar 21, 2013 at 11:50:41AM +0900, HATAYAMA Daisuke wrote:
>> From: "Eric W. Biederman" 
>> Subject: Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to 
>> get position of program header table
>> Date: Tue, 19 Mar 2013 14:44:16 -0700
>> 
>> > HATAYAMA Daisuke  writes:
>> > 
>> >> Currently, the code assumes that position of program header table is
>> >> next to ELF header. But future change can break the assumption on
>> >> kexec-tools and the 1st kernel. To avoid worst case, reference e_phoff
>> >> member explicitly to get position of program header table in
>> >> file-offset.
>> > 
>> > In principle this looks good.  However when I read this it looks like
>> > you are going a little too far.
>> > 
>> > You are changing not only the reading of the supplied headers, but
>> > you are changing the generation of the new headers that describe
>> > the data provided by /proc/vmcore.
>> > 
>> > I get lost in following this after you mangle merge_note_headers.
>> > 
>> > In principle removing silly assumptions seems reasonable, but I think
>> > it is completely orthogonal to the task of making vmcore mmapable.
>> > 
>> > I think it is fine to claim that the assumptions made here in vmcore are
>> > part of the kexec on panic ABI at this point, which would generally make
>> > this change unnecessary.
>> 
>> This was suggested by Vivek. He prefers generic one.
>> 
>> Vivek, do you agree to this? Or is it better to re-post this and other
>> clean-up patches as another one separately to this patch set?
> 
> Given the fact that current code has been working, I am fine to just
> re-post and take care of mmap() related issues. And we can take care
> of cleaning up of some assumptions about PT_NOTE headers later. Trying
> to club large cleanup with mmap() patches is making it hard to review.
> 

I see. I'll post the clean-up series separately.
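
For readers following along, the difference between assuming that the
program header table sits right after the ELF header and reading
e_phoff explicitly can be sketched in user-space C as below. The struct
repeats the standard 64-bit ELF header layout so that the example is
self-contained; sketch_phdr_table and the bounds checks are
illustrative only and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field-for-field copy of the standard 64-bit ELF header (64 bytes). */
struct sketch_elf64_ehdr {
    unsigned char e_ident[16];
    uint16_t e_type;
    uint16_t e_machine;
    uint32_t e_version;
    uint64_t e_entry;
    uint64_t e_phoff;      /* file offset of the program header table */
    uint64_t e_shoff;
    uint32_t e_flags;
    uint16_t e_ehsize;
    uint16_t e_phentsize;  /* size of one program header */
    uint16_t e_phnum;      /* number of program headers */
    uint16_t e_shentsize;
    uint16_t e_shnum;
    uint16_t e_shstrndx;
};

/*
 * Locate the program header table through e_phoff instead of assuming
 * it starts at image + sizeof(header). Returns NULL if the table would
 * not fit inside the image.
 */
const unsigned char *sketch_phdr_table(const unsigned char *image, size_t len)
{
    struct sketch_elf64_ehdr eh;

    if (len < sizeof(eh))
        return NULL;
    memcpy(&eh, image, sizeof(eh));

    if (eh.e_phoff > len)
        return NULL;
    if ((uint64_t)eh.e_phnum * eh.e_phentsize > len - eh.e_phoff)
        return NULL;

    return image + eh.e_phoff;
}
```

When e_phoff happens to equal the header size, both approaches return
the same address, which is why the existing assumption has worked so
far.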

Thanks.
HATAYAMA, Daisuke



Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 5:12 PM, Al Viro  wrote:
>
> What we should do, IMO, is to turn /proc//net into a honest symlink -
> to ../nets//net.  Hell, might even make it a magical symlink
> instead...

Ok, having seen the error of my ways, I'm starting to agree with you..
 How painful would that be? Especially since we'd need to backport
it..

  Linus


Re: VFS deadlock ?

2013-03-21 Thread Al Viro
On Fri, Mar 22, 2013 at 12:12:57AM +, Al Viro wrote:

> See the posting upthread.  We could try to kludge around that as well
> (e.g. have d_ancestor() compare ->d_inode instead of dentries themselves),
> but I really think it's a lousy idea only inviting further abuse.
> 
> What we should do, IMO, is to turn /proc//net into a honest symlink -
> to ../nets//net.  Hell, might even make it a magical symlink
> instead...

BTW, the root cause is that what used to be /proc/net became per-process.
So Eric (IIRC) had added /proc//net.  Only they are not really per-process
- they are per-netns.  And instead of putting those per-ns trees elsewhere and
having /proc//net resolve to the right one, we got them as directories,
with each entry hardlinked between all /proc//net for processes from
the same netns.  Including the subdirectory ones.  Oops...

Another variant is to keep cross-hardlinks for non-directories and duplicate
directory dentries/inodes as we do for /proc//net themselves.


Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
Thinking some more..

On Thu, Mar 21, 2013 at 5:15 PM, Linus Torvalds
 wrote:
>
> Hmm. But again, that can't actually happen here. We're in /proc. You
> can't move the entries around.

.. this wasn't a good argument, because we will take the locks before
we do that.

> Also, we only changed the locking order
> for the "inode is identical" case where we take only *one* lock, we
> didn't change it for the cases where we take multiple locks (and order
> them topologically).

.. and this isn't a good argument either, because your argument was
that you can get the deadlock by always taking two directories, and
never hitting the alias case itself.

Hmm.

 Linus


Re: [PATCH] kirkwood: fix coccicheck warnings

2013-03-21 Thread Silviu Popescu
On Fri, Mar 22, 2013 at 1:48 AM, Rafael J. Wysocki  wrote:
> On Monday, March 11, 2013 09:35:19 AM Silviu-Mihai Popescu wrote:
>> Convert all uses of devm_request_and_ioremap() to the newly introduced
>> devm_ioremap_resource() which provides more consistent error handling.
>>
>> devm_ioremap_resource() provides its own error messages so all explicit
>> error messages can be removed from the failure code paths.
>>
>> Signed-off-by: Silviu-Mihai Popescu 
>
> If I'm supposed to take these changes, please split them into separate patches
> for cpufreq, cpuidle and thermal (which would be for Rui BTW).

There seems to be an existing patch for thermal[1], as Rui pointed out himself.
I have split the changes in two patches, as you have requested, and resent them.

[1] http://marc.info/?l=linux-pm&m=136238017027514&w=2

Thanks,
Silviu Popescu


[PATCH] cpuimx27 and mbimx27: prepend CONFIG_ to Kconfig macro

2013-03-21 Thread Paul Bolle
Commit 2d66c7803595da0d4bcd949825d598575f5de9e6 ("cpuimx27 and mbimx27:
allow fine control of UART4 and SDHC2 usage") added the Kconfig symbol
MACH_EUKREA_CPUIMX27_USEUART4. But it forgot to prepend CONFIG_ to the
use of its macro. Add that prefix now.

Signed-off-by: Paul Bolle 
---
Untested. This needs testing, obviously. Or a rethink, because these
typos have been in the tree since v2.6.36!

 arch/arm/mach-imx/eukrea_mbimx27-baseboard.c | 4 ++--
 arch/arm/mach-imx/mach-cpuimx27.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-imx/eukrea_mbimx27-baseboard.c 
b/arch/arm/mach-imx/eukrea_mbimx27-baseboard.c
index b4c7002..b2f08bf 100644
--- a/arch/arm/mach-imx/eukrea_mbimx27-baseboard.c
+++ b/arch/arm/mach-imx/eukrea_mbimx27-baseboard.c
@@ -46,7 +46,7 @@ static const int eukrea_mbimx27_pins[] __initconst = {
PE10_PF_UART3_CTS,
PE11_PF_UART3_RTS,
/* UART4 */
-#if !defined(MACH_EUKREA_CPUIMX27_USEUART4)
+#if !defined(CONFIG_MACH_EUKREA_CPUIMX27_USEUART4)
PB26_AF_UART4_RTS,
PB28_AF_UART4_TXD,
PB29_AF_UART4_CTS,
@@ -306,7 +306,7 @@ void __init eukrea_mbimx27_baseboard_init(void)
 
imx27_add_imx_uart1(_pdata);
imx27_add_imx_uart2(_pdata);
-#if !defined(MACH_EUKREA_CPUIMX27_USEUART4)
+#if !defined(CONFIG_MACH_EUKREA_CPUIMX27_USEUART4)
imx27_add_imx_uart3(_pdata);
 #endif
 
diff --git a/arch/arm/mach-imx/mach-cpuimx27.c 
b/arch/arm/mach-imx/mach-cpuimx27.c
index 1465593..ea50870 100644
--- a/arch/arm/mach-imx/mach-cpuimx27.c
+++ b/arch/arm/mach-imx/mach-cpuimx27.c
@@ -48,7 +48,7 @@ static const int eukrea_cpuimx27_pins[] __initconst = {
PE14_PF_UART1_CTS,
PE15_PF_UART1_RTS,
/* UART4 */
-#if defined(MACH_EUKREA_CPUIMX27_USEUART4)
+#if defined(CONFIG_MACH_EUKREA_CPUIMX27_USEUART4)
PB26_AF_UART4_RTS,
PB28_AF_UART4_TXD,
PB29_AF_UART4_CTS,
@@ -272,7 +272,7 @@ static void __init eukrea_cpuimx27_init(void)
/* SDHC2 can be used for Wifi */
imx27_add_mxc_mmc(1, NULL);
 #endif
-#if defined(MACH_EUKREA_CPUIMX27_USEUART4)
+#if defined(CONFIG_MACH_EUKREA_CPUIMX27_USEUART4)
/* in which case UART4 is also used for Bluetooth */
imx27_add_imx_uart3(_pdata);
 #endif
-- 
1.7.11.7
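
Why the missing prefix matters can be shown with the preprocessor
alone. In the sketch below the first #define stands in for the line
Kconfig generates when the option is enabled; the UART4_SEEN_* macros
and the two helper functions are made up for the example.

```c
#include <assert.h>

/* Stand-in for the generated config header when the option is set to y. */
#define CONFIG_MACH_EUKREA_CPUIMX27_USEUART4 1

/* The test as it was written in the tree: nothing defines this name. */
#if defined(MACH_EUKREA_CPUIMX27_USEUART4)
#define UART4_SEEN_WITHOUT_PREFIX 1
#else
#define UART4_SEEN_WITHOUT_PREFIX 0
#endif

/* The test with the CONFIG_ prefix added. */
#if defined(CONFIG_MACH_EUKREA_CPUIMX27_USEUART4)
#define UART4_SEEN_WITH_PREFIX 1
#else
#define UART4_SEEN_WITH_PREFIX 0
#endif

int uart4_seen_without_prefix(void)
{
    return UART4_SEEN_WITHOUT_PREFIX;
}

int uart4_seen_with_prefix(void)
{
    return UART4_SEEN_WITH_PREFIX;
}
```

The unprefixed test evaluates the same way whether or not the option is
selected, so the guarded blocks in the two files above never followed
the configuration.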



Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 5:08 PM, Al Viro  wrote:
>
> Not really.  Do that and yes, this deadlock goes away.  But the locking
> order in general goes to hell - we order directory inodes by "which dentry
> is an ancestor of another?"  So we have no warranty that we won't get
> alias1/foo/bar/baz < alias2/foo.  Take rename_lock() on those two and
> have it race with rmdir alias2/foo/bar/baz (locks alias2/foo/bar, then
> alias2/foo/bar/baz) and rmdir alias2/foo/bar (locks alias2/foo and
> alias2/foo/bar).  Oops - we have a cycle now...

Hmm. But again, that can't actually happen here. We're in /proc. You
can't move the entries around. Also, we only changed the locking order
for the "inode is identical" case where we take only *one* lock, we
didn't change it for the cases where we take multiple locks (and order
them topologically).

So I agree that we need to avoid aliased directories in the *general*
case. I'm just arguing that for the specific case of /proc, we should
be ok. No?
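
The cycle in the quoted paragraph can be checked mechanically. The
sketch below is user-space C, not VFS code: it treats the three
directories as nodes, records "lock A is taken before lock B" as an
edge, and looks for a cycle with a depth-first search. It assumes that
alias1/foo/bar/baz and alias2/foo/bar/baz are the same inode, which is
the premise of the example; all names are made up.

```c
#include <assert.h>

#define N_LOCKS 3
enum { FOO, BAR, BAZ };   /* the three directory inodes in the example */

/* state: 0 = unvisited, 1 = on the current DFS path, 2 = finished */
static int sketch_visit(int before[N_LOCKS][N_LOCKS], int node,
                        int state[N_LOCKS])
{
    state[node] = 1;
    for (int next = 0; next < N_LOCKS; next++) {
        if (!before[node][next])
            continue;
        if (state[next] == 1)
            return 1;               /* back edge: ordering has a cycle */
        if (state[next] == 0 && sketch_visit(before, next, state))
            return 1;
    }
    state[node] = 2;
    return 0;
}

/* before[a][b] != 0 means some path takes lock a and then lock b. */
int sketch_lock_order_has_cycle(int before[N_LOCKS][N_LOCKS])
{
    int state[N_LOCKS] = {0};

    for (int i = 0; i < N_LOCKS; i++) {
        if (state[i] == 0 && sketch_visit(before, i, state))
            return 1;
    }
    return 0;
}
```

The two rmdir calls alone give FOO -> BAR and BAR -> BAZ, which is
acyclic. Adding the rename that takes baz (through alias1) before foo
(through alias2) gives BAZ -> FOO and closes the loop.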

 Linus


[PATCH] drivers: cpuidle: kirkwood: fix coccicheck warnings

2013-03-21 Thread Silviu-Mihai Popescu
Convert all uses of devm_request_and_ioremap() to the newly introduced
devm_ioremap_resource() which provides more consistent error handling.

devm_ioremap_resource() provides its own error messages so all explicit
error messages can be removed from the failure code paths.

Signed-off-by: Silviu-Mihai Popescu 
---
 drivers/cpuidle/cpuidle-kirkwood.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-kirkwood.c 
b/drivers/cpuidle/cpuidle-kirkwood.c
index 670aa1e..53aad73 100644
--- a/drivers/cpuidle/cpuidle-kirkwood.c
+++ b/drivers/cpuidle/cpuidle-kirkwood.c
@@ -66,9 +66,9 @@ static int kirkwood_cpuidle_probe(struct platform_device 
*pdev)
if (res == NULL)
return -EINVAL;
 
-   ddr_operation_base = devm_request_and_ioremap(&pdev->dev, res);
-   if (!ddr_operation_base)
-   return -EADDRNOTAVAIL;
+   ddr_operation_base = devm_ioremap_resource(&pdev->dev, res);
+   if (IS_ERR(ddr_operation_base))
+   return PTR_ERR(ddr_operation_base);
 
device = &per_cpu(kirkwood_cpuidle_device, smp_processor_id());
device->state_count = KIRKWOOD_MAX_STATES;
-- 
1.7.9.5
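
The IS_ERR()/PTR_ERR() pattern in the hunk above encodes a negative
errno in the pointer value itself, so the caller can pass on the real
reason for the failure instead of a fixed -EADDRNOTAVAIL. A simplified
user-space imitation follows. The sketch_* helpers only mimic the
kernel macros; the mapper, its parameters and the error codes it
returns are chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095   /* errnos live in the top 4095 addresses */

static inline void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static inline int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static char sketch_fake_registers[16];   /* stands in for mapped registers */

/* Hypothetical mapper: returns a valid pointer or an encoded errno. */
void *sketch_map_resource(int have_resource, int busy)
{
    if (!have_resource)
        return sketch_err_ptr(-EINVAL);
    if (busy)
        return sketch_err_ptr(-EBUSY);
    return sketch_fake_registers;
}

/* Probe-style caller: propagate whatever error the mapper encoded. */
int sketch_probe(int have_resource, int busy)
{
    void *base = sketch_map_resource(have_resource, busy);

    if (sketch_is_err(base))
        return (int)sketch_ptr_err(base);
    return 0;
}
```

A NULL check would not catch these encoded values, which is why the
patch replaces the !ddr_operation_base test with IS_ERR().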



Re: VFS deadlock ?

2013-03-21 Thread Al Viro
On Thu, Mar 21, 2013 at 05:01:49PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 4:58 PM, Linus Torvalds
>  wrote:
> >
> > So yes, it's against the rules, and we get that deadlock right now,
> > but one solution would be to just allow this particular case. The
> > patch for the deadlock looks dead simple:
> 
> It should go without saying that that whitespace-damaged patch is
> entirely untested. But especially since we need to back-port whatever
> fix, it would be good if we could make the fix be something as simple
> as this. Because I don't believe we really want to backport some big
> network-namespace reorganization.
> 
> This is, of course, all assuming that hardlinked directories are ok if
> we can just guarantee the absence of loops. If there's some other
> reason why they'd be problematic, we're screwed.

See the posting upthread.  We could try to kludge around that as well
(e.g. have d_ancestor() compare ->d_inode instead of dentries themselves),
but I really think it's a lousy idea only inviting further abuse.

What we should do, IMO, is to turn /proc/<pid>/net into an honest symlink -
to ../nets//net.  Hell, might even make it a magical symlink
instead...


[PATCH] drivers: cpufreq: kirkwood: fix coccicheck warnings

2013-03-21 Thread Silviu-Mihai Popescu
Convert all uses of devm_request_and_ioremap() to the newly introduced
devm_ioremap_resource() which provides more consistent error handling.

devm_ioremap_resource() provides its own error messages so all explicit
error messages can be removed from the failure code paths.

Signed-off-by: Silviu-Mihai Popescu 
---
 drivers/cpufreq/kirkwood-cpufreq.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/kirkwood-cpufreq.c b/drivers/cpufreq/kirkwood-cpufreq.c
index 0e83e3c..6052476 100644
--- a/drivers/cpufreq/kirkwood-cpufreq.c
+++ b/drivers/cpufreq/kirkwood-cpufreq.c
@@ -175,11 +175,9 @@ static int kirkwood_cpufreq_probe(struct platform_device *pdev)
 		dev_err(&pdev->dev, "Cannot get memory resource\n");
 		return -ENODEV;
 	}
-	priv.base = devm_request_and_ioremap(&pdev->dev, res);
-	if (!priv.base) {
-		dev_err(&pdev->dev, "Cannot ioremap\n");
-		return -EADDRNOTAVAIL;
-	}
+	priv.base = devm_ioremap_resource(&pdev->dev, res);
+	if (IS_ERR(priv.base))
+		return PTR_ERR(priv.base);
 
 	np = of_find_node_by_path("/cpus/cpu@0");
 	if (!np)
-- 
1.7.9.5



Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Rik,
On 03/21/2013 08:52 AM, Rik van Riel wrote:
> On 03/20/2013 12:18 PM, Michal Hocko wrote:
>> On Sun 17-03-13 13:04:07, Mel Gorman wrote:
>> [...]
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 88c5fed..4835a7a 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>>>  }
>>>
>>>  /*
>>> + * kswapd shrinks the zone by the number of pages required to reach
>>> + * the high watermark.
>>> + */
>>> +static void kswapd_shrink_zone(struct zone *zone,
>>> +			       struct scan_control *sc,
>>> +			       unsigned long lru_pages)
>>> +{
>>> +	unsigned long nr_slab;
>>> +	struct reclaim_state *reclaim_state = current->reclaim_state;
>>> +	struct shrink_control shrink = {
>>> +		.gfp_mask = sc->gfp_mask,
>>> +	};
>>> +
>>> +	/* Reclaim above the high watermark. */
>>> +	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>>
>> OK, so the cap is at high watermark which sounds OK to me, although I
>> would expect balance_gap being considered here. Is it not used
>> intentionally or you just wanted to have a reasonable upper bound?
>>
>> I am not objecting to that it just hit my eyes.
>
> This is the maximum number of pages to reclaim, not the point
> at which to stop reclaiming.

What's the difference between the maximum number of pages to reclaim and
the point at which to stop reclaiming?

> I assume Mel chose this value because it guarantees that enough
> pages will have been freed, while also making sure that the value
> is scaled according to zone size (keeping pressure between zones
> roughly equal).
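
Editorial note: one possible reading of the distinction discussed above is that the cap bounds how much work a single reclaim pass may do, while the stop condition is a separate watermark check. The sketch below is a toy userspace model under that assumption and is not kernel code. struct demo_zone, reclaim_cap(), zone_balanced() and reclaim_pass() are invented names, SWAP_CLUSTER_MAX is assumed to be 32, and balance_gap is deliberately left out.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical, heavily simplified view of a zone. */
struct demo_zone {
	unsigned long free_pages;
	unsigned long high_wmark;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Upper bound on pages reclaimed in one pass: mirrors the max() in the patch. */
static unsigned long reclaim_cap(const struct demo_zone *zone)
{
	return max_ul(SWAP_CLUSTER_MAX, zone->high_wmark);
}

/* Stop condition: the zone is balanced once free pages reach the watermark. */
static int zone_balanced(const struct demo_zone *zone)
{
	return zone->free_pages >= zone->high_wmark;
}

/* One pass frees at most reclaim_cap() pages, and fewer if it balances first. */
static unsigned long reclaim_pass(struct demo_zone *zone,
				  unsigned long reclaimable)
{
	unsigned long cap = reclaim_cap(zone);
	unsigned long freed = 0;

	assert(zone->high_wmark > 0);
	while (freed < cap && freed < reclaimable && !zone_balanced(zone)) {
		zone->free_pages++;
		freed++;
	}
	return freed;
}
```

In this model the deficit of a zone can never exceed its high watermark, so a cap of at least that size is always large enough to balance the zone in one pass, while larger zones get a proportionally larger cap.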





Re: VFS deadlock ?

2013-03-21 Thread Al Viro
On Thu, Mar 21, 2013 at 04:58:41PM -0700, Linus Torvalds wrote:

> And the only other reason we don't want to allow it is to make sure
> you can't have directory loops etc, afaik, and again, for this
> particular case of /proc, we happen to be ok.

Not really.  Do that and yes, this deadlock goes away.  But the locking
order in general goes to hell - we order directory inodes by "which dentry
is an ancestor of another?"  So we have no warranty that we won't get
alias1/foo/bar/baz < alias2/foo.  Take lock_rename() on those two and
have it race with rmdir alias2/foo/bar/baz (locks alias2/foo/bar, then
alias2/foo/bar/baz) and rmdir alias2/foo/bar (locks alias2/foo and
alias2/foo/bar).  Oops - we have a cycle now...
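
Editorial note: the cycle described above can be checked mechanically by recording "lock B taken while holding lock A" as a directed edge and looking for a loop. The sketch below is a userspace illustration only. The enum values and the lock_order_*() helpers are invented for this note, and the acquisition order assumed for the rename (baz before foo) is just the one possible order the message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Directory inodes; alias1/foo and alias2/foo are the same inode. */
enum { FOO, BAR, BAZ, NR_LOCKS };

/* edge[a][b] is true when some path takes lock b while holding lock a. */
static bool edge[NR_LOCKS][NR_LOCKS];

static void lock_order_reset(void)
{
	memset(edge, 0, sizeof(edge));
}

static void lock_order_add(int held, int taken)
{
	assert(held >= 0 && held < NR_LOCKS);
	assert(taken >= 0 && taken < NR_LOCKS);
	edge[held][taken] = true;
}

/* Depth-first search; state: 0 = unvisited, 1 = on stack, 2 = finished. */
static bool dfs(int node, int *state)
{
	state[node] = 1;
	for (int next = 0; next < NR_LOCKS; next++) {
		if (!edge[node][next])
			continue;
		if (state[next] == 1)
			return true;	/* back edge: ordering cycle */
		if (state[next] == 0 && dfs(next, state))
			return true;
	}
	state[node] = 2;
	return false;
}

static bool lock_order_has_cycle(void)
{
	int state[NR_LOCKS] = { 0 };

	for (int n = 0; n < NR_LOCKS; n++)
		if (state[n] == 0 && dfs(n, state))
			return true;
	return false;
}
```

Adding the edges BAR->BAZ and FOO->BAR (the two rmdir calls) leaves the graph acyclic; adding BAZ->FOO for the rename across aliases closes the loop, which is the deadlock potential being pointed out.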


Re: [PATCH 1/1] clk: Add notifier support in clk_prepare/clk_unprepare

2013-03-21 Thread Colin Cross
On Thu, Mar 21, 2013 at 3:36 PM, Mike Turquette  wrote:
> To my knowledge, devfreq performs one task: implements an algorithm
> (typically one that loops/polls) and applies this heuristic towards a
> dvfs transition.
>
> It is a policy layer, a high level layer.  It should not be used as a
> lower-level mechanism.  Please correct me if my understanding is wrong.
>
> I think the very idea of the clk framework calling into devfreq is
> backwards.  Ideally a devfreq driver would call clk_set_rate as part of
> its target callback.  This is analogous to a cpufreq .target callback
> which calls clk_set_rate and regulator_set_voltage.  Can you imagine the
> clock framework cross-calling into cpufreq when clk_set_rate is called?
> I think that would be strange.
>
> I think that all of this discussion highlights the fact that there is a
> missing piece of infrastructure.  It isn't devfreq or clock rate-change
> notifiers.  It is that there is not a dvfs mechanism which neatly builds
> on top of these lower-level frameworks (clocks & regulators).  Clearly
> some higher-level abstraction layer is needed.

I went through all of this on Tegra2.  For a while I had a
dvfs_set_rate api for drivers that needed to modify the voltage when
they updated a clock, but I ended up dropping it.  Drivers rarely care
about the voltage, all they want to do is set their clock rate.  The
voltage necessary to support that clock is an implementation detail of
the silicon that is irrelevant to the driver (I know TI liked to
specify voltage/frequency combos for the blocks, but their chips still
had to support running at a lower clock speed for the voltage than
specified in the OPP because that case always occurs during a dvfs
change).

For Tegra2, before clk_prepare/clk_unprepare existed, I hacked dvfs
into the clk framework by using a mixture of mutex locked clocks and
spinlock locked clocks.  The main issue is accidental recursive
locking of the main clock locks when the call path is
clk->dvfs->regulator set->i2c->clk.  I think if you could guarantee
that clocks required for dvfs were always in the "prepared" state
(maybe a flag on the clock, kind of like WQ_MEM_RECLAIM marks
"special" workqueues, or just have the machine call clk_prepare), and
that clk_prepare on an already-prepared clock avoided taking the mutex
(atomic op fastpath plus mutex slow path?), then the existing
notifiers would be perfect for dvfs.
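
Editorial note: the "atomic op fastpath plus mutex slow path" idea floated above could look roughly like the userspace sketch below. This is an assumption-laden illustration using C11 atomics and pthreads, not the clk framework. struct demo_clk, inc_not_zero(), demo_clk_prepare() and demo_clk_unprepare() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical clock; not the kernel's struct clk. */
struct demo_clk {
	atomic_int prepare_count;
	pthread_mutex_t prepare_lock;
	int hw_prepare_calls;	/* how often the slow path did real work */
};

/* Increment *v unless it is zero; returns true when the increment happened. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

static void demo_clk_prepare(struct demo_clk *clk)
{
	/* Fast path: already prepared, so the mutex is never touched. */
	if (inc_not_zero(&clk->prepare_count))
		return;

	/* Slow path: first user, or racing with the first user. */
	pthread_mutex_lock(&clk->prepare_lock);
	if (atomic_load(&clk->prepare_count) == 0)
		clk->hw_prepare_calls++;	/* stands in for the sleeping work */
	atomic_fetch_add(&clk->prepare_count, 1);
	pthread_mutex_unlock(&clk->prepare_lock);
}

static void demo_clk_unprepare(struct demo_clk *clk)
{
	/* Always under the mutex, so the count cannot reach zero mid-prepare. */
	pthread_mutex_lock(&clk->prepare_lock);
	assert(atomic_load(&clk->prepare_count) > 0);
	atomic_fetch_sub(&clk->prepare_count, 1);
	pthread_mutex_unlock(&clk->prepare_lock);
}
```

If the clocks needed on the regulator/i2c path are kept prepared, a nested prepare call from that path only hits the fast path and never tries to take the mutex a second time. The unprepare side still takes the mutex in this sketch.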


Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Johannes,
On 03/21/2013 11:57 PM, Johannes Weiner wrote:
> On Sun, Mar 17, 2013 at 01:04:07PM +0000, Mel Gorman wrote:
>> The number of pages kswapd can reclaim is bound by the number of pages it
>> scans which is related to the size of the zone and the scanning priority. In
>> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
>> reclaimed pages but in the event kswapd scans a large number of pages it
>> cannot reclaim, it will raise the priority and potentially discard a large
>> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
>> effect is a reclaim "spike" where a large percentage of memory is suddenly
>> freed. It would be bad enough if this was just unused memory but because
>> of how anon/file pages are balanced it is possible that applications get
>> pushed to swap unnecessarily.
>>
>> This patch limits the number of pages kswapd will reclaim to the high
>> watermark. Reclaim will will overshoot due to it not being a hard limit as
>
> will -> still?
>
>> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
>> prevents kswapd reclaiming the world at higher priorities. The number of
>> pages it reclaims is not adjusted for high-order allocations as kswapd will
>> reclaim excessively if it is to balance zones for high-order allocations.
>
> I don't really understand this last sentence.  Is the excessive
> reclaim a result of the patch, a description of what's happening
> now...?
>
>> Signed-off-by: Mel Gorman 
>
> Nice, thank you.  Using the high watermark for larger zones is more
> reasonable than my hack that just always went with SWAP_CLUSTER_MAX,
> what with inter-zone LRU cycle time balancing and all.
>
> Acked-by: Johannes Weiner 

One offline question: how should I understand this in balance_pgdat()?

	/*
	 * Do some background aging of the anon list, to give
	 * pages a chance to be referenced before reclaiming.
	 */
	age_active_anon(zone, &sc);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a>




Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-03-21 Thread Kevin Hilman
Gleb Natapov  writes:

> On Thu, Mar 21, 2013 at 02:33:13PM -0500, Scott Wood wrote:
>> On 03/21/2013 02:16:00 PM, Gleb Natapov wrote:
>> >On Thu, Mar 21, 2013 at 01:42:34PM -0500, Scott Wood wrote:
>> >> On 03/21/2013 09:27:14 AM, Kevin Hilman wrote:
>> >> >Gleb Natapov  writes:
>> >> >
>> >> >> On Wed, Mar 20, 2013 at 06:58:41PM -0500, Scott Wood wrote:
>> >> >>> Why can't the entirety of kvm_host.h be included regardless of
>> >> >>> CONFIG_KVM, just like most other feature-specific headers?  Why
>> >> >>> can't the if/else just go around the functions that you want to
>> >> >>> stub out for non-KVM builds?
>> >> >>>
>> >> >> Kevin,
>> >> >>
>> >> >>  What compilation failure does this patch fix? I presume
>> >> >> something ARM related.
>> >> >
>> >> >Not specifically ARM related, but more context tracking related,
>> >> >since kernel/context_tracking.c pulls in kvm_host.h, which attempts
>> >> >to pull in  which may not exist on some platforms.
>> >> >
>> >> >At least for ARM, KVM support was added in v3.9 so this patch can
>> >> >probably be dropped since the non-KVM builds on ARM now work.  But
>> >> >any platform without the  will still be broken when trying to
>> >> >build the context tracker.
>> >>
>> >> Maybe other platforms should get empty asm/kvm*.h files.  Is there
>> >> anything from those files that the linux/kvm*.h headers need to
>> >> build?
>> >>
>> >arch things. kvm_vcpu_arch, kvm_arch_memory_slot, kvm_arch etc.
>> 
>> Could define them as empty structs.
>> 
> Isn't it simpler for kernel/context_tracking.c to define empty
> __guest_enter()/__guest_exit() if !CONFIG_KVM?

I proposed something like that in an earlier version but Frederic asked
me to propose a fix to the KVM headers instead.

Just in case fixing the context tracking subsystem is preferred, 
the patch below fixes the problem also.

Kevin

From f22995a262144d0d61705fa72134694d911283eb Mon Sep 17 00:00:00 2001
From: Kevin Hilman 
Date: Thu, 21 Mar 2013 16:57:14 -0700
Subject: [PATCH] context_tracking: fix !CONFIG_KVM compile: add stub guest
 enter/exit

When KVM is not enabled, or not available on a platform, the KVM
headers should not be included.  Instead, just define stub
__guest_[enter|exit] functions.

Cc: Frederic Weisbecker 
Signed-off-by: Kevin Hilman 
---
 kernel/context_tracking.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 65349f0..64b0f80 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -15,12 +15,18 @@
  */
 
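 #include <linux/context_tracking.h>
-#include <linux/kvm_host.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
 #include <linux/hardirq.h>
 #include <linux/export.h>
 
+#if IS_ENABLED(CONFIG_KVM)
+#include <linux/kvm_host.h>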
+#else
+#define __guest_enter()
+#define __guest_exit()
+#endif
+
 DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 #ifdef CONFIG_CONTEXT_TRACKING_FORCE
.active = true,
-- 
1.8.2
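
Editorial note: the patch above stubs the two hooks out with empty macros. A commonly used alternative, shown here as a userspace sketch and not as what the patch does, is to provide empty static inline functions so that callers are still type-checked when the feature is off. DEMO_HAVE_KVM, demo_guest_enter(), demo_guest_exit() and demo_run_guest() are hypothetical names for this note.

```c
#include <assert.h>

/* Hypothetical feature switch; build with -DDEMO_HAVE_KVM=1 for the real branch. */
#ifndef DEMO_HAVE_KVM
#define DEMO_HAVE_KVM 0
#endif

static int guest_transitions;	/* counts real enter/exit calls */

#if DEMO_HAVE_KVM
static inline void demo_guest_enter(void) { guest_transitions++; }
static inline void demo_guest_exit(void)  { guest_transitions++; }
#else
/* Stubs: same signatures, empty bodies, still checked by the compiler. */
static inline void demo_guest_enter(void) { }
static inline void demo_guest_exit(void)  { }
#endif

/* The caller needs no #ifdef of its own. */
static int demo_run_guest(void)
{
	int before = guest_transitions;

	demo_guest_enter();
	demo_guest_exit();
	assert(guest_transitions == before + (DEMO_HAVE_KVM ? 2 : 0));
	return guest_transitions;
}
```

Either form keeps the conditional compilation in one place; the inline variant additionally catches signature mismatches in non-KVM builds.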



Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 4:58 PM, Linus Torvalds
 wrote:
>
> So yes, it's against the rules, and we get that deadlock right now,
> but one solution would be to just allow this particular case. The
> patch for the deadlock looks dead simple:

It should go without saying that that whitespace-damaged patch is
entirely untested. But especially since we need to back-port whatever
fix, it would be good if we could make the fix be something as simple
as this. Because I don't believe we really want to backport some big
network-namespace reorganization.

This is, of course, all assuming that hardlinked directories are ok if
we can just guarantee the absence of loops. If there's some other
reason why they'd be problematic, we're screwed.

Linus


Re: VFS deadlock ?

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 4:36 PM, Al Viro  wrote:
>
> Some netns-related idiocy.  Oh, shit...
>
> al@duke:~/linux/trees/vfs$ ls -lid /proc/{1,2}/net/stat
> 4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/1/net/stat
> 4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/2/net/stat
>
> Eric, would you mind explaining WTF is going on here?  Again, WE CAN NOT
> HAVE SEVERAL DENTRIES OVER THE SAME DIRECTORY INODE.  Ever.  We do that,
> we are fucked.

Hmm. That certainly explains the situation, but it leaves me wondering
whether the simplest solution to this is not to say "ok, let's allow
it in this case".

The locking is already per-inode, so we can literally change the code
that checks "if same dentry" to "if same inode" instead.

And the only other reason we don't want to allow it is to make sure
you can't have directory loops etc, afaik, and again, for this
particular case of /proc, we happen to be ok.

So yes, it's against the rules, and we get that deadlock right now,
but one solution would be to just allow this particular case. The
patch for the deadlock looks dead simple:

diff --git a/fs/namei.c b/fs/namei.c
index 57ae9c8c66bf..435002f99bd8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2277,7 +2277,7 @@ struct dentry *lock_rename(struct dentry
*p1, struct dentry *p2)
 {
 struct dentry *p;

-if (p1 == p2) {
+if (p1->d_inode == p2->d_inode) {
 mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
 return NULL;
 }
@@ -2306,7 +2306,7 @@ struct dentry *lock_rename(struct dentry
*p1, struct dentry *p2)
 void unlock_rename(struct dentry *p1, struct dentry *p2)
 {
 mutex_unlock(&p1->d_inode->i_mutex);
-if (p1 != p2) {
+if (p1->d_inode != p2->d_inode) {
 mutex_unlock(&p2->d_inode->i_mutex);
 mutex_unlock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
 }
 }

Are there any other reasons why these kinds of "hardlinked
directories" would cause problems?

 Linus
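
Editorial note: the core of the proposal above is to decide "same object" by the inode that owns the lock, not by the name pointing at it. The userspace sketch below illustrates only that comparison. struct demo_inode, struct demo_dentry, demo_lock_pair() and demo_unlock_pair() are invented names, and the address-based ordering for two distinct inodes is a simplification assumed for this note, not the ancestor-based ordering the kernel uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature versions of the kernel structures. */
struct demo_inode { pthread_mutex_t i_mutex; };
struct demo_dentry { struct demo_inode *d_inode; };

/* Returns the number of mutexes taken: 1 when both names share an inode. */
static int demo_lock_pair(struct demo_dentry *p1, struct demo_dentry *p2)
{
	struct demo_inode *a, *b;

	assert(p1 != NULL && p2 != NULL);
	a = p1->d_inode;
	b = p2->d_inode;

	/* Compare inodes, not dentries: aliases must lock only once. */
	if (a == b) {
		pthread_mutex_lock(&a->i_mutex);
		return 1;
	}

	/* Distinct inodes: order by address so every caller agrees. */
	if ((uintptr_t)a > (uintptr_t)b) {
		struct demo_inode *tmp = a;

		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->i_mutex);
	pthread_mutex_lock(&b->i_mutex);
	return 2;
}

static void demo_unlock_pair(struct demo_dentry *p1, struct demo_dentry *p2)
{
	pthread_mutex_unlock(&p1->d_inode->i_mutex);
	if (p1->d_inode != p2->d_inode)
		pthread_mutex_unlock(&p2->d_inode->i_mutex);
}
```

Had the first test compared p1 and p2 directly, two different names for one inode would have fallen through to the two-lock path and tried to take the same non-recursive mutex twice, which is the self-deadlock under discussion.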

