Re: 2.6.25-rc2: wpa_supplicant BUGs kernel in rwlock recursion

2008-02-16 Thread Guillaume Chazarain
On Feb 16, 2008 6:14 PM, Alessandro Suardi <[EMAIL PROTECTED]> wrote:
> Feb 16 16:51:49 sandman kernel: BUG: rwlock recursion on CPU#0,

Same thing here, bisected it to:

commit 45b503548210fe6f23e92b856421c2a3f05fd034
Author: Laszlo Attila Toth <panther at balabit.hu>
Date:   Tue Feb 12 22:42:09 2008 -0800

[RTNETLINK]: Send a single notification on device state changes.

The revert applies cleanly and fixes the problem.
Rafael has more details in http://lkml.org/lkml/2008/2/15/542.

-- 
Guillaume
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Help debugging filesystem activity?

2008-02-11 Thread Guillaume Chazarain
On Feb 11, 2008 2:17 PM, rzryyvzy <[EMAIL PROTECTED]> wrote:
> $ cat /proc/fs/vfs/reading_files
>
> $ cat /proc/fs/vfs/writing_files

You can try:

# echo 1 > /proc/sys/vm/block_dump
# dmesg
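
With block_dump enabled, the kernel log can grow quickly, so a per-process summary may be easier to read than raw dmesg output. The following is a minimal sketch, assuming log lines shaped like `comm(pid): WRITE block N on dev` and `comm(pid): dirtied inode N (name) on dev`; that line shape is an assumption about kernels of this era, and the sample text is invented for illustration rather than captured from a real machine.

```python
import re
from collections import Counter

# Assumed line shapes (not checked against every kernel version):
#   comm(pid): READ|WRITE block N on dev
#   comm(pid): dirtied inode N (name) on dev
LINE_RE = re.compile(r"(?P<comm>.+?)\((?P<pid>\d+)\): (?P<op>READ|WRITE|dirtied)\b")

# Optional dmesg timestamp prefix such as "[  512.100000] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def summarise(log_text):
    """Count block_dump-style events per (process name, operation)."""
    counts = Counter()
    for line in log_text.splitlines():
        line = TIMESTAMP_RE.sub("", line.strip())
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("comm"), match.group("op"))] += 1
    return counts


# Hypothetical sample, invented for illustration only.
SAMPLE = """\
kjournald(1234): WRITE block 4120 on sda1
bash(2201): dirtied inode 991 (.bash_history) on sda1
[  512.100000] kjournald(1234): WRITE block 4128 on sda1
updatedb(3050): READ block 77 on sda1
"""

if __name__ == "__main__":
    for (comm, op), n in sorted(summarise(SAMPLE).items()):
        print(comm, op, n)
```

In practice the input would be the saved output of dmesg rather than SAMPLE; lines that do not match the assumed shape are simply ignored.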

HTH.

-- 
Guillaume


Re: [patch 019/233] proc: fix the threaded /proc/self

2008-02-08 Thread Guillaume Chazarain
On Feb 8, 2008 1:18 PM,  <[EMAIL PROTECTED]> wrote:
> Long ago when the CLONE_THREAD support first went it someone thought it
> would be wise to point /proc/self at /proc/<tgid> instead of /proc/<pid>.

The last message in this conversation is:

http://lkml.org/lkml/2007/12/1/172

So I thought we would end up with a new file, in order to make the
change discoverable.

-- 
Guillaume


Re: [PATCH] proc: return -EPERM when preventing read of /proc/*/maps

2008-02-03 Thread Guillaume Chazarain
On Jan 4, 2008 4:19 PM, Al Viro <[EMAIL PROTECTED]> wrote:
> Umm...  Actually, m_next() and m_stop() both appear to be too convoluted.
>
> * m_next() never gets v == NULL
> * the only reason why we do that mmput et.al. both from ->next() and
>   ->stop() is that we try to avoid having priv->mm; why bother?
> * why the _hell_ is proc_maps_private defined in include/linux/proc_fs.h,
>   of all places?
> * while we are at it, why is it in any header at all?  Having that sucker
>   in task_mmu.c and task_nommu.c would be more than enough (and we'd avoid
>   that ifdef in definition, while we are at it).
>
> How about this:

Hi Al,

Any update on this patch?
As you completely rewrote it, I thought you would take care of pushing
it forward.

Thanks.

-- 
Guillaume


Re: [PATCH] x86: remove unused code in set_cyc2ns_scale()

2008-01-31 Thread Guillaume Chazarain
On 1/31/08, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> hm, this is not a pure elimination of dead code, this will change
> behavior. For example we wont call sched_clock_idle_sleep_event() on
> !cpu_khz now. Hm?

Oops, indeed I overlooked that. OTOH, I can't see how it can happen
(in 32 bit at least), and even if it happens it should not have any
effect. But I'll keep this check to avoid making this case illegal.
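
The arithmetic behind the !cpu_khz check can be sketched outside the kernel. This is an illustration only: it assumes CYC2NS_SCALE_FACTOR is 10 and that cycles are converted with a multiply followed by a right shift, and the function names below are made up for the sketch, not the kernel's.

```python
NSEC_PER_MSEC = 1_000_000
CYC2NS_SCALE_FACTOR = 10  # assumed value of the kernel constant (a 2^10 fixed-point shift)


def cyc2ns_scale(cpu_khz):
    """Fixed-point nanoseconds-per-cycle factor, or None when cpu_khz is 0.

    Returning None stands in for the early return in the patch: dividing
    by a zero frequency is not meaningful, so the scale is left alone.
    """
    if cpu_khz == 0:
        return None
    return (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) // cpu_khz


def cycles_to_ns(cycles, scale):
    """Convert a cycle count to nanoseconds using the fixed-point scale."""
    return (cycles * scale) >> CYC2NS_SCALE_FACTOR


if __name__ == "__main__":
    scale = cyc2ns_scale(2_000_000)              # a 2 GHz CPU, expressed in kHz
    print(scale)                                 # 512
    print(cycles_to_ns(2_000_000_000, scale))    # 1000000000 ns, i.e. one second
    print(cyc2ns_scale(0))                       # None
```

At 2 GHz the scale is exactly 512, so one second's worth of cycles converts to exactly 10^9 ns; at other frequencies the integer division introduces a small rounding error.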

Thanks for the review.

-- 
Guillaume


Re: High wake up latencies with FAIR_USER_SCHED

2008-01-31 Thread Guillaume Chazarain
On 1/31/08, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> Does something like this help?

I made it compile by open coding undefined macros instead of
refactoring the whole file.
But it didn't affect wake up latencies.

Thanks.

-- 
Guillaume


Re: Hang in work_resched

2008-01-31 Thread Guillaume Chazarain
On 1/31/08, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> works for me :-( (x86_64 rawhide userspace)

i386, !SMP, Fedora 8 here.

> Could you send your .config?

Here we go:

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24
# Thu Jan 31 12:33:36 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_SUPPORTS_OPROFILE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-gc"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_CGROUP_CPUACCT is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_BLOCK=y
CONFIG_LBD=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_LSF is not set
CONFIG_BLK_DEV_BSG=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=m
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_CLASSIC_RCU=y
# CONFIG_PREEMPT_RCU is not set

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
# CONFIG_SMP is not set
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_X86_RDC321X is not set
# CONFIG_X86_VSMP is not set
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4

Re: Hang in work_resched

2008-01-31 Thread Guillaume Chazarain
On Jan 31, 2008 9:55 AM, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> Does this patch from thomas fix it as well?

Unfortunately, not.

For information, reverting just the first part of the offending commit
(sl->timer.cb_mode) fixed the problem, while reverting only the second
part (if (!hrtimer_active(&t->timer))) had no effect.

Also, I found a trivially reproducible testcase: sleep 0.

It hangs in nanosleep({0, 0}).
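
For repeated testing, the reproducer can be wrapped so that a hang shows up as a failure instead of blocking the terminal. A minimal sketch, assuming the hung process can still be killed (which may not hold if it is stuck uninterruptibly in the kernel); the helper name completes_within is hypothetical.

```python
import subprocess
import sys


def completes_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it timed out.

    On timeout, subprocess.run() kills the child before raising.
    """
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True


if __name__ == "__main__":
    ok = completes_within(["sleep", "0"], 5.0)
    print("sleep 0 returned" if ok else "sleep 0 hung (timed out)")
    # Exit status 0 means the kernel behaved, 1 means the hang reproduced.
    sys.exit(0 if ok else 1)
```

The five-second limit is arbitrary; any value comfortably above the normal run time of sleep 0 would do.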

-- 
Guillaume


Re: Hang in work_resched

2008-01-30 Thread Guillaume Chazarain
On Jan 29, 2008 11:30 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote:
>  ===
> gnome-termina S 0027 0  2201  1
>f6711fb0 00200082 cb330d62 0027 f664105c 0b1e  cb331880
>0027 f660d780 009e3840 080ab7d8 080ab298 f6711000 c0103e7e 009e3840
>000e0002 0002 080ab7d8 080ab298 bfb41be8 080ab7d8 007b c010007b
> Call Trace:
>  [<c0103e7e>] work_resched+0x5/0x16
>  ===
>
> This corresponds to the cli instruction:
> c0103e7e:   fa  cli

I bisected it, and the resulting commit is appended. Reverting this
commit applies cleanly on today's git
(dd430ca20c40ecccd6954a7efd13d4398f507728) and makes the hang go away
-:)


commit 37bb6cb4097e29ffee970065b74499cbf10603a3
Author: Peter Zijlstra <[EMAIL PROTECTED]>
Date:   Fri Jan 25 21:08:32 2008 +0100

    hrtimer: unlock hrtimer_wakeup

    hrtimer_wakeup creates a

      base->lock
        rq->lock

    lock dependancy. Avoid this by switching to HRTIMER_CB_IRQSAFE_NO_SOFTIRQ
    which doesn't hold base->lock.

    This fully untangles hrtimer locks from the scheduler locks, and allows
    hrtimer usage in the scheduler proper.

    Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
    Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 061ae28..bd5d6b5 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1293,7 +1293,7 @@ void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *task)
 	sl->timer.function = hrtimer_wakeup;
 	sl->task = task;
 #ifdef CONFIG_HIGH_RES_TIMERS
-	sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_RESTART;
+	sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
 #endif
 }
 
@@ -1304,6 +1304,8 @@ static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mod
 	do {
 		set_current_state(TASK_INTERRUPTIBLE);
 		hrtimer_start(&t->timer, t->timer.expires, mode);
+		if (!hrtimer_active(&t->timer))
+			t->task = NULL;
 
 		if (likely(t->task))
 			schedule();


-- 
Guillaume


[PATCH] x86: remove unused code in set_cyc2ns_scale()

2008-01-29 Thread Guillaume Chazarain
This should be folded into:
4f95bd6e2b21a8c724357463f8341502d47aba13
x86: scale cyc_2_nsec according to CPU frequency

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
---
 arch/x86/kernel/tsc_32.c |   14 +++++---------
 arch/x86/kernel/tsc_64.c |   14 +++++---------
 2 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 43517e3..e05e221 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -83,20 +83,16 @@ DEFINE_PER_CPU(unsigned long, cyc2ns);
 
 static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	unsigned long flags, prev_scale, *scale;
-	unsigned long long tsc_now, ns_now;
+	unsigned long flags, *scale;
+
+	if (!cpu_khz)
+		return;
 
 	local_irq_save(flags);
 	sched_clock_idle_sleep_event();
 
 	scale = &per_cpu(cyc2ns, cpu);
-
-	rdtscll(tsc_now);
-	ns_now = __cycles_2_ns(tsc_now);
-
-	prev_scale = *scale;
-	if (cpu_khz)
-		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
 
 	/*
 	 * Start smoothly with the new frequency:
diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c
index 947554d..e0e9d4f 100644
--- a/arch/x86/kernel/tsc_64.c
+++ b/arch/x86/kernel/tsc_64.c
@@ -44,20 +44,16 @@ DEFINE_PER_CPU(unsigned long, cyc2ns);
 
 static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	unsigned long flags, prev_scale, *scale;
-	unsigned long long tsc_now, ns_now;
+	unsigned long flags, *scale;
+
+	if (!cpu_khz)
+		return;
 
 	local_irq_save(flags);
 	sched_clock_idle_sleep_event();
 
 	scale = &per_cpu(cyc2ns, cpu);
-
-	rdtscll(tsc_now);
-	ns_now = __cycles_2_ns(tsc_now);
-
-	prev_scale = *scale;
-	if (cpu_khz)
-		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
 
 	sched_clock_idle_wakeup_event(0);
 	local_irq_restore(flags);
-- 
1.5.3.7



Re: High wake up latencies with FAIR_USER_SCHED

2008-01-29 Thread Guillaume Chazarain
On Jan 29, 2008 6:47 AM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:

> IMHO this is expected results and if someone really needs to cut down
> this latency, they can reduce sysctl_sched_latency (which will be bad
> from perf standpoint, as we will cause more cache thrashing with that).

Thank you very much for the detailed explanation Srivatsa, that made a
lot of sense. Unfortunately, it means I'll disable FAIR_USER_SCHED as
I initially thought these latencies were caused by my local patches
that give each group a load proportional to the max load of its
elements. Anyway, I don't absolutely need a fair user scheduler on my
laptop, but low latencies in the default configuration are nice to
have.

I just thought about something to restore low latencies with
FAIR_GROUP_SCHED, but it's possibly utter nonsense, so bear with me
;-) The idea would be to turn the trees upside down. The scheduler
would only see tasks (on the leaves), so it could apply its interactivity
magic, but the hierarchical groups would be used to compute dynamic
loads for each task according to their position in the tree:

- now:
  - we schedule each level of the tree starting from the root

- with my proposal:
  - we schedule tasks like with !FAIR_GROUP_SCHED, but
    calc_delta_fair() would traverse the tree starting from the leaves to
    compute the dynamic load.
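
One way to read the leaf-to-root traversal is as a product of per-level weight fractions. The sketch below is a toy model under that reading, not what calc_delta_fair() does in the kernel: the Entity class and effective_share() are hypothetical, and every entity is assumed to be runnable.

```python
class Entity:
    """A task or a group in a hypothetical scheduling hierarchy."""

    def __init__(self, name, weight, parent=None):
        self.name = name
        self.weight = weight
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def effective_share(task):
    """Walk from a leaf task up to the root, multiplying at each level the
    fraction of the parent's total child weight held by the current entity."""
    share = 1.0
    node = task
    while node.parent is not None:
        siblings_weight = sum(child.weight for child in node.parent.children)
        share *= node.weight / siblings_weight
        node = node.parent
    return share


if __name__ == "__main__":
    root = Entity("root", 1024)
    user1 = Entity("user1", 1024, root)
    user2 = Entity("user2", 1024, root)
    a = Entity("a", 1024, user1)
    b = Entity("b", 1024, user1)
    c = Entity("c", 1024, user2)
    for task in (a, b, c):
        print(task.name, effective_share(task))
```

With two equally weighted users, the two tasks of user1 each get a quarter of the CPU and the single task of user2 gets half, so a flat scheduler using these shares as task loads would still be fair between users.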

Thanks.

-- 
Guillaume


Re: High wake up latencies with FAIR_USER_SCHED

2008-01-29 Thread Guillaume Chazarain
On Jan 29, 2008 6:47 AM, Srivatsa Vaddagiri [EMAIL PROTECTED] wrote:

 IMHO this is expected results and if someone really needs to cut down
 this latency, they can reduce sysctl_sched_latency (which will be bad
 from perf standpoint, as we will cause more cache thrashing with that).

Thank you very much for the detailed explanation Srivatsa, that made a
lot of sense. Unfortunately, it means I'll disable FAIR_USER_SCHED as
I initially thought these latencies were caused by my local patches
that give each group a load proportional to the max load of its
elements. Anyway, I don't absolutely need a fair user scheduler on my
laptop, but low latencies in the default configuration are nice to
have.

I just thought about something to restore low latencies with
FAIR_GROUP_SCHED, but it's possibly utter nonsense, so bear with me
;-) The idea would be to reverse the trees upside down. The scheduler
would only see tasks (on the leaves) so could apply its interactivity
magic, but the hierarchical groups would be used to compute dynamic
loads for each task according to their position in the tree:

- now:
  - we schedule each level of the tree starting from the root

- with my proposition:
  - we schedule tasks like with !FAIR_GROUP_SCHED, but
calc_delta_fair() would traverse the tree starting from the leaves to
compute the dynamic load.

Thanks.

-- 
Guillaume
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86: remove unused code in set_cyc2ns_scale()

2008-01-29 Thread Guillaume Chazarain
This should be fold into:
4f95bd6e2b21a8c724357463f8341502d47aba13
x86: scale cyc_2_nsec according to CPU frequency

Signed-off-by: Guillaume Chazarain [EMAIL PROTECTED]
---
 arch/x86/kernel/tsc_32.c |   14 +-
 arch/x86/kernel/tsc_64.c |   14 +-
 2 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 43517e3..e05e221 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -83,20 +83,16 @@ DEFINE_PER_CPU(unsigned long, cyc2ns);
 
 static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-   unsigned long flags, prev_scale, *scale;
-   unsigned long long tsc_now, ns_now;
+   unsigned long flags, *scale;
+
+   if (!cpu_khz)
+   return;
 
local_irq_save(flags);
sched_clock_idle_sleep_event();
 
scale = per_cpu(cyc2ns, cpu);
-
-   rdtscll(tsc_now);
-   ns_now = __cycles_2_ns(tsc_now);
-
-   prev_scale = *scale;
-   if (cpu_khz)
-   *scale = (NSEC_PER_MSEC  CYC2NS_SCALE_FACTOR)/cpu_khz;
+   *scale = (NSEC_PER_MSEC  CYC2NS_SCALE_FACTOR)/cpu_khz;
 
/*
 * Start smoothly with the new frequency:
diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c
index 947554d..e0e9d4f 100644
--- a/arch/x86/kernel/tsc_64.c
+++ b/arch/x86/kernel/tsc_64.c
@@ -44,20 +44,16 @@ DEFINE_PER_CPU(unsigned long, cyc2ns);
 
 static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-   unsigned long flags, prev_scale, *scale;
-   unsigned long long tsc_now, ns_now;
+   unsigned long flags, *scale;
+
+   if (!cpu_khz)
+       return;
 
    local_irq_save(flags);
    sched_clock_idle_sleep_event();
 
    scale = &per_cpu(cyc2ns, cpu);
-
-   rdtscll(tsc_now);
-   ns_now = __cycles_2_ns(tsc_now);
-
-   prev_scale = *scale;
-   if (cpu_khz)
-       *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+   *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
 
    sched_clock_idle_wakeup_event(0);
    local_irq_restore(flags);
-- 
1.5.3.7



Re: High wake up latencies with FAIR_USER_SCHED

2008-01-28 Thread Guillaume Chazarain
Unfortunately it seems to not be completely fixed, with this script:

#!/usr/bin/python

import os
import time

SLEEP_TIME = 0.1
SAMPLES = 5
PRINT_DELAY = 0.5

def print_wakeup_latency():
    times = []
    last_print = 0
    while True:
        start = time.time()
        time.sleep(SLEEP_TIME)
        end = time.time()
        times.insert(0, end - start - SLEEP_TIME)
        del times[SAMPLES:]
        if end > last_print + PRINT_DELAY:
            copy = times[:]
            copy.sort()
            print '%f ms' % (copy[len(copy)/2] * 1000)
            last_print = end

if os.fork() == 0:
    if os.fork() == 0:
        os.setuid(1)
        while True:
            pass
    else:
        os.setuid(2)
        while True:
            pass
else:
    os.setuid(1)
    print_wakeup_latency()

I get seemingly unpredictable latencies (with or without the patch applied):

# ./sched.py
14.810944 ms
19.829893 ms
1.968050 ms
8.021021 ms
-0.017977 ms
4.926109 ms
11.958027 ms
5.995893 ms
1.992130 ms
0.007057 ms
0.217819 ms
-0.004864 ms
5.907202 ms
6.547832 ms
-0.012970 ms
0.209951 ms
-0.002003 ms
4.989052 ms

Without FAIR_USER_SCHED, latencies are consistently in the noise.
Also, I forgot to mention that I'm on a single CPU.

Thanks for the help.

-- 
Guillaume


Re: High wake up latencies with FAIR_USER_SCHED

2008-01-28 Thread Guillaume Chazarain
Hi Srivatsa,

On Jan 28, 2008 3:31 AM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> Given that sysctl_sched_wakeup_granularity is set to 10ms by default,
> this doesn't sound abnormal.

Indeed, by lowering sched_wakeup_granularity I get much better
latencies, but lowering sched_latency seems to be more effective.

> NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks and
> not group-level entities. With the patch attached, I could see that wakeup
> latencies with FAIR_USER_SCHED are restored to the same level as
> !FAIR_USER_SCHED.

Thanks for the patch, it works perfectly.

> However I am not sure whether that is the way to go. We want to let one group
> of tasks running as much as possible until the fairness/wakeup-latency
> threshold is exceeded. If someone does want better wakeup latencies between
> groups too, they can always tune sysctl_sched_wakeup_granularity.

Having an inconsistency here between FAIR_USER_SCHED and
!FAIR_USER_SCHED sounds strange, but Ingo took the patch, so I'm happy
:-)

Thanks for the replies.

-- 
Guillaume


High wake up latencies with FAIR_USER_SCHED

2008-01-27 Thread Guillaume Chazarain
Hi,

I noticed some strangely high wake up latencies with FAIR_USER_SCHED
using this script:

#!/usr/bin/python

import os
import time

SLEEP_TIME = 0.1
SAMPLES = 100
PRINT_DELAY = 0.5

def print_wakeup_latency():
    times = []
    last_print = 0
    while True:
        start = time.time()
        time.sleep(SLEEP_TIME)
        end = time.time()
        times.insert(0, end - start - SLEEP_TIME)
        del times[SAMPLES:]
        if end > last_print + PRINT_DELAY:
            copy = times[:]
            copy.sort()
            print '%f ms' % (copy[len(copy)/2] * 1000)
            last_print = end

if os.fork() == 0:
    os.setuid(1)
    for i in xrange(2):
        if os.fork() == 0:
            while True:
                pass
else:
    os.setuid(2) # <-- here
    print_wakeup_latency()

We have two busy loops with UID=1.
And UID=2 maintains the running median of its wake up latency.
I get these latencies:

# ./sched.py
4.300022 ms
4.801178 ms
4.604006 ms
4.606867 ms
4.604006 ms
4.606867 ms
4.604006 ms
4.606867 ms
4.606867 ms
4.676008 ms
4.604006 ms
4.604006 ms
4.606867 ms

Disabling FAIR_USER_SCHED restores wake up latencies in the noise:

# ./sched.py
-0.156975 ms
-0.067091 ms
-0.022984 ms
-0.022984 ms
-0.022030 ms
-0.022030 ms
-0.022030 ms
-0.021076 ms
-0.015831 ms
-0.015831 ms
-0.016069 ms
-0.015831 ms

Strangely enough, another way to restore normal latencies is to change
setuid(2) to setuid(1), that is, putting the latency measurement in
the same group as the two busy loops.
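
For reference, the running median kept by the script can be isolated as below. This is only a sketch in Python 3 (the script itself is Python 2, where len(copy)/2 is already an integer division); push_sample() and window_median() are names made up for the example, and the sample values are arbitrary.

```python
SAMPLES = 5  # window size, kept small for the example


def push_sample(window, value, max_samples=SAMPLES):
    """Insert the newest sample at the front and drop the oldest ones."""
    window.insert(0, value)
    del window[max_samples:]


def window_median(window):
    """Upper median of the window, like copy[len(copy)/2] in the script."""
    ordered = sorted(window)
    return ordered[len(ordered) // 2]


window = []
for latency in [0.004, 0.001, 0.009, 0.002, 0.007, 0.003]:
    push_sample(window, latency)
median = window_median(window)
```

Sorting a copy on every print is fine here because the window is small; the median also hides the occasional outlier that a plain average would show.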

Thanks in advance for any help.

-- 
Guillaume


Re: Dropping some patches from sched-devel

2008-01-25 Thread Guillaume Chazarain
On Jan 25, 2008 5:58 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> sure, done.

Thanks.

> what method are you using of determining quality?

I was talking about code quality: adding a dependency on jiffies does
not seem like a good idea. As for the clock quality, I was focusing on
getting rid of underflows and overflows, so I relaxed the checks. But I
realized all these underflows are definitely needed: the conversion
from TSC to sched_clock always rounds down, so over time it lags a bit.
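
To illustrate the rounding, here is a small model of a fixed-point cycles-to-nanoseconds conversion. It is a sketch under assumptions: the conversion is taken to be cycles * scale >> CYC2NS_SCALE_FACTOR with scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) / cpu_khz and a factor of 10; the 3 GHz frequency and the Python function names are only chosen for the example.

```python
NSEC_PER_MSEC = 1000000
CYC2NS_SCALE_FACTOR = 10  # 2^10


def cyc2ns_scale(cpu_khz):
    """Fixed-point scale; the integer division truncates."""
    return (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) // cpu_khz


def cycles_2_ns(cycles, scale):
    """Convert cycles to nanoseconds; the shift truncates as well."""
    return (cycles * scale) >> CYC2NS_SCALE_FACTOR


cpu_khz = 3000000                      # assumed 3 GHz CPU
scale = cyc2ns_scale(cpu_khz)
one_second_of_cycles = cpu_khz * 1000  # cycles elapsed in one real second
lag_ns = 1000000000 - cycles_2_ns(one_second_of_cycles, scale)
```

In this model the converted clock falls behind by a bit less than a millisecond per second of TSC cycles, and it can never run ahead since both operations truncate.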

> Could you perhaps try
> to automate it? (even better would be some self-test within the kernel
> that detects badness)

I find the overflow/underflow/warps checks you added in the first
place to be sufficient. Not sure we want to add more tests to
differentiate between normal and abnormal drifts.

Thanks for your prompt reply.

-- 
Guillaume


Dropping some patches from sched-devel

2008-01-25 Thread Guillaume Chazarain
Hi Ingo,

Can I talk you into dropping these patches of mine from sched-devel
(or not send them to Linus):

da0f9440cdcb1edd5424de91f326de83de3fe5f9 sched: make sure jiffies is
up to date before calling __update_rq_clock()
6eb300ad38fef6db4efe177067a65aaa771596da sched: fix rq->clock
overflows detection with CONFIG_NO_HZ

They are not of good enough quality, and I'm working on a better approach.

Thanks.

-- 
Guillaume


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
Guillaume Chazarain <[EMAIL PROTECTED]> wrote:

> FYI, I'm currently trying to track down where rq->clock started to
> overflow with nohz=off, and it seems to be before 2.6.23, so my patches
> are not at fault ;-) Or maybe I am dreaming and it was always
> overflowing. Investigating ...

And the winner is:

commit 529c77261bccd9d37f110f58b0753d95beaa9fa2
Author: Ingo Molnar <[EMAIL PROTECTED]>
Date:   Fri Aug 10 23:05:11 2007 +0200

sched: improve rq-clock overflow logic

improve the rq-clock overflow logic: limit the absolute rq->clock
delta since the last scheduler tick, instead of limiting the delta
itself.

tested by Arjan van de Ven - whole laptop was misbehaving due to
an incorrectly calibrated cpu_khz confusing sched_clock().

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Signed-off-by: Arjan van de Ven <[EMAIL PROTECTED]>

diff --git a/kernel/sched.c b/kernel/sched.c
index b0afd8d..6247e4a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -263,6 +263,7 @@ struct rq {
 
    unsigned int clock_warps, clock_overflows;
    unsigned int clock_unstable_events;
+   u64 tick_timestamp;
 
    atomic_t nr_iowait;
 
@@ -341,8 +342,11 @@ static void __update_rq_clock(struct rq *rq)
        /*
         * Catch too large forward jumps too:
         */
-       if (unlikely(delta > 2*TICK_NSEC)) {
-           clock++;
+       if (unlikely(clock + delta > rq->tick_timestamp + TICK_NSEC)) {
+           if (clock < rq->tick_timestamp + TICK_NSEC)
+               clock = rq->tick_timestamp + TICK_NSEC;
+           else
+               clock++;
            rq->clock_overflows++;
        } else {
            if (unlikely(delta > rq->clock_max_delta))
@@ -3308,9 +3312,16 @@ void scheduler_tick(void)
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;
+   u64 next_tick = rq->tick_timestamp + TICK_NSEC;
 
    spin_lock(&rq->lock);
    __update_rq_clock(rq);
+   /*
+    * Let rq->clock advance by at least TICK_NSEC:
+    */
+   if (unlikely(rq->clock < next_tick))
+       rq->clock = next_tick;
+   rq->tick_timestamp = rq->clock;
    update_cpu_load(rq);
    if (curr != rq->idle) /* FIXME: needed? */
        curr->sched_class->task_tick(rq, curr);


Seems like I originally was not the only one seeing 2 jiffies jumps ;-)
I'll adapt my patches.
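
To make the effect of the new check easier to see, here is a small Python model of the forward-jump test from the __update_rq_clock() hunk above. It is only a sketch: TICK_NSEC is set as if HZ=1000, update_clock() is a made-up name, and the non-overflow path is assumed to simply add delta to the clock since the hunk does not show it.

```python
TICK_NSEC = 1000000  # assumed HZ=1000 for this illustration


def update_clock(clock, delta, tick_timestamp):
    """Return (new_clock, overflowed) following the patched forward-jump check."""
    limit = tick_timestamp + TICK_NSEC
    if clock + delta > limit:
        if clock < limit:
            clock = limit   # clamp to one tick past the last scheduler tick
        else:
            clock += 1      # already at the limit: creep forward by 1 ns
        return clock, True
    # Assumption: the normal path advances the clock by the raw delta.
    return clock + delta, False


within_tick = update_clock(0, 500000, 0)
late_tick = update_clock(0, 5000000, 0)
at_limit = update_clock(1000000, 10, 0)
```

Any jump that would carry the clock past tick_timestamp + TICK_NSEC is counted as an overflow, whereas the old test tolerated deltas of up to two ticks.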

-- 
Guillaume


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> ok. I have applied all but this one

Hmm, I couldn't find them in mingo/linux-2.6-sched-devel.git.

> i think it's much simpler to do what i have below. Could you try it on 
> your box? Or if it is using ACPI idle - in that case the callbacks 
> should already be there and there should be no need for further fixups.
> 
> Subject: x86: idle wakeup event in the HLT loop

I use ACPI, so this patch has no effect.

FYI, I'm currently trying to track down where rq->clock started to
overflow with nohz=off, and it seems to be before 2.6.23, so my patches
are not at fault ;-) Or maybe I am dreaming and it was always
overflowing. Investigating ...

-- 
Guillaume


Re: [patch] block: fix blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> Correction: it was not a high res time source, it was "the scheduler's 
> per-cpu, non-exported, non-coherent, warps-and-jumps-like-hell high-res 
> timesource that was intentionally called the _sched_ clock" ;-)

I think the warts of cpu_clock() are fixable, except maybe the lack of
synchronization across CPUs on SMP, which is harder.

-- 
Guillaume


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
David Dillow <[EMAIL PROTECTED]> wrote:

> Patched kernel, nohz=off:
>   .clock_underflows  : 213887

A word of warning about these patches: they are WIP, which is why I
did not send them earlier. They regress nohz=off.

A bit of context: these patches aim at making sure cpu_clock() on my
laptop (cpufreq enabled) never overflows/underflows/warps with
CONFIG_NO_HZ enabled. With these patches, I have a few hundred
overflows and underflows during early bootup, and then nothing :-)

Ingo Molnar <[EMAIL PROTECTED]> wrote:

> they are from the scheduler git tree (except the first debug patch), but 
> queued up for v2.6.25 at the moment.

You are talking about "x86: scale cyc_2_nsec according to CPU
frequency" here, but I don't think it is at stake here as David has:

> CONFIG_CPU_FREQ is not set

Let me review my patches myself to give a bit of context:

> sched: monitor clock underflows in /proc/sched_debug

This, I'd like to have it in .25 just for convenience.

> x86: scale cyc_2_nsec according to CPU frequency

You already know that one ;-)

> sched: fix rq->clock warps on frequency changes

This is a bugfix for .25 once the previous patch is applied. I don't
think it helps David, but it could help blktrace users with cpufreq
enabled.

> sched: Fix rq->clock overflows detection with CONFIG_NO_HZ

I think this one is the most important for David, but unfortunately it
has some problems.

> +static inline u64 max_skipped_ticks(struct rq *rq)
> +{
> + return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen + 2 : 1;
> +}

Here, I initially wrote rq->last_tick_seen + 1 but experiments showed
that +2 was needed as I really saw deltas of 2 milliseconds.
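
Spelled out in Python, as a sketch only: the ternary applies to the whole expression, so the window is a single tick when nohz is off, and the number of elapsed jiffies plus two when it is on. The arguments stand in for nohz_on(cpu_of(rq)), jiffies and rq->last_tick_seen.

```python
def max_skipped_ticks(nohz_on, jiffies, last_tick_seen):
    """Model of the helper quoted above: how many ticks the clock may jump."""
    if nohz_on:
        # Ticks may have been stopped: allow every elapsed jiffy, plus the
        # window of 2 that turned out to be needed in practice.
        return jiffies - last_tick_seen + 2
    # Periodic tick: at most one tick can elapse between two updates.
    return 1


periodic = max_skipped_ticks(False, 100, 90)
tickless_idle = max_skipped_ticks(True, 100, 90)
tickless_busy = max_skipped_ticks(True, 100, 100)
```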

These patches have two objectives:
 - taking into account that jiffies are not always incremented by 1
thanks to nohz
 - as the tick is stopped and restarted it may not tick at the exact
expected moment, so allow a window of 1 jiffy. If the tick occurs
during the right jiffy, we know the TSC is more precise than the tick
so don't correct the clock.

And the problem is that I seem to need a window of 2 jiffies, so I need
some help.

> sched: make sure jiffies is up to date before calling __update_rq_clock()

This one is needed too but I'm less confident in its validity.

> scheduler_tick() is not called every jiffies

This one is a bit ugly and seems to break nohz=off.

> - if (unlikely(rq->clock < next_tick)) {
> + if (unlikely(rq->clock < next_tick - nohz_on(cpu) * TICK_NSEC)) {

No, I'm not proud of this :-(

Thanks.

-- 
Guillaume


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-10 Thread Guillaume Chazarain
David Dillow <[EMAIL PROTECTED]> wrote:

> At the moment, I'm not sure how to track this farther, or how to fix it
> properly. Any advice would be appreciated.

Just out of curiosity, could you try the appended cumulative patch and
report .clock_warps, .clock_overflows and .clock_underflows as you did before?

Thanks.

commit 20fa02359d971bdb820d238184fabd42d8018e4f
Author: Guillaume Chazarain <[EMAIL PROTECTED]>
Date:   Thu Jan 10 23:36:43 2008 +0100

sched: monitor clock underflows in /proc/sched_debug

We monitor clock overflows, let's also monitor clock underflows.

    Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 37cf07a..cab9756 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -317,7 +317,7 @@ struct rq {
    u64 clock, prev_clock_raw;
    s64 clock_max_delta;
 
-   unsigned int clock_warps, clock_overflows;
+   unsigned int clock_warps, clock_overflows, clock_underflows;
    u64 idle_clock;
    unsigned int clock_deep_idle_events;
    u64 tick_timestamp;
@@ -3485,8 +3485,10 @@ void scheduler_tick(void)
    /*
     * Let rq->clock advance by at least TICK_NSEC:
     */
-   if (unlikely(rq->clock < next_tick))
+   if (unlikely(rq->clock < next_tick)) {
        rq->clock = next_tick;
+       rq->clock_underflows++;
+   }
    rq->tick_timestamp = rq->clock;
    update_cpu_load(rq);
    if (curr != rq->idle) /* FIXME: needed? */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 80fbbfc..9e5de09 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -179,6 +179,7 @@ static void print_cpu(struct seq_file *m, int cpu)
    PN(prev_clock_raw);
    P(clock_warps);
    P(clock_overflows);
+   P(clock_underflows);
    P(clock_deep_idle_events);
    PN(clock_max_delta);
    P(cpu_load[0]);

commit c146421cae64bb626714dc951fa39b55d2f819c1
Author: Guillaume Chazarain <[EMAIL PROTECTED]>
Date:   Wed Jan 2 14:10:17 2008 +0100

commit 60c6397ce4e8c9fd7feaeaef4167ace71c3949c8

x86: scale cyc_2_nsec according to CPU frequency

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ [EMAIL PROTECTED]: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 9ebc0da..00bb4c1 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  * [EMAIL PROTECTED] "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-   cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+   unsigned long flags, prev_scale, *scale;
+   unsigned long long tsc_now, ns_now;
+
+   local_irq_save(flags);
+   sched_clock_idle_sleep_event();
+
+   scale = &per_cpu(cyc2ns, cpu);
+
+   rdtscll(tsc_now);
+   ns_now = __cycles_2_ns(tsc_now);
+
+   prev_scale = *scale;
+   if (cpu_khz)
+       *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+   /*
+    * Start smoothly with the new frequency:
+*/
+   sched_clock_idle_wakeup_event(0);
+   local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_block *nb, unsigned long val, void *data)
ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
-   set_cyc2ns_scale(cpu_khz);
+   preempt_disable();
+   set_cyc2ns_scale(cpu_khz, smp_processor_id());
+   preempt_enable();
/*
 * TSC based sched_clock turns
 * to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_reliable(void) { }
 
 void __init tsc_init(void)
 {
+   int cpu;
+
if (!cpu_has_tsc || tsc_disable)
goto out_no_tsc;
 
@@ -380,7 +403,15 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);
 
-   set_cyc2ns_scale(cpu_khz);
+   /*
+* Secondary CPUs d

[PATCH] fs-writeback: handle errors in sync_sb_inodes()

2008-01-07 Thread Guillaume Chazarain
Currently it is possible for some errors to be detected at write-back
time but not reported to the program as shown by the following script
using the included make_file.c.

-8<-8<-8<-8<-8<-8<-
#!/bin/sh

# We binary search the size of a file in 40M filesystem that can cause
# the missed error.
MIN=500
MAX=5000
rm fs.40M
dd if=/dev/zero of=fs.40M bs=40M count=0 seek=1 status=noxfer
#mkfs.ext2 -F fs.40M
mkfs.ext3 -F fs.40M
#mkfs.jfs -q fs.40M
#mkfs.reiserfs -fq fs.40M
#mkfs.xfs fs.40M

attempt()
{
    SIZE=$1
    RES=0
    ./make_file valid_file $SIZE
    mount fs.40M /mnt -o loop
    if ! ./make_file /mnt/not_enough_space $SIZE; then
        # We could not create the file as the requested size
        # was clearly too big
        RES=1
    fi
    umount /mnt

    if [ $RES -eq 0 ]; then
        mount fs.40M /mnt -o loop
        if cmp valid_file /mnt/not_enough_space; then
            # The file was too small, it fitted in the filesystem
            RES=-1
        fi
        umount /mnt
    fi

    if [ $RES -eq 0 ]; then
        echo "Undetected ENOSPC with SIZE=$SIZE"
        exit
    fi

    return $RES
}

while [ $((MAX - MIN)) -gt 1 ]; do
    SIZE=$(((MIN + MAX) / 2))
    attempt $SIZE
    RES=$?
    if [ $RES -eq 1 ]; then
        MAX=$SIZE
    else
        MIN=$SIZE
    fi
done

echo "Could not reproduce the problem"

-8<-8<-8<-8<-8<-8<-
/* make_file.c */
#include <unistd.h>
#include <sys/fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int size, fd;
    char *mapping;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s FILE SIZE\n", argv[0]);
        return 1;
    }
    size = atoi(argv[2]);

    fd = open(argv[1], O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror(argv[1]);
        return 1;
    }
    if (ftruncate(fd, size) < 0) {
        perror("ftruncate");
        return 1;
    }
    mapping = mmap(NULL, size, PROT_WRITE, MAP_SHARED, fd, 0);
    if (mapping == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(mapping, 0xFF, size);

    /* Force a write-back */
    sync();

    if (msync(mapping, size, MS_SYNC) < 0) {
        perror("msync");
        return 1;
    }
    if (close(fd) < 0) {
        perror("close");
        return 1;
    }
    printf("%s: successfully written %d bytes\n", argv[1], size);
    return 0;
}
-8<-8<-8<-8<-8<-8<-
make_file.c mmaps a hole, performs some writeback (memset + sync) and
then expects to find some error code in msync(). The script mounts a
40M loopback filesystem and does a binary search to find the size of
a file big enough to provoke an ENOSPC, but small enough to show the
error not being detected at msync() time.

The error window is large enough for such a size to be quickly found, but with
this patch, no such file size can be found.

All mmap-capable filesystems I tested are affected (ext2, ext3,
jfs, reiserfs, xfs). XFS is special in that it survives the test thanks
to the page_mkwrite() work, i.e. the program gets a SIGBUS during memset.
Anyway, this behaviour solves ENOSPC but does nothing for EIO.

The offending code is in fs/fs-writeback.c:

sync_sb_inodes(...) ()
{
...
__writeback_single_inode(inode, wbc);
...
}
__writeback_single_inode() gets the error from mapping->flags, clears it
and returns it. But sync_sb_inodes() ignores this return value. In -mm
there is sync_sb_inodes-propagate-errors.patch that propagates the
error from __writeback_single_inode upwards in the call stack. IMHO,
this propagation is useless because:

- the error is combined from the errors in all the synced inodes, so it
just tells that some inode in a specific fs got an error,
- nobody in the call stack is interested in this error: certainly not
pdflush, or 'void sync(2)'.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
---

 fs/fs-writeback.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0fca820..88bb3c4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -417,6 +417,7 @@ sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
struct address_space *mapping = inode->i_mapping;
struct backing_dev_info *bdi = mapping->backing_dev_info;
long pages_skipped;
+   int err;
 
if (!bdi_cap_writeback_dirty(bdi)) {
redirty_tail(inode);
@@ -461,7 +462,8 @@ sync_sb_inodes(struct

Re: [PATCH] proc: return -EPERM when preventing read of /proc/*/maps

2008-01-04 Thread Guillaume Chazarain
Al Viro <[EMAIL PROTECTED]> wrote:

> How about this:

At least the task_mmu part works fine.

Tested-by: Guillaume Chazarain <[EMAIL PROTECTED]>

-- 
Guillaume
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] proc: return -EPERM when preventing read of /proc/*/maps

2008-01-04 Thread Guillaume Chazarain
Al Viro <[EMAIL PROTECTED]> wrote:

> vma_stop() doesn't need changes either...

Hmmm, not sure ;-)

$ cat /proc/1/maps
Pid: 2282, comm: cat Not tainted (2.6.24-rc6-gc2 #185)
EIP: 0060:[<c01a4080>] EFLAGS: 00010286 CPU: 0
EIP is at vma_stop+0xd/0x21
EAX: f7c90360 EBX: f7c90360 ECX: c042b5f0 EDX: 
ESI: f62aa240 EDI:  EBP: f62daf24 ESP: f62daf20
 DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
Process cat (pid: 2282, ti=f62da000 task=f6264d20 task.ti=f62da000)
Stack: f7c90360 f62daf30 c01a40dc f62d0080 f62daf70 c018bdf1 0400 0804f000 
   f62d0080 f62aa260   0400 f62cc000 f62dafb0  
    f62d0080 c018bc9e 0804f000 f62daf90 c01751c5 f62daf9c  
Call Trace:
 [<c0104e4a>] show_trace_log_lvl+0x1a/0x2f
 [<c0104efc>] show_stack_log_lvl+0x9d/0xa5
 [<c0104fa6>] show_registers+0xa2/0x1b8
 [<c01051d9>] die+0x11d/0x202
 [<c03319f9>] do_general_protection+0x1f7/0x1ff
 [<c0331172>] error_code+0x6a/0x70
 [<c01a40dc>] m_stop+0xe/0x29
 [<c018bdf1>] seq_read+0x153/0x25a
 [<c01751c5>] vfs_read+0xa6/0x158
 [<c0175583>] sys_read+0x3d/0x61
 [<c0103ea2>] sysenter_past_esp+0x6b/0xa1
 ===
Code: 89 50 18 31 d2 89 48 1c 83 c4 5c 89 d0 5b 5e 5f 5d c3 55 31 c9 89 e5 e8 
80 fd ff ff 5d c3 55 85 d2 89 e5 53 74 16 3b 50 08 74 11 <8b> 1a 8d 43 34 e8 80 
ea f8 ff 89 d8 e8 16 89 f7 ff 5b 5d c3 55 
EIP: [<c01a4080>] vma_stop+0xd/0x21 SS:ESP 0068:f62daf20
---[ end trace 297d07fbbfc82b7b ]---

This is an inconsistency in the handling of errors in m_start() between
fs/proc/task_mmu.c and fs/proc/task_nommu.c.

task_mmu.c:
if (IS_ERR(mm) || !mm)
return mm;

task_nommu.c:
if (IS_ERR(mm) || !mm) {
put_task_struct(priv->task);
priv->task = NULL;
return mm;
}

task_nommu.c does the cleanup while task_mmu.c defers it to m_stop.

-- 
Guillaume


[PATCH] proc: return -EPERM when preventing read of /proc/*/maps

2008-01-04 Thread Guillaume Chazarain
Return an error instead of successfully reading an empty file.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
Acked-by: Al Viro <[EMAIL PROTECTED]>
---

 fs/proc/base.c   |2 +-
 fs/proc/task_mmu.c   |6 +++---
 fs/proc/task_nommu.c |4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)


diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7411bfb..3aebc85 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -219,7 +219,7 @@ out:
task_unlock(task);
up_read(&mm->mmap_sem);
mmput(mm);
-   return NULL;
+   return ERR_PTR(-EPERM);
 }
 
 static int proc_pid_cmdline(struct task_struct *task, char * buffer)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8043a3e..74b4829 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -398,8 +398,8 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return NULL;
 
mm = mm_for_maps(priv->task);
-   if (!mm)
-   return NULL;
+   if (IS_ERR(mm) || !mm)
+   return mm;
 
priv->tail_vma = tail_vma = get_gate_vma(priv->task);
 
@@ -437,7 +437,7 @@ out:
 
 static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
 {
-   if (vma && vma != priv->tail_vma) {
+   if (vma && !IS_ERR(vma) && vma != priv->tail_vma) {
struct mm_struct *mm = vma->vm_mm;
up_read(&mm->mmap_sem);
mmput(mm);
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 1932c2c..53cb062 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -166,10 +166,10 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return NULL;
 
mm = mm_for_maps(priv->task);
-   if (!mm) {
+   if (IS_ERR(mm) || !mm) {
put_task_struct(priv->task);
priv->task = NULL;
-   return NULL;
+   return mm;
}
 
/* start from the Nth VMA */



Re: [PATCH] proc: advertise new restrictions on /proc/*/maps & /proc/*/smaps

2008-01-04 Thread Guillaume Chazarain
Al Viro <[EMAIL PROTECTED]> wrote:

> The whole point is that we have to reject it at read() time, not open()
> time.

Yes, my patch was a complement to yours to propagate the -EPERM in easy
cases. As you noted it added restrictions on reading /proc/*/maps, even
though I found them acceptable.

How about this instead?

Maybe you'd prefer to propagate the actual -EPERM from
__ptrace_may_attach but that would be more invasive.

Sidenote: do you think a sparse annotation to check IS_ERR/PTR_ERR
usage would make sense?

proc: return -EPERM when preventing read of /proc/*/maps

Return an error instead of successfully reading an empty file.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
---

 fs/proc/base.c   |2 +-
 fs/proc/task_mmu.c   |8 +++++---
 fs/proc/task_nommu.c |4 ++--
 3 files changed, 8 insertions(+), 6 deletions(-)


diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7411bfb..3aebc85 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -219,7 +219,7 @@ out:
task_unlock(task);
up_read(&mm->mmap_sem);
mmput(mm);
-   return NULL;
+   return ERR_PTR(-EPERM);
 }
 
 static int proc_pid_cmdline(struct task_struct *task, char * buffer)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8043a3e..db57e65 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -398,8 +398,8 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return NULL;
 
mm = mm_for_maps(priv->task);
-   if (!mm)
-   return NULL;
+   if (IS_ERR(mm) || !mm)
+   return mm;
 
priv->tail_vma = tail_vma = get_gate_vma(priv->task);
 
@@ -437,7 +437,7 @@ out:
 
 static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
 {
-   if (vma && vma != priv->tail_vma) {
+   if (vma && !IS_ERR(vma) && vma != priv->tail_vma) {
struct mm_struct *mm = vma->vm_mm;
up_read(&mm->mmap_sem);
mmput(mm);
@@ -451,6 +451,8 @@ static void *m_next(struct seq_file *m, void *v, loff_t *pos)
struct vm_area_struct *tail_vma = priv->tail_vma;
 
(*pos)++;
+   if (IS_ERR(vma))
+   return vma;
if (vma && (vma != tail_vma) && vma->vm_next)
return vma->vm_next;
vma_stop(priv, vma);
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 1932c2c..53cb062 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -166,10 +166,10 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return NULL;
 
mm = mm_for_maps(priv->task);
-   if (!mm) {
+   if (IS_ERR(mm) || !mm) {
put_task_struct(priv->task);
priv->task = NULL;
-   return NULL;
+   return mm;
}
 
/* start from the Nth VMA */


-- 
Guillaume


[PATCH] proc: advertise new restrictions on /proc/*/maps & /proc/*/smaps

2008-01-03 Thread Guillaume Chazarain
Now that strangers are kept out of /proc/<pid>/maps, let's welcome them
with -EPERM instead of a blank file.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
---

 fs/proc/base.c |8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7411bfb..c824b23 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2207,7 +2207,7 @@ static const struct pid_entry tgid_base_stuff[] = {
INF("cmdline",S_IRUGO, pid_cmdline),
INF("stat",   S_IRUGO, tgid_stat),
INF("statm",  S_IRUGO, pid_statm),
-   REG("maps",   S_IRUGO, maps),
+   REG("maps",   S_IRUSR, maps),
 #ifdef CONFIG_NUMA
REG("numa_maps",  S_IRUGO, numa_maps),
 #endif
@@ -2219,7 +2219,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("mountstats", S_IRUSR, mountstats),
 #ifdef CONFIG_MMU
REG("clear_refs", S_IWUSR, clear_refs),
-   REG("smaps",  S_IRUGO, smaps),
+   REG("smaps",  S_IRUSR, smaps),
 #endif
 #ifdef CONFIG_SECURITY
DIR("attr",   S_IRUGO|S_IXUGO, attr_dir),
@@ -2533,7 +2533,7 @@ static const struct pid_entry tid_base_stuff[] = {
INF("cmdline",   S_IRUGO, pid_cmdline),
INF("stat",  S_IRUGO, tid_stat),
INF("statm", S_IRUGO, pid_statm),
-   REG("maps",  S_IRUGO, maps),
+   REG("maps",  S_IRUSR, maps),
 #ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, numa_maps),
 #endif
@@ -2544,7 +2544,7 @@ static const struct pid_entry tid_base_stuff[] = {
REG("mounts",S_IRUGO, mounts),
 #ifdef CONFIG_MMU
REG("clear_refs", S_IWUSR, clear_refs),
-   REG("smaps", S_IRUGO, smaps),
+   REG("smaps", S_IRUSR, smaps),
 #endif
 #ifdef CONFIG_SECURITY
DIR("attr",  S_IRUGO|S_IXUGO, attr_dir),



Re: separate objdir Makefile regression in 2.6.24-rc*

2007-12-13 Thread Guillaume Chazarain
On Dec 13, 2007 2:48 PM, Andi Kleen <[EMAIL PROTECTED]> wrote:
>
> 2.6.24-rc5 doesn't seem to create Makefiles in empty obj dirs anymore

Known problem ;-)
See 
http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/188cbd12d7c0871b/194fbc7c94314b2c

-- 
Guillaume


[PATCH] kbuild: Re-enable Makefile generation in a new O=... directory

2007-12-11 Thread Guillaume Chazarain
The patch "kbuild: fix building with O=.. options"
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=18c32dac75b187d1a4e858f3cfdf03e844129f5e
disabled the creation of a Makefile in a new O=... directory. Restore it.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
---

 scripts/mkmakefile |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/scripts/mkmakefile b/scripts/mkmakefile
index 9ad1bd7..e0f54b9 100644
--- a/scripts/mkmakefile
+++ b/scripts/mkmakefile
@@ -13,7 +13,7 @@
 test ! -r $2/Makefile -o -O $2/Makefile || exit 0
 # Only overwrite automatically generated Makefiles
 # (so we do not overwrite kernel Makefile)
-if ! grep -q Automatically $2/Makefile
+if test -e $2/Makefile && ! grep -q Automatically $2/Makefile
 then
 	exit 0
 fi


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Guillaume Chazarain
Arjan van de Ven <[EMAIL PROTECTED]> wrote:

> the frequency of both cores is the maximum of what linux sets each core to;

Do you mean that the cpufreq code can be confused about the actual
frequency of the cores? That sounds like a big problem.

Thanks for any insight.

-- 
Guillaume


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Guillaume Chazarain
Stefano Brivio <[EMAIL PROTECTED]> wrote:

> Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in
> delays isn't that good when using my crappy unstable TSC (mdelay(2000)
> causes delays between 2 and 2.9 seconds) but it's not depending on frequency
> changes anymore. So I'd say it's fixed, but please tell me if you want me
> to do any other test so as to be sure it is.

Ingo,

it seems you dropped http://lkml.org/lkml/2007/12/7/100 (cpu_clock()
based udelay), so how can udelay be affected by your proposed changes?

Thanks.

-- 
Guillaume
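
The question above turns on delays that are timed against a clock rather than against a calibrated cycle count. As a rough user-space analogue (not the patch at the URL above), a delay loop can spin until a monotonic clock has advanced far enough. clock_ns() and clock_udelay() are hypothetical names, and CLOCK_MONOTONIC is only a stand-in for the kernel's cpu_clock():

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds (user-space stand-in for cpu_clock()). */
static uint64_t clock_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least `usecs` microseconds, measured on the clock. */
static void clock_udelay(uint64_t usecs)
{
    uint64_t start = clock_ns();
    uint64_t wait_ns = usecs * 1000;

    while (clock_ns() - start < wait_ns)
        ;  /* spin; elapsed time, not a loop count, ends the wait */
}
```

Because the exit condition is elapsed time, the wait does not shrink or stretch when the loop itself runs faster or slower, which is the property being asked about for udelay().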


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Guillaume Chazarain
On Dec 10, 2007 9:42 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> although some claimed effect was on udelay()/mdelay() too.

Any specific report?
The jumping sched_clock on frequency change caused some
scheduling oddities for me, but CFS attenuated the effect.

Thanks.

-- 
Guillaume


Re: [git pull] x86/hrtimer/acpi fixes

2007-12-09 Thread Guillaume Chazarain
On Dec 9, 2007 7:01 PM, Pavel Machek <[EMAIL PROTECTED]> wrote:
> > + *  ns += offset to avoid sched_clock jumps with cpufreq
> > + *
> >   *   [EMAIL PROTECTED] "math is hard, lets go shopping!"
> >   */
>
> Did john add the 'ns+=' or do comments need reorder?

I added it, but I think it needs to be removed as now the offset is maintained
by the scheduler in __update_rq_clock().

Thanks.

-- 
Guillaume


Re: [PATCH] scheduler: fix x86 regression in native_sched_clock

2007-12-08 Thread Guillaume Chazarain
On Dec 8, 2007 9:52 AM, Ingo Molnar <[EMAIL PROTECTED]> wrote:

> the scariest bit isnt even the scaling i think - that is a fairly
> straightforward and clean PER_CPU-ization of the global scaling factor,
> and its hookup with cpufreq events. (and the credit for that goes to
> Guillaume Chazarain)

To be fair, the cpufreq hooks were already there, I just did a buggy percpu
conversion and added an offset that you removed ;-)

-- 
Guillaume


Re: [PATCH] scheduler: fix x86 regression in native_sched_clock

2007-12-07 Thread Guillaume Chazarain
On Fri, 7 Dec 2007 15:54:18 +0100,
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> This is a version that 
> is supposed fix all known aspects of TSC and frequency-change 
> weirdnesses.

Tested it with frequency changes, the clock is as smooth as I like
it :-)

The only remaining sched_clock user in need of conversion seems to be
lockdep.

Great work.

-- 
Guillaume


Re: [patch] x86: scale cyc_2_nsec according to CPU frequency

2007-12-07 Thread Guillaume Chazarain
On Fri, 7 Dec 2007 14:55:25 +0100,
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> Firstly, we dont need the 'offset' anymore because cpu_clock() maintains 
> offsets itself.

Yes, but a lower-quality one. __update_rq_clock() tries to compensate for
large clock jumps with jiffy resolution, while my offset arranges
for a very smooth frequency transition.

I agree with keeping a single offset, but I liked the fact that with my
patch on frequency change, the clock had no jump at all.

> + *  ns += offset to avoid sched_clock jumps with cpufreq

I guess this needs to go away if I don't make my point :-(

> + printk("CPU#%d: changed cyc2ns scale from %ld to %ld\n",
> + cpu, prev_scale, *scale);

Pointing it out just to be sure it does not end up in the final version ;-)

Thanks for cleaning up my mess ;-)

-- 
Guillaume
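
To make the point about the offset concrete, here is a minimal user-space sketch of the scale-plus-offset conversion, assuming a TSC whose tick rate follows the CPU frequency. It mirrors the arithmetic of set_cyc2ns_scale() in the patch from this thread, but takes the TSC value as a parameter and uses a single struct instead of per-CPU data; those simplifications are assumptions made for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10            /* 2^10, as in the patch */
#define NSEC_PER_MSEC UINT64_C(1000000)

/* Per-CPU in the patch; a single instance is enough for this sketch. */
struct cyc2ns_params {
    uint64_t scale;   /* nanoseconds per cycle, multiplied by 2^10 */
    uint64_t offset;  /* nanoseconds added to keep the clock continuous */
};

static uint64_t cycles_2_ns(const struct cyc2ns_params *p, uint64_t cyc)
{
    return ((cyc * p->scale) >> CYC2NS_SCALE_FACTOR) + p->offset;
}

/* Switch to a new frequency at TSC value tsc_now without a clock jump. */
static void set_cyc2ns_scale(struct cyc2ns_params *p, uint64_t cpu_khz,
                             uint64_t tsc_now)
{
    uint64_t ns_now;

    assert(cpu_khz != 0);
    ns_now = cycles_2_ns(p, tsc_now);   /* reading just before the change */
    p->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) / cpu_khz;
    /* Unsigned wraparound makes this correct even when the new scale
     * would produce a larger reading than ns_now. */
    p->offset += ns_now - cycles_2_ns(p, tsc_now);
}
```

For example, starting at 2 GHz and switching to 1 GHz at cycle 2,000,000,000, the reading at the switch point stays at 1 s, and the next 1,000,000,000 cycles add exactly one more second.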


Re: [PATCH] scheduler: fix x86 regression in native_sched_clock

2007-12-07 Thread Guillaume Chazarain
On Dec 7, 2007 12:18 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote:
> Any pointer to it?

Never mind, I found it ... in this same thread :-(

-- 
Guillaume


Re: [PATCH] scheduler: fix x86 regression in native_sched_clock

2007-12-07 Thread Guillaume Chazarain
On Dec 7, 2007 12:13 PM, Nick Piggin <[EMAIL PROTECTED]> wrote:
> My patch should fix the worst cpufreq sched_clock jumping issue
> I think.

Any pointer to it?

Thanks.

-- 
Guillaume


Re: [PATCH] scheduler: fix x86 regression in native_sched_clock

2007-12-07 Thread Guillaume Chazarain
On Fri, 7 Dec 2007 09:51:21 +0100,
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> yeah, we can do something like this in 2.6.25 - this will improve the 
> quality of sched_clock().

Thanks a lot for your interest!

I'll clean it up and resend it later. As I don't have the necessary
knowledge to do the tsc_{32,64}.c unification, should I copy paste
common functions into tsc_32.c and tsc_64.c to ease later unification
or should I start a common .c file?

Thanks again for showing interest.

-- 
Guillaume


Re: [PATCH] scheduler: fix x86 regression in native_sched_clock

2007-12-07 Thread Guillaume Chazarain
"Guillaume Chazarain" <[EMAIL PROTECTED]> wrote:

> On Dec 7, 2007 6:51 AM, Thomas Gleixner <[EMAIL PROTECTED]> wrote:
> > Hmrpf. sched_clock() is used for the time stamp of the printks. We
> > need to find some better solution other than killing off the tsc
> > access completely.
> 
> Something like http://lkml.org/lkml/2007/3/16/291 that would need some 
> refresh?

And here is a refreshed one just for testing with 2.6-git. The 64 bit
part is a shamelessly untested copy/paste as I cannot test it.

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 9ebc0da..d561b2f 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -78,15 +79,32 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  ([EMAIL PROTECTED])
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  * [EMAIL PROTECTED] "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
+
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-   cyc2ns_scale = (100 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+   struct cyc2ns_params *params;
+   unsigned long flags;
+   unsigned long long tsc_now, ns_now;
+
+   rdtscll(tsc_now);
+   params = &get_cpu_var(cyc2ns);
+
+   local_irq_save(flags);
+   ns_now = __cycles_2_ns(params, tsc_now);
+
+   params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+   params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+   local_irq_restore(flags);
+
+   put_cpu_var(cyc2ns);
 }
 
 /*
diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c
index 9c70af4..93e7a06 100644
--- a/arch/x86/kernel/tsc_64.c
+++ b/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,25 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-   cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+   struct cyc2ns_params *params;
+   unsigned long flags;
+   unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-   return (cyc * cyc2ns_scale) >> NS_SCALE;
+   rdtscll(tsc_now);
+   params = &get_cpu_var(cyc2ns);
+
+   local_irq_save(flags);
+   ns_now = __cycles_2_ns(params, tsc_now);
+
+   params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+   params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+   local_irq_restore(flags);
+
+   put_cpu_var(cyc2ns);
 }
 
 unsigned long long sched_clock(void)
diff --git a/include/asm-x86/timer.h b/include/asm-x86/timer.h
index 0db7e99..ff4f2a3 100644
--- a/include/asm-x86/timer.h
+++ b/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  * ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,44 @@ extern int recalibrate_cpu_khz(void);
  * And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  ([EMAIL PROTECTED])
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  * [EMAIL PROTECTED] "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+struct cyc2ns_params {
+   unsigned long scale;
+   unsigned long long offset;
+};
+
+DECLARE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(struct cyc2ns_params *params,
+  unsigned long long cyc)
 {
-   return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+   return ((cyc * params->scale) >> CYC2NS_SCALE_FACTOR) + params->offset
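
The header comment in the patch describes the conversion as a multiply followed by a shift, with the scale derived from the frequency in kHz. A small stand-alone sketch of that arithmetic follows, assuming the same 2^10 scale factor; scale_from_khz() and cycles_to_ns() are hypothetical names, and the offset term is left out so that only the fixed-point rounding is visible:

```c
#include <assert.h>
#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10   /* 2^10, as in the patch */

/* Scale = (10^6 * 2^10) / cpu_khz, truncated to an integer. */
static uint64_t scale_from_khz(uint64_t cpu_khz)
{
    assert(cpu_khz != 0);
    return (UINT64_C(1000000) << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

/* ns = (cycles * scale) >> 10: the division has become a shift. */
static uint64_t cycles_to_ns(uint64_t cyc, uint64_t scale)
{
    return (cyc * scale) >> CYC2NS_SCALE_FACTOR;
}
```

At 1 GHz the scale is exactly 1024, so 1,000,000,000 cycles convert to exactly 1 s. At 1.5 GHz the exact scale would be about 682.67 but is truncated to 682, so 1,500,000,000 cycles convert to 999,023,437 ns, roughly 0.1% short of a second.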


Re: [PATCH] scheduler: fix x86 regression in native_sched_clock

2007-12-06 Thread Guillaume Chazarain
On Dec 7, 2007 6:51 AM, Thomas Gleixner <[EMAIL PROTECTED]> wrote:
> Hmrpf. sched_clock() is used for the time stamp of the printks. We
> need to find some better solution other than killing off the tsc
> access completely.

Something like http://lkml.org/lkml/2007/3/16/291 that would need some refresh?

-- 
Guillaume


Re: 2.6.24-rc3: find complains about /proc/net

2007-11-20 Thread Guillaume Chazarain
On 11/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> i guess it was a v2.6.24 change, hence a regression that needs to be
> fixed?

It seems to be

http://git.kernel.org/?p=linux/kernel/git/tglx/history.git;a=commitdiff;h=01660410

So, linux 2.6.0-test6

-- 
Guillaume


Re: [PATCH] proc: Fix the threaded /proc/self.

2007-11-20 Thread Guillaume Chazarain
Hello Eric,

This fills a need I had to get the current TID in a Java program,
so I'm very interested in this change. OTOH, how will someone
not reading LKML discover that the current TID is now in
/proc/self and that it was not always the case?

I would put my 2 cents in /proc/self/task/self; this way TGIDs are
always in /proc and TIDs in /proc/TGID/task.

-- 
Guillaume
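
For comparison with the /proc-based approach discussed in this message, here is a minimal C sketch of how a native program can obtain its thread ID, assuming Linux with /proc mounted. current_tid() and task_dir_exists() are hypothetical helper names; the sketch uses the existing /proc/TGID/task/TID layout and does not rely on the /proc/self/task/self entry suggested above:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel thread ID of the caller, via the raw gettid system call. */
static pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* Return 1 if /proc/<tgid>/task/<tid> exists as a directory, else 0. */
static int task_dir_exists(pid_t tgid, pid_t tid)
{
    char path[64];
    struct stat st;

    assert(tgid > 0 && tid > 0);
    snprintf(path, sizeof(path), "/proc/%ld/task/%ld", (long)tgid, (long)tid);
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

In a single-threaded process current_tid() equals getpid(), and task_dir_exists(getpid(), current_tid()) reports that the matching task directory is present.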


Re: [PATCH] kconfig: use $K64BIT to set 64BIT with all*config targets

2007-11-11 Thread Guillaume Chazarain
On 11/11/07, Sam Ravnborg <[EMAIL PROTECTED]> wrote:
> > So it's not strictly an
> > output directory, more a build directory.
> The opposite
> All output is placed there - including the configuration generated by
> the *config frontends.

I meant, it's not strictly an output directory as if I do

make O=dir oldconfig

it will _read_ dir/.config, so the O= directory is also used for input.
And yes, I was splitting hairs ;-)

Sorry for the confusion.

-- 
Guillaume


Re: [PATCH] kconfig: use $K64BIT to set 64BIT with all*config targets

2007-11-11 Thread Guillaume Chazarain
On 11/11/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> Another important point is that users that know about and see CONFIG_*
> variables are kernel hackers, not the normal kconfig users.

But kconfig is mainly for kernel hackers, otherwise it would be
called CML2 ;-)

> > Also, when working on a specific feature of the kernel, I tend to
> > install both a kernel with the CONFIG_ option set and one with
> > the option unset. Scripts to do that can twiddle the .config file,
> > but it would be more convenient if kbuild could avoid that.
>
> I'm wondering why you don't use two different O= output directories
> instead?
>
> Depending on the CONFIG_ option in question this might even greatly
> reduce your compile times.

/me is filled with wonder at the discovery that .config is saved in the O=
directory. Thanks a lot Adrian for this time saver. So it's not strictly an
output directory, more a build directory.
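
A rough sketch of that two-directory setup, assuming a POSIX shell and a source tree that holds no .config of its own. The helper name seed_build_dir and the paths in the example are made up for illustration and are not part of kbuild. The helper only seeds a directory with a .config and prints the make commands to run; it builds nothing itself:

```shell
# Hypothetical helper (not part of kbuild): give a build directory its own
# copy of a configuration, then print the commands to run from the source tree.
# usage: seed_build_dir BASE_CONFIG BUILD_DIR
seed_build_dir() {
    base_config=$1
    build_dir=$2
    mkdir -p "$build_dir" || return 1
    cp "$base_config" "$build_dir/.config" || return 1
    # With O=, the *config targets read and write BUILD_DIR/.config.
    echo "make O=$build_dir oldconfig"
    echo "make O=$build_dir"
}

# Example (all paths are placeholders):
#   seed_build_dir ~/configs/base.config ../build-foo-on
#   seed_build_dir ~/configs/base.config ../build-foo-off
# Then change CONFIG_FOO in one of the two .config files and run the printed
# commands for each directory.
```

Each directory keeps its own object files, so switching between the two variants should not force a full rebuild.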

I still think "make oldconfig CONFIG_FOO=bar" is useful for the occasional
config change, but thanks again for this great tip.

-- 
Guillaume


Re: [PATCH] kconfig: use $K64BIT to set 64BIT with all*config targets

2007-11-11 Thread Guillaume Chazarain
Hi Adrian,

On 11/11/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> What exactly are the use cases where someone would need this?

Glad you asked. Today, when I want to recompile a kernel while
changing a CONFIG_ option, I manually edit the .config,
remove the appropriate line and then run make oldconfig.
I'd like to be able to do: make oldconfig CONFIG_FOO=bar.
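
A minimal sketch of that manual edit, assuming a POSIX shell with mktemp available. The helper name set_config_option is made up for illustration and is not part of kbuild. It drops whichever form the symbol currently takes in the file and appends the new value; make oldconfig is then run by hand:

```shell
# Hypothetical helper (not part of kbuild): set one symbol in a .config file.
# usage: set_config_option FILE NAME VALUE   (NAME without the CONFIG_ prefix)
set_config_option() {
    file=$1
    name=$2
    value=$3
    tmp=$(mktemp) || return 1
    # A symbol appears either as "CONFIG_FOO=..." or "# CONFIG_FOO is not set";
    # delete both forms before appending the new assignment.
    sed -e "/^CONFIG_${name}=/d" \
        -e "/^# CONFIG_${name} is not set\$/d" "$file" > "$tmp" || return 1
    echo "CONFIG_${name}=${value}" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example:
#   set_config_option .config FOO y
#   make oldconfig
```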

Also, when working on a specific feature of the kernel, I tend to
install both a kernel with the CONFIG_ option set and one with
the option unset. Scripts to do that can twiddle the .config file,
but it would be more convenient if kbuild could avoid that.

As you see, I'm more interested in make oldconfig than
make all*config.

Cheers.

-- 
Guillaume


Re: [PATCH] kconfig: use $K64BIT to set 64BIT with all*config targets

2007-11-10 Thread Guillaume Chazarain
Hi,

On 11/10/07, Sam Ravnborg <[EMAIL PROTECTED]> wrote:
> The variable K64BIT can now be used to select the
> value of CONFIG_64BIT.

Why not call the environment variable CONFIG_64BIT,
in preparation for the day when all CONFIG_ variables can
be passed as environment variables?

-- 
Guillaume


Re: [PATCH] replace "make ARCH=i386/x86_64 with make ARCH=x86"

2007-11-05 Thread Guillaume Chazarain
On 11/6/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> The issue with "make allyesconfig" concerns me, although the same
> situation already exists with any multiple-choice configuration.  What I
> guess we really want is to be able to specify a few specific choices.

I don't know enough about Kbuild to know if it's possible or not, but I
would find it great if the *config targets could take CONFIG_ variables
on the command line, like:

make oldconfig CONFIG_SMP=y

If it's not possible, why not inherit the CONFIG_ options from environment
variables, like we already do for $CFLAGS, but only at make *config
time in this case?
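
To make the environment-variable idea concrete, here is a sketch of a wrapper that could sit in front of make oldconfig, assuming a POSIX shell. This is not something kbuild does; the function name apply_env_config is hypothetical. It copies every CONFIG_* variable found in the environment into the given .config, replacing any existing setting of the same symbol:

```shell
# Hypothetical wrapper (not a kbuild feature): fold CONFIG_* environment
# variables into a .config file before running a *config target.
# usage: apply_env_config FILE
apply_env_config() {
    file=$1
    env | grep '^CONFIG_[A-Za-z0-9_]*=' | while IFS= read -r assignment; do
        sym=${assignment%%=*}
        # Remove the symbol's current line, whichever form it takes.
        sed -e "/^${sym}=/d" -e "/^# ${sym} is not set\$/d" "$file" \
            > "$file.tmp" && mv "$file.tmp" "$file"
        echo "$assignment" >> "$file"
    done
}

# Example:
#   export CONFIG_SMP=y
#   apply_env_config .config
#   make oldconfig
```

Values containing newlines are not handled; the sketch only aims to show the shape of the idea.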

-- 
Guillaume


Re: [PATCH] Fix delay accounting regression

2007-11-02 Thread Guillaume Chazarain
On 11/2/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

> What user-space tools are utilizing delay-accounting by the way?

Thanks for the plugging opportunity ;-)
http://guichaz.free.fr/misc/#iotop uses the I/O side of delay-accounting.

-- 
Guillaume


[PATCH] sched: CONFIG_FAIR_USER_SCHED: auto adjust users weights

2007-10-31 Thread Guillaume Chazarain
CONFIG_FAIR_USER_SCHED is great and I'm happy to see it is enabled by default
but it suffers from some limitations IMHO at this time:

- on a single user system, it's useful to have root processes be given twice
as much CPU as user processes, but I don't want nice 19 cron jobs like updatedb or
rpmq to get twice as much CPU as my nice -20 tasks.

- on a multi user system, a user should be able to give back their CPU share to
other users. This is not possible for now with CONFIG_FAIR_USER_SCHED.

This implies that returning EPERM on nice(<0) becomes worthless, as it is
equivalent to nice(>0) for every other process of the user, ignoring the limits
of the nice range.

To address these problems, this patch changes the weight of the cfs_rq of each
user to the maximum weight of the processes on this cfs_rq, scaled with
/sys/kernel/uids/UID/cpu_share. It's possible that more elaborate mathematics
than taking the max are needed, but basic testing showed the expected fairness.
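
As a back-of-the-envelope illustration of that rule (shell arithmetic, not the kernel code): the sketch below takes the heaviest task weight on a user's queue, scales it by the user's cpu_share, and floors the result at 2. NICE_0_LOAD is assumed to be 1024 here, and the sample weights (15 for nice 19, 1024 for nice 0) are the scheduler's usual per-nice weights; the function name is made up.

```shell
NICE_0_LOAD=1024   # assumed value of the nice-0 weight

# usage: group_weight CPU_SHARE WEIGHT...
# Prints max(WEIGHT...) * CPU_SHARE / NICE_0_LOAD, with a floor of 2.
# (For an empty queue the patch leaves the weight untouched; this sketch
# expects at least one weight.)
group_weight() {
    share=$1
    shift
    max=0
    for w in "$@"; do
        if [ "$w" -gt "$max" ]; then
            max=$w
        fi
    done
    scaled=$(( max * share / NICE_0_LOAD ))
    if [ "$scaled" -lt 2 ]; then
        scaled=2
    fi
    echo "$scaled"
}

# root (cpu_share assumed 2048) running only a nice 19 job:
#   group_weight 2048 15        -> 30
# a user (cpu_share assumed 1024) running a nice 19 and a nice 0 task:
#   group_weight 1024 15 1024   -> 1024
```

With these numbers, root's nice 19 cron job no longer outweighs the user's nice 0 task, which is the behaviour the first bullet above asks for.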

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
---

 include/linux/sched.h |    4 ++
 kernel/sched.c        |   50 +++
 kernel/sched_fair.c   |  108 +
 3 files changed, 154 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 155d743..d6d2db9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -908,6 +908,10 @@ struct sched_entity {
/* rq "owned" by this entity/group: */
struct cfs_rq   *my_q;
 #endif
+#ifdef CONFIG_FAIR_USER_SCHED
+   /* used to track the max load.weight */
+   struct rb_node  max_load;
+#endif
 };
 
 struct task_struct {
diff --git a/kernel/sched.c b/kernel/sched.c
index 3f6bd11..df8114b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -260,6 +260,10 @@ struct cfs_rq {
struct list_head leaf_cfs_rq_list; /* Better name : task_cfs_rq_list? */
struct task_group *tg;/* group that "owns" this runqueue */
 #endif
+#ifdef CONFIG_FAIR_USER_SCHED
+   /* used to track the sched_entity with the max load in this cfs_rq */
+   struct rb_root max_load_se;
+#endif
 };
 
 /* Real-Time classes' related field in a runqueue: */
@@ -7094,14 +7098,12 @@ done:
task_rq_unlock(rq, &flags);
 }
 
+/* cfs_rq->rq->lock must be taken */
 static void set_se_shares(struct sched_entity *se, unsigned long shares)
 {
struct cfs_rq *cfs_rq = se->cfs_rq;
-   struct rq *rq = cfs_rq->rq;
int on_rq;
 
-   spin_lock_irq(&rq->lock);
-
on_rq = se->on_rq;
if (on_rq)
dequeue_entity(cfs_rq, se, 0);
@@ -7111,22 +7113,54 @@ static void set_se_shares(struct sched_entity *se, unsigned long shares)
 
if (on_rq)
enqueue_entity(cfs_rq, se, 0);
+}
 
-   spin_unlock_irq(&rq->lock);
+#ifdef CONFIG_FAIR_USER_SCHED
+static void update_group_share(struct task_group *tg, int cpu)
+{
+   struct rb_node *max_load_node = rb_last(&tg->cfs_rq[cpu]->max_load_se);
+   struct sched_entity *max_load_entry;
+   unsigned long shares;
+
+   if (!max_load_node)
+   /* empty cfs_rq */
+   return;
+
+   max_load_entry = rb_entry(max_load_node, struct sched_entity, max_load);
+   shares = scale_tg_weight(tg, max_load_entry->load.weight);
+   set_se_shares(tg->se[cpu], shares);
+}
+#else
+static void update_group_share(struct task_group *tg, int cpu)
+{
+   set_se_shares(tg->se[cpu], tg->shares);
 }
+#endif
 
 int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
-   int i;
+   int cpu;
+   unsigned long flags;
+
+   if (shares <= 1)
+   return -EINVAL;
+
+#ifdef CONFIG_FAIR_USER_SCHED
+   if ((shares * prio_to_weight[0]) / prio_to_weight[0] != shares)
+   /* The provided value would overflow in scale_tg_weight() */
+   return -EINVAL;
+#endif
 
spin_lock(&tg->lock);
if (tg->shares == shares)
goto done;
 
tg->shares = shares;
-   for_each_possible_cpu(i)
-   set_se_shares(tg->se[i], shares);
-
+   for_each_possible_cpu(cpu) {
+   spin_lock_irqsave(&tg->cfs_rq[cpu]->rq->lock, flags);
+   update_group_share(tg, cpu);
+   spin_unlock_irqrestore(&tg->cfs_rq[cpu]->rq->lock, flags);
+   }
 done:
spin_unlock(&tg->lock);
return 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 01859f6..70ed34e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -135,6 +135,112 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
return se->vruntime - cfs_rq->min_vruntime;
 }
 
+#ifdef CONFIG_FAIR_USER_SCHED
+static void set_se_shares(struct sched_entity *se, unsigned long shares);
+
+static unsigned long scale_tg_weight(struct task_group *tg, unsigned long weight)
+{
+   unsigned long scaled_weight = (weight * tg->shares) / NICE_0_LOAD;
+   return max(scaled_weight, 2UL
