from:"Prarit Bhargava"

Re: [PATCH] hpet, allow user controlled mmap for user processes

2013-03-18 Thread Prarit Bhargava

The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
registers to userspace.  The Kconfig help points out that in some cases this
can be a security risk as some systems may erroneously configure the map such
that additional data is exposed to userspace.

This is a problem for distributions -- some users want the MMAP functionality
but it comes with a significant security risk.  In an effort to mitigate this
risk, and due to the low number of users of the MMAP functionality, I've
introduced a kernel parameter, hpet_mmap_enable, that is required in order
to actually have the HPET MMAP exposed.

[v2]: Clemens suggested modifying the Kconfig help text and making the
default setting configurable.

Signed-off-by: Prarit Bhargava 
Cc: Clemens Ladisch 
---
 Documentation/kernel-parameters.txt |    3 +++
 drivers/char/Kconfig                |   10 ++++++++--
 drivers/char/hpet.c                 |   21 +++++++++++++++++++--
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e567af3..dbf0d81 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -962,6 +962,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
VIA, nVidia)
verbose: show contents of HPET registers during setup
 
+   hpet_mmap_enable [X86, HPET_MMAP] option to expose HPET MMAP to
+userspace.  By default this is disabled.
+
hugepages=  [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
On x86-64 and powerpc, this option can be specified
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 3bb6fa3..7b830aa 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -534,10 +534,16 @@ config HPET_MMAP
  If you say Y here, user applications will be able to mmap
  the HPET registers.
 
+config HPET_MMAP_DEFAULT
+   int "Enable HPET MMAP access by default"
+   depends on HPET_MMAP
+   range 0 1
+   default 0
+   help
  In some hardware implementations, the page containing HPET
  registers may also contain other things that shouldn't be
- exposed to the user.  If this applies to your hardware,
- say N here.
+ exposed to the user. This option selects the default user access
+ to the HPET registers for applications that require it.  (0=off, 1=on)
 
 config HANGCHECK_TIMER
tristate "Hangcheck timer"
diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index e3f9a99..0f32e30 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -367,12 +367,26 @@ static unsigned int hpet_poll(struct file *file, poll_table * wait)
return 0;
 }
 
+#ifdef CONFIG_HPET_MMAP
+static int hpet_mmap_enabled = CONFIG_HPET_MMAP_DEFAULT;
+
+static __init int hpet_mmap_enable(char *str)
+{
+   get_option(&str, &hpet_mmap_enabled);
+   pr_info("HPET MMAP %s\n",
+   hpet_mmap_enabled ? "enabled" : "disabled");
+   return 1;
+}
+__setup("hpet_mmap", hpet_mmap_enable);
+
 static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
 {
-#ifdef CONFIG_HPET_MMAP
struct hpet_dev *devp;
unsigned long addr;
 
+   if (!hpet_mmap_enabled)
+   return -EACCES;
+
if (((vma->vm_end - vma->vm_start) != PAGE_SIZE) || vma->vm_pgoff)
return -EINVAL;
 
@@ -393,10 +407,13 @@ static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
}
 
return 0;
+}
 #else
+static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
+{
return -ENOSYS;
-#endif
 }
+#endif
 
 static int hpet_fasync(int fd, struct file *file, int on)
 {
-- 
1.7.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
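
For reference, a userspace consumer of this interface might look like the sketch below. It is not part of the patch: map_hpet_registers() is a hypothetical helper, and the device path passed in (typically /dev/hpet) is an assumption about the system. With the patch applied, the mmap() call would be expected to fail with EACCES when the mapping is disabled at runtime, and with ENOSYS when CONFIG_HPET_MMAP is not built in.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one read-only page of HPET registers.
 * Returns 0 and stores the mapping in *regs on success, or a negative
 * errno value on failure (for example -EACCES or -ENOSYS from mmap()).
 */
static int map_hpet_registers(const char *path, volatile void **regs)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;
	int fd, err;

	assert(path != NULL && regs != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* The driver only accepts a single page at offset 0. */
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	err = (p == MAP_FAILED) ? -errno : 0;
	close(fd);	/* the mapping, if any, outlives the descriptor */
	if (err)
		return err;

	*regs = p;
	return 0;
}
```

A caller can treat -EACCES as "ask the administrator to boot with the mmap parameter enabled" and fall back to another clock source otherwise.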

clockevents: WARNING when settimeofday() is called

2013-03-19 Thread Prarit Bhargava

[Note: I can also reproduce this on latest top-of-tree but I didn't keep my
debug output :(.  I'll checkout a new tree and do the same debug if necessary.]

settimeofday() causes a backtrace WARNING in the clockevents code on large CPU
count systems.  Debugging points to the lapic timer (which is used as a
broadcast timer) being problematic after settimeofday() is called.

I tried to debug this using printks, thinking that it was a problem in
settimeofday() where we release irqs and then adjust the hrtimers.  It turns
out that this is likely the case.  The comments with '<<<' are added after
the output to give an idea of what I added during debug.

[  131.044947] do_settimeofday: 1 irqs off cpu 11 with timespec sec = 100 nsec =
10 <<< debug line after irqs off in do_settimeofday()
[  131.052177] do_settimeofday: 2 current time sec = 1363692316 nsec = 199596399
[  131.059282] do_settimeofday: irqs on cpu 11 << end of settimeofday() before
call to clock_was_set();
[  131.063134] clock_was_set: called
[  131.066751] ------------[ cut here ]------------ <<< beginning of stack trace
... this is where the WARNING actually occurs.
[  131.066751] clock_was_set: on_each_cpu() <<< IPI from on_each_cpu()
[  131.066753] clock_was_set: finished
[  131.096448] WARNING: at kernel/time/clockevents.c:209
clockevents_program_event+0x135/0x140()
[  131.104935] Hardware name: Dinar
[  131.108150] Modules linked in: sg nfsv3 nfs_acl nfsv4 auth_rpcgss nfs
dns_resolver fscache lockd sunrpc nf_conntrack_netbios_ns nf_conntrack_broadcast
ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ipt_REJECT nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter
ip6_tables iptable_filter ip_tables kvm_amd kvm sp5100_tco bnx2 i2c_piix4
crc32c_intel k10temp fam15h_power ghash_clmulni_intel amd64_edac_mod pcspkr
serio_raw edac_mce_amd edac_core microcode xfs libcrc32c sr_mod sd_mod cdrom
ata_generic crc_t10dif pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ahci
pata_atiixp libahci libata usb_storage i2c_core dm_mirror dm_region_hash dm_log
dm_mod
[  131.176784] Pid: 0, comm: swapper/28 Not tainted 3.8.0+ #6
[  131.182248] Call Trace:
[  131.184684][] warn_slowpath_common+0x7f/0xc0
[  131.191312]  [] warn_slowpath_null+0x1a/0x20
[  131.197131]  [] clockevents_program_event+0x135/0x140
[  131.203721]  [] tick_program_event+0x24/0x30
[  131.209534]  [] hrtimer_interrupt+0x131/0x230
[  131.215437]  [] ? cpufreq_p4_target+0x130/0x130
[  131.221509]  [] smp_apic_timer_interrupt+0x69/0x99
[  131.227839]  [] apic_timer_interrupt+0x6d/0x80
[  131.233816][] ? sched_clock_cpu+0xc5/0x120
[  131.240267]  [] ? cpuidle_wrap_enter+0x50/0xa0
[  131.246252]  [] ? cpuidle_wrap_enter+0x49/0xa0
[  131.252238]  [] cpuidle_enter_tk+0x10/0x20
[  131.257877]  [] cpuidle_idle_call+0xa9/0x260
[  131.263692]  [] cpu_idle+0xaf/0x120
[  131.268727]  [] start_secondary+0x255/0x257
[  131.274449] ---[ end trace 1151a50552231615 ]---
[  131.279047] clockevents_program_event: cpu = 28 name = lapic expires.tv64 =
-11431039 <<< data from 'broken' clockevent.  Note, I have not seen any other
clockevent than the lapic.

ie) in the code,

int do_settimeofday(const struct timespec *tv)
{
struct timekeeper *tk = &timekeeper;
struct timespec ts_delta, xt;
unsigned long flags;

if (!timespec_valid_strict(tv))
return -EINVAL;

write_seqlock_irqsave(&tk->lock, flags);

... update time code ...

write_sequnlock_irqrestore(&tk->lock, flags);

/* signal hrtimers about time change */

clock_was_set(); <<< this is called without irq protection.  If an
hrtimer fires in the window that it takes to do the IPI in on_each_cpu() in
clock_was_set() then we get a WARNING.

return 0;
}


Is there something peculiar about the lapic timer that would cause this to
happen and not to other hrtimers?  ie) is there some subtlety of the lapic timer
that I'm missing?  If not then here are some possible fixes:

- I could lock the hrtimer_bases by acquiring all the base locks which
would prevent the hrtimer_interrupts from executing.  This would require some
reworking of the code (obviously) in both the hrtimer core code and the
places where we call clock_was_set() throughout the kernel... but that seems
awfully messy and there has to be a better way :/.

- It might be possible to do something with stop_machine() which
would guarantee cpu rendezvous in a state where no locks can be held and
then update the system time.  In theory the places where clock_was_set() needs
to be called are rare slow paths so it shouldn't be a problem to use the heavy
handed stop_machine().

Obviously, I'm looking for any suggestions on a solution.  I think I'm
familiar enough with the code to take a suggestion and code it.

P.
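
The logged value (expires.tv64 = -11431039) suggests why the WARNING fires. The sketch below is a simplified userspace model, assuming that the check at kernel/time/clockevents.c:209 in this tree is the negative-expiry test at the top of clockevents_program_event(); the trace itself does not confirm that. The model_* names are made up for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's ktime_t (tv64 is in nanoseconds). */
struct model_ktime {
	int64_t tv64;
};

/* Counts how often the modelled warning would have fired. */
static int model_warn_count;

/*
 * Model of the assumed sanity check: an absolute expiry time that is
 * negative is rejected with -ETIME after a warning.
 */
static int model_program_event(struct model_ktime expires)
{
	if (expires.tv64 < 0) {
		model_warn_count++;	/* stands in for WARN_ON_ONCE(1) */
		return -ETIME;
	}
	return 0;
}
```

Under that assumption, the open question is how the expiry handed to the lapic clockevent became negative in the window around clock_was_set(), not the check itself.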

Re: [PATCH] hpet, allow user controlled mmap for user processes

2013-03-19 Thread Prarit Bhargava



On 03/19/2013 03:43 AM, Clemens Ladisch wrote:
> Prarit Bhargava wrote:
>> The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
>> registers to userspace.  The Kconfig help points out that in some cases this
>> can be a security risk as some systems may erroneously configure the map such
>> that additional data is exposed to userspace.
>>
>> This is a problem for distributions -- some users want the MMAP functionality
>> but it comes with a significant security risk.  In an effort to mitigate this
>> risk, and due to the low number of users of the MMAP functionality, I've
>> introduced a kernel parameter, hpet_mmap_enable, that is required in order
>> to actually have the HPET MMAP exposed.
>>
>> [v2]: Clemens suggested modifying the Kconfig help text and making the
>> default setting configurable.
>>
>> Signed-off-by: Prarit Bhargava 
>> Cc: Clemens Ladisch 
> 
>> +++ b/Documentation/kernel-parameters.txt
>> +hpet_mmap_enable [X86, HPET_MMAP] option to expose HPET MMAP to
>> + userspace.  By default this is disabled.
> 
> This now takes a value.
> 
>> +int "Enable HPET MMAP access by default"
>> +range 0 1
> 
> Shouldn't this be bool?

I'll fix those in v3.

> 
>> +default 0
> 
> This breaks backwards compatibility.

Does backwards compatibility matter for something like this?  I have no problem
setting it to 1 but I'm more curious from a general kernel point of view.

I'll change this in v3 as well.

P.

> 
> 
> Regards,
> Clemens

Re: [PATCH] hpet, allow user controlled mmap for user processes

2013-03-19 Thread Prarit Bhargava

The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
registers to userspace.  The Kconfig help points out that in some cases this
can be a security risk as some systems may erroneously configure the map such
that additional data is exposed to userspace.

This is a problem for distributions -- some users want the MMAP functionality
but it comes with a significant security risk.  In an effort to mitigate this
risk, and due to the low number of users of the MMAP functionality, I've
introduced a kernel parameter, hpet_mmap_enable, that is required in order
to actually have the HPET MMAP exposed.

[v2]: Clemens suggested modifying the Kconfig help text and making the
default setting configurable.
[v3]: Fixed up Documentation and Kconfig entries, default now "Y"

Signed-off-by: Prarit Bhargava 
Cc: Clemens Ladisch 
---
 Documentation/kernel-parameters.txt |  4 ++++
 drivers/char/Kconfig                |  9 +++++++--
 drivers/char/hpet.c                 | 21 +++++++++++++++++++--
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e567af3..191 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -962,6 +962,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
VIA, nVidia)
verbose: show contents of HPET registers during setup
 
+   hpet_mmap=  [X86, HPET_MMAP] option to expose HPET MMAP to
+   userspace.  By default this is disabled. Values are
+   0(disabled) or 1(enabled).
+
hugepages=  [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
On x86-64 and powerpc, this option can be specified
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 3bb6fa3..51b62a1 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -534,10 +534,15 @@ config HPET_MMAP
  If you say Y here, user applications will be able to mmap
  the HPET registers.
 
+config HPET_MMAP_DEFAULT
+   bool "Enable HPET MMAP access by default"
+   default y
+   depends on HPET_MMAP
+   help
  In some hardware implementations, the page containing HPET
  registers may also contain other things that shouldn't be
- exposed to the user.  If this applies to your hardware,
- say N here.
+ exposed to the user. This option selects the default user access
+ to the HPET registers for applications that require it.
 
 config HANGCHECK_TIMER
tristate "Hangcheck timer"
diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index e3f9a99..0f32e30 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -367,12 +367,26 @@ static unsigned int hpet_poll(struct file *file, poll_table * wait)
return 0;
 }
 
+#ifdef CONFIG_HPET_MMAP
+static int hpet_mmap_enabled = IS_ENABLED(CONFIG_HPET_MMAP_DEFAULT);
+
+static __init int hpet_mmap_enable(char *str)
+{
+   get_option(&str, &hpet_mmap_enabled);
+   pr_info("HPET MMAP %s\n",
+   hpet_mmap_enabled ? "enabled" : "disabled");
+   return 1;
+}
+__setup("hpet_mmap", hpet_mmap_enable);
+
 static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
 {
-#ifdef CONFIG_HPET_MMAP
struct hpet_dev *devp;
unsigned long addr;
 
+   if (!hpet_mmap_enabled)
+   return -EACCES;
+
if (((vma->vm_end - vma->vm_start) != PAGE_SIZE) || vma->vm_pgoff)
return -EINVAL;
 
@@ -393,10 +407,13 @@ static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
}
 
return 0;
+}
 #else
+static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
+{
return -ENOSYS;
-#endif
 }
+#endif
 
 static int hpet_fasync(int fd, struct file *file, int on)
 {
-- 
1.8.1


[RFE PATCH] time, Fix setting of hardware clock in NTP code

2013-02-07 Thread Prarit Bhargava

I have found a long-standing bug in the ntp code that moves the time forward
by the local timezone offset on every reboot.  This is the sequence
indicating what happens during boot.

+ start boot
|
+ rtc read, written as UTC into xtime/system clock.  This time is rtc0_time
  below.
|
|
+ ... rest of initial kernel boot, initramfs, etc.
|
|
+ systemd/initscript/etc: set timezone data through first call to settimeofday()
- if LOCAL, a timezone offset is applied so that all applications "see"
   the system time as UTC, ie) sys_tz = {offset from UTC, 0}
- if UTC, no timezone offset is applied, ie) sys_tz = {0,0};
+ The first settimeofday() calls warp_clock().  xtime (system time) is set to
  rtc time + sys_tz.tz_minuteswest.
  On my system, the difference between the rtc and the system time is now
  300 mins.
|
|
+ ntpd starts
+   - RHEL7: the first adjtimex() clears the STA_PLL flag.  This causes
  STA_UNSYNC to be cleared and the system time/xtime to be written to
  the rtc via update_persistent_clock().
- if LOCAL, this means that the rtc now reads
  rtc + sys_tz.tz_minuteswest; on my system this is rtc + 300.
- if UTC, this means that the rtc on my system reads rtc + 0.

The problem with this model is what happens if /etc/adjtime is LOCAL, ie)
the rtc is set to localtime:

Reboot the system, on the next boot,
rtc0_time = rtc + sys_tz.tz_minuteswest
Reboot the system, on the next boot,
rtc0_time = rtc + sys_tz.tz_minuteswest + sys_tz.tz_minuteswest
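
The accumulation can be shown with a toy userspace model. This is only a sketch of the sequence described above: rtc_after_reboots() is a made-up name, elapsed time between steps is ignored so that only the drift is visible, and the compensate argument models subtracting the timezone offset before the rtc write.

```c
#include <assert.h>

/*
 * Toy model of repeated boots with the rtc kept in local time.
 * rtc_secs is the value stored in the rtc, minuteswest is the offset
 * handed to the first settimeofday() of each boot.
 */
static long rtc_after_reboots(long rtc_secs, int minuteswest, int boots,
			      int compensate)
{
	long offset = (long)minuteswest * 60;
	int i;

	assert(boots >= 0);
	for (i = 0; i < boots; i++) {
		long sys = rtc_secs;	/* boot: rtc read as if it were UTC */

		sys += offset;		/* warp on the first settimeofday() */
		/* ntp sync writes the system time back to the rtc */
		rtc_secs = compensate ? sys - offset : sys;
	}
	return rtc_secs;
}
```

With minuteswest = 300, each uncompensated boot moves the rtc forward by 18000 seconds, so two reboots leave it 36000 seconds ahead; with the compensation the rtc value is unchanged.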

AFAICT the only call to update_persistent_clock() in the kernel is in the
ntp code.  It is wired in to allow ntp to occasionally update the hardware
clock.  Other calls to update the rtc are made directly through the
RTC_SET_TIME ioctl.

I believe that the value passed into update_persistent_clock() is wrong.  It
should take into account the sys_tz data.  If the rtc is UTC, then the
offset is 0.  If the rtc is LOCAL, then there should be a 300 min offset
for the value of now.

We do not see this manifest on some architectures because they limit changes
to the rtc to +/-15 minutes of the current value of the rtc (x86, alpha,
mn10300).  Other arches do nothing (cris, mips, sh), and only a few seem to
show this problem (power, sparc).  I can reproduce this reliably on powerpc
with the latest Fedoras (17, 18, rawhide), as well as an Ubuntu powerpc spin.
I can also reproduce it on "older" OSes such as RHEL6.

A few things about the patch.  'sys_time_offset' certainly could have a
better name and it could be a bool.  Also, technically I do not need to add the
'adjust' struct in sync_cmos_clock() as the value of now.tv_sec is not used
after being passed into update_persistent_clock().  However, I think the code
is easier to follow if I do add 'adjust'.

--8<-

Take the timezone offset applied in warp_clock() into account when writing
the hardware clock in the ntp code.  This patch adds sys_time_offset which
indicates that an offset has been applied in warp_clock().

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Thomas Gleixner 
---
 include/linux/time.h |    1 +
 kernel/time.c        |    8 ++++++++
 kernel/time/ntp.c    |    8 ++++++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index 4d358e9..02754b5 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -117,6 +117,7 @@ static inline bool timespec_valid_strict(const struct timespec *ts)
 
 extern void read_persistent_clock(struct timespec *ts);
 extern void read_boot_clock(struct timespec *ts);
+extern int sys_time_offset;
 extern int update_persistent_clock(struct timespec now);
 void timekeeping_init(void);
 extern int timekeeping_suspended;
diff --git a/kernel/time.c b/kernel/time.c
index d226c6a..66533d3 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -115,6 +115,12 @@ SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv,
 }
 
 /*
+ * Indicates if there is an offset between the system clock and the hardware
+ * clock/persistent clock/rtc.
+ */
+int sys_time_offset;
+
+/*
  * Adjust the time obtained from the CMOS to be UTC time instead of
  * local time.
  *
@@ -135,6 +141,8 @@ static inline void warp_clock(void)
struct timespec adjust;
 
adjust = current_kernel_time();
+   if (sys_tz.tz_minuteswest > 0)
+   sys_time_offset = 1;
adjust.tv_sec += sys_tz.tz_minuteswest * 60;
do_settimeofday(&adjust);
 }
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 24174b4..39b88c4 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -510,8 +510,12 @@ static void sync_cmos_clock(struct work_struct *work)
}
 
getnstimeofday(&now);
-   if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec / 2)
-   fail = update_persistent_clock(now);
+   if (abs(now.tv_ns

Re: [RFE PATCH] time, Fix setting of hardware clock in NTP code

2013-02-07 Thread Prarit Bhargava

On 02/07/2013 12:24 PM, John Stultz wrote:
> On Thu, Feb 7, 2013 at 5:29 AM, Prarit Bhargava  wrote:
>> We do not see this manifest on some architectures because they limit changes
>> to the rtc to +/-15 minutes of the current value of the rtc (x86, alpha,
>> mn10300).  Other arches do nothing (cris, mips, sh), and only a few seem to
>> show this problem (power, sparc).  I can reproduce this reliably on powerpc
>> with the latest Fedoras (17, 18, rawhide), as well as an Ubuntu powerpc spin.
>> I can also reproduce it "older" OSes such as RHEL6.
> 
> Interesting.
> Yea, local RTC time is probably pretty rare outside of x86 (due to windows).
> And the +/- 15minute trick has always explicitly masked this issue there.
> 

I'm not sure I understand the purpose behind the +/-15 minute window?  Is it
just to prevent a wild swing on the RTC?  I can understand that to some degree,
however, I'm not sure I agree with it being the default behaviour.

Here's a real-world scenario:

My RTC on my laptop is set to 1:30PM Jan 7, 2013.  I boot, systemd and ntp do
their magic, and the system time comes up as Feb 7, 12:48PM.  I never will
notice that the RTC is wrong.

Now I go somewhere and have to work on a plane.  I have no internet connection
and then boot.  Now the system time will be 1:30PM Jan 7, 2013.  That's actually
happened to me and I remember filing it away for a bug to be looked at.

AFAIK, no other OS does that ... if I install Windows or use a Mac in the
no-internet connection case, the time is *always* corrected.  I tried to see if
I could get this to happen on a Mac and I can't.

99.9% of Linux users out there are using some sort of time protocol (usually
NTP, but PTP is starting to catch on) to sync their systems.  NTP is a trusted
source of timekeeping IMO.  How often do we see systems that run NTP but don't
trust the numbers that come from it?

We should be doing a full sync of the RTC in NTP, or at least it should be an
option/CONFIG option (FYI: I want to write a patch for that ... it'll give me
something to do).

> 
>> A few things about the patch.  'sys_time_offset' certainly could have a
>> better name and it could be a bool.  Also, technically I do not need to add the
>> 'adjust' struct in sync_cmos_clock() as the value of now.tv_sec is not used
>> after being passed into update_persistent_clock().  However, I think the code
>> is easier to follow if I do add 'adjust'.
>>
>> --8<-
>>
>> Take the timezone offset applied in warp_clock() into account when writing
>> the hardware clock in the ntp code.  This patch adds sys_time_offset which
>> indicates that an offset has been applied in warp_clock().
>>
>> Signed-off-by: Prarit Bhargava 
>> Cc: John Stultz 
>> Cc: Thomas Gleixner 
>> ---
>>  include/linux/time.h |1 +
>>  kernel/time.c|8 
>>  kernel/time/ntp.c|8 ++--
>>  3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/time.h b/include/linux/time.h
>> index 4d358e9..02754b5 100644
>> --- a/include/linux/time.h
>> +++ b/include/linux/time.h
>> @@ -117,6 +117,7 @@ static inline bool timespec_valid_strict(const struct timespec *ts)
>>
>>  extern void read_persistent_clock(struct timespec *ts);
>>  extern void read_boot_clock(struct timespec *ts);
>> +extern int sys_time_offset;
>>  extern int update_persistent_clock(struct timespec now);
>>  void timekeeping_init(void);
>>  extern int timekeeping_suspended;
>> diff --git a/kernel/time.c b/kernel/time.c
>> index d226c6a..66533d3 100644
>> --- a/kernel/time.c
>> +++ b/kernel/time.c
>> @@ -115,6 +115,12 @@ SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv,
>>  }
>>
>>  /*
>> + * Indicates if there is an offset between the system clock and the hardware
>> + * clock/persistent clock/rtc.
>> + */
>> +int sys_time_offset;
> 
> So why is this extra flag necessary instead of just using if
> (sys_tz.tz_minuteswest) ?

sys_tz can be set during runtime via settimeofday() without affecting the
current system time.  The bug *only* happens if the system clock is "warped"
ahead relative to the hardware clock on the first call to settimeofday(), so
checking for sys_tz.tz_minuteswest isn't a good enough test.
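
To illustrate the distinction, here is a small userspace model. It is a sketch under assumptions, not the kernel code: struct model_clock and the model_* functions are hypothetical, the warp is modelled as happening only on the first timezone-setting call, and the flag is set for any non-zero offset (the RFE patch as posted tests for > 0).

```c
#include <assert.h>
#include <stddef.h>

struct model_clock {
	long sys_secs;		/* system time in seconds */
	int minuteswest;	/* stands in for sys_tz.tz_minuteswest */
	int firsttime;		/* 1 until the first timezone-setting call */
	int sys_time_offset;	/* set only if the clock was warped */
};

/* Model of a settimeofday() call that passes only a timezone. */
static void model_set_timezone(struct model_clock *c, int minuteswest)
{
	assert(c != NULL);
	c->minuteswest = minuteswest;
	if (c->firsttime) {
		c->firsttime = 0;
		/* only the first call moves the system time */
		if (minuteswest != 0)
			c->sys_time_offset = 1;
		c->sys_secs += (long)minuteswest * 60;
	}
}

/* Value that would be written to the rtc on an ntp sync. */
static long model_rtc_write_value(const struct model_clock *c)
{
	assert(c != NULL);
	return c->sys_time_offset ? c->sys_secs - (long)c->minuteswest * 60
				  : c->sys_secs;
}
```

If the first call sets a zero offset and a later call sets 300, the system time is never warped, so subtracting 300 minutes based on sys_tz.tz_minuteswest alone would write a wrong value; the flag avoids that.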

> 
> 
>> +
>> +/*
>>   * Adjust the time obtained from the CMOS to be UTC time instead of
>>   * local time.
>>   *
>> @@ -135,6 +141,8 @@ static inline void warp_clock(void)
>> struct timespec adjust;
>>
>> adjust = current_kernel_time();
>> +

Re: [RFE PATCH] time, Fix setting of hardware clock in NTP code

2013-02-07 Thread Prarit Bhargava



On 02/07/2013 02:52 PM, John Stultz wrote:
> On 02/07/2013 10:20 AM, Prarit Bhargava wrote:
>>
>> On 02/07/2013 12:24 PM, John Stultz wrote:
>>> On Thu, Feb 7, 2013 at 5:29 AM, Prarit Bhargava  wrote:
>>>
>> I'm not sure I understand the purpose behind the +/-15 minute window?  Is it
>> just to prevent a wild swing on the RTC?  I can understand that to some degree,
>> however, I'm not sure I agree with it being the default behaviour.
> 
> The 15 minute cap is totally an x86-ism, and I believe its due to the fact that
> the main concern is we don't reliably know the timezone data has been set
> properly, but we're expected to work well dual booting with Windows.

Heh :)  I think some of the other arches have copied what x86 does.  alpha and
mn10300 do the same thing. :)


> 
>> 99.9% of Linux users out there are using some sort of time protocol (usually
>> NTP, but PTP is starting to catch on) to sync their systems.  NTP is a trusted
>> source of timekeeping IMO.  How often do we see systems that run NTP but don't
>> trust the numbers that come from it?
> 
> I actually doubt ntp usage is that high, given that some popular distros don't
> install it by default, but that's a tangent. :)

:D  Fair enough :D

> 
> Again, the quirks here are all about interacting with Windows on a dual-boot
> environment.
> 

Sure, that's been my understanding as well.

> Though I think it might be reasonable at this point to say we'll set the RTC as
> accurately as we can with the given info, which requires the distro to trigger
> warp clock if the RTC is kept in local time.

Okay.

> 
> 
> 
> 
>>>>   /*
>>>> + * Indicates if there is an offset between the system clock and the hardware
>>>> + * clock/persistent clock/rtc.
>>>> + */
>>>> +int sys_time_offset;
>>> So why is this extra flag necessary instead of just using if
>>> (sys_tz.tz_minuteswest) ?
>> sys_tz can be set during runtime via settimeofday() without affecting the
>> current system time.  The bug *only* happens if the system clock is "warped"
>> ahead relative to the hardware clock on the first call to settimeofday(), so
>> checking for sys_tz.tz_minuteswest isn't good enough of a test.
> 
> So it would probably be better named as something like rtc_is_local.

I'll do a [v2] with that change as well.

> 
> 
> So yea, I think if we include your patch, we can probably consider dropping the
> 15 min cap. There will probably be some situations where system setups don't
> have RTC local configured, so that flag isn't set and we'll fight with a
> dual-boot environment, but those hopefully should be rare.
> 
> I'd suggest we do this in two steps. First your current patch, adding
> rtc_is_local flag and the RTC timezone correction in update_persistent_clock(),
> then second a patch for x86 dropping the 15 min cap that gets wide distribution
> so all the distros know its coming and can test it and object if necessary.

That's what I was hoping for.  I'll post this as an actual patch and then get to
work on the full sync code next.

Thanks,

P.

> 
> thanks
> -john
> 
> 
> 
> 
> 

[PATCH] time, Fix setting of hardware clock in NTP code

2013-02-08 Thread Prarit Bhargava

At init time, if the system time is "warped" forward in warp_clock()
it will differ from the hardware clock by sys_tz.tz_minuteswest.  This time
difference is not taken into account when ntp updates the hardware clock,
and this causes the system time to jump forward by this offset every reboot.

The kernel must take this offset into account when writing the system time
to the hardware clock in the ntp code.  This patch adds
persistent_clock_is_local which indicates that an offset has been applied
in warp_clock() and accounts for the "warp" before writing the hardware
clock.

x86 does not have this problem as rtc writes are software limited to a
+/-15 minute window relative to the current rtc time.  Other arches, such
as powerpc, however do a full synchronization of the system time to the
rtc and will see this problem.

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Thomas Gleixner 
---
 include/linux/time.h |    1 +
 kernel/time.c        |    8 ++++++++
 kernel/time/ntp.c    |    8 ++++++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index 4d358e9..f3646b6 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -117,6 +117,7 @@ static inline bool timespec_valid_strict(const struct timespec *ts)
 
 extern void read_persistent_clock(struct timespec *ts);
 extern void read_boot_clock(struct timespec *ts);
+extern int persistent_clock_is_local;
 extern int update_persistent_clock(struct timespec now);
 void timekeeping_init(void);
 extern int timekeeping_suspended;
diff --git a/kernel/time.c b/kernel/time.c
index d226c6a..c2a27dd 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -115,6 +115,12 @@ SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv,
 }
 
 /*
+ * Indicates if there is an offset between the system clock and the hardware
+ * clock/persistent clock/rtc.
+ */
+int persistent_clock_is_local;
+
+/*
  * Adjust the time obtained from the CMOS to be UTC time instead of
  * local time.
  *
@@ -135,6 +141,8 @@ static inline void warp_clock(void)
struct timespec adjust;
 
adjust = current_kernel_time();
+   if (sys_tz.tz_minuteswest != 0)
+   persistent_clock_is_local = 1;
adjust.tv_sec += sys_tz.tz_minuteswest * 60;
do_settimeofday(&adjust);
 }
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 24174b4..e98f6b7 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -510,8 +510,12 @@ static void sync_cmos_clock(struct work_struct *work)
}
 
getnstimeofday(&now);
-   if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec / 2)
-   fail = update_persistent_clock(now);
+   if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec / 2) {
+   struct timespec adjust = now;
+   if (persistent_clock_is_local)
+   adjust.tv_sec -= (sys_tz.tz_minuteswest * 60);
+   fail = update_persistent_clock(adjust);
+   }
 
next.tv_nsec = (NSEC_PER_SEC / 2) - now.tv_nsec - (TICK_NSEC / 2);
if (next.tv_nsec <= 0)
-- 
1.7.9.3


Re: [PATCH] time, Fix setting of hardware clock in NTP code

2013-02-08 Thread Prarit Bhargava

At init time, if the system time is "warped" forward in warp_clock()
it will differ from the hardware clock by sys_tz.tz_minuteswest.  This time
difference is not taken into account when ntp updates the hardware clock,
and this causes the system time to jump forward by this offset every reboot.

The kernel must take this offset into account when writing the system time
to the hardware clock in the ntp code.  This patch adds
persistent_clock_is_local which indicates that an offset has been applied
in warp_clock() and accounts for the "warp" before writing the hardware
clock.

x86 does not have this problem as rtc writes are software limited to a
+/-15 minute window relative to the current rtc time.  Other arches, such
as powerpc, however do a full synchronization of the system time to the
rtc and will see this problem.

[v2]: generated against tip/timers/core

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Thomas Gleixner 
---
 include/linux/time.h | 1 +
 kernel/time.c        | 8 ++++++++
 kernel/time/ntp.c    | 8 ++++++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index 476e1d7..a3ab6a8 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -128,6 +128,7 @@ static inline bool has_persistent_clock(void)
 
 extern void read_persistent_clock(struct timespec *ts);
 extern void read_boot_clock(struct timespec *ts);
+extern int persistent_clock_is_local;
 extern int update_persistent_clock(struct timespec now);
 void timekeeping_init(void);
 extern int timekeeping_suspended;
diff --git a/kernel/time.c b/kernel/time.c
index d226c6a..c2a27dd 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -115,6 +115,12 @@ SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv,
 }
 
 /*
+ * Indicates if there is an offset between the system clock and the hardware
+ * clock/persistent clock/rtc.
+ */
+int persistent_clock_is_local;
+
+/*
  * Adjust the time obtained from the CMOS to be UTC time instead of
  * local time.
  *
@@ -135,6 +141,8 @@ static inline void warp_clock(void)
struct timespec adjust;
 
adjust = current_kernel_time();
+   if (sys_tz.tz_minuteswest != 0)
+   persistent_clock_is_local = 1;
adjust.tv_sec += sys_tz.tz_minuteswest * 60;
do_settimeofday(&adjust);
 }
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 313b161..b10a42b 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -511,13 +511,17 @@ static void sync_cmos_clock(struct work_struct *work)
 
getnstimeofday(&now);
if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec / 2) {
+   struct timespec adjust = now;
+
fail = -ENODEV;
+   if (persistent_clock_is_local)
+   adjust.tv_sec -= (sys_tz.tz_minuteswest * 60);
 #ifdef CONFIG_GENERIC_CMOS_UPDATE
-   fail = update_persistent_clock(now);
+   fail = update_persistent_clock(adjust);
 #endif
 #ifdef CONFIG_RTC_SYSTOHC
if (fail == -ENODEV)
-   fail = rtc_set_ntp_time(now);
+   fail = rtc_set_ntp_time(adjust);
 #endif
}
 
-- 
1.8.1.2
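
A minimal userspace sketch of the drift described in the changelog above, assuming nothing beyond what the patch shows. The sim_* names are hypothetical stand-ins for the kernel state (sys_tz.tz_minuteswest, persistent_clock_is_local), and elapsed time between boot and the NTP sync is ignored to keep the arithmetic visible.

```c
/* Hypothetical stand-ins for the kernel state touched by the patch. */
struct sim_clock {
	long long rtc_sec;	/* persistent clock contents */
	long long sys_sec;	/* system time */
	int minuteswest;	/* plays the role of sys_tz.tz_minuteswest */
	int rtc_is_local;	/* plays the role of persistent_clock_is_local */
};

/* Boot: read the RTC, then warp the system time as warp_clock() does. */
static void sim_boot(struct sim_clock *c)
{
	c->sys_sec = c->rtc_sec;
	c->rtc_is_local = (c->minuteswest != 0);
	c->sys_sec += (long long)c->minuteswest * 60;
}

/* NTP sync: write system time back, optionally undoing the warp first. */
static void sim_ntp_sync(struct sim_clock *c, int compensate)
{
	long long adjust = c->sys_sec;

	if (compensate && c->rtc_is_local)
		adjust -= (long long)c->minuteswest * 60;
	c->rtc_sec = adjust;
}

/* Returns how far the RTC has moved after `reboots` boot/sync cycles. */
static long long sim_drift(int minuteswest, int reboots, int compensate)
{
	struct sim_clock c = { .rtc_sec = 1000000, .minuteswest = minuteswest };
	long long start = c.rtc_sec;
	int i;

	for (i = 0; i < reboots; i++) {
		sim_boot(&c);
		sim_ntp_sync(&c, compensate);
	}
	return c.rtc_sec - start;
}
```

With compensate set to 0 the RTC moves by tz_minuteswest * 60 seconds on every cycle; with compensate set to 1 it stays put, which is the behaviour the patch is after.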


Re: [PATCH] time, Fix setting of hardware clock in NTP code

2013-02-08 Thread Prarit Bhargava

On 02/08/2013 06:12 PM, John Stultz wrote:
> On 02/08/2013 02:59 PM, Prarit Bhargava wrote:

> 
> Ok, I've got this queued in my tree. What sort of testing did you do with it?
> 
> I want to make sure we don't run into any bad interactions with the existing
> 15min cap on x86.

John,

I did the following:

I used powerpc pseries systems and tested this using both positive and negative
values of sys_tz.tz_minuteswest, with both UTC and LOCAL in /etc/adjtime.  I
dumped values of 'hwclock -D' and date and confirmed that I no longer see time
increasing by sys_tz.tz_minuteswest each reboot.

I also tested x86 32-bit and 64-bit as a sanity check and verified that the
current behaviour on those arches is the same; i.e., I don't see *any* impact to
the x86 rtc.  I dumped values of 'hwclock -D' and date, and again confirmed that
I see no differences in values.  I did that with both UTC and LOCAL.

I also tested a powerpc box and set the hwclock (via BIOS) back to Dec 6 2012 to
see what would happen when I enabled ntp.  The system booted, set the system
time to Dec 6 2012, and then properly ended up with both system time AND hwclock
as Feb 8 2013 after systemd init  (The *exact* time-of-day was correct as
well.  I just can't remember the time I did it ;) )

And I did the same thing (adjusting the BIOS date back) on x86. I only see the
hours and minutes change, as we expect.  The year, month, day are unaffected
with both UTC and LOCAL.

tl;dr  Yup.  Tested as much as I could think of doing before submitting.  Tested
on both x86 and powerpc.  Fixed the bug on powerpc.  No change in behavior seen
with x86.

If you want me to do some other test I certainly can give it a shot.

P.

Re: [RFE PATCH 2/2] rtc, add write functionality to sysfs

2013-02-24 Thread Prarit Bhargava



On 02/23/2013 06:11 PM, Alessandro Zummo wrote:
> On 22/feb/2013, at 22:05, John Stultz  wrote:
> 
>> On 02/22/2013 12:55 PM, Prarit Bhargava wrote:
>>>
>>> On 02/22/2013 03:43 PM, John Stultz wrote:
>>>> On 02/14/2013 09:02 AM, Prarit Bhargava wrote:
>>>>> /sys/class/rtc/rtcX/date and /sys/class/rtc/rtcX/time currently have
>>>>> read-only access.  This patch introduces write functionality which will
>>>>> set the rtc time.
>>>>>
>>>>> Usage: echo YYYY-MM-DD > /sys/class/rtc/rtcX/date
>>>>> echo HH:MM:SS > /sys/class/rtc/rtcX/time
>>>> Why do we want to add a new interface here?
>>> John,
>>>
>>> I'm not adding a new interface.  The current date/time interface only handles
>>> read and I'm introducing write.
>>>
>>
>> Right, but what benefit does that provide?
>> (I'm not saying there isn't any, it's just not clear from your patch why this
>> is a good thing.)
>>

Sorry John, I misunderstood your question.

>> Also CC'ing Alessandro for his input.
> 
> I'd like to keep the interfaces as simple as possible but I'm open to 
> improvements if there are good use cases.
> 

AFAICT there is no way for me to "test" or use the write from userspace.
hwclock uses the SET_TIME ioctl, which is a different code path AFAICT.

I'd like to be at least able to test this stuff when we make changes to it so I
think having write functionality for date & time is worthwhile.

For me, I'm using these to heavily test ntp and ntpdate over system reboots.

OOC, Alessandro, why is the date & time split into two fields?

P.

Re: [PATCH] x86, clocksource, fix !CONFIG_CLOCKSOURCE_WATCHDOG compile

2013-02-24 Thread Prarit Bhargava

On 02/22/2013 03:14 PM, Thomas Gleixner wrote:
>> +void clocksource_mark_unstable(struct clocksource *cs) { }
> 
> Unless this is defined as
> 
>> +static inline void clocksource_mark_unstable(struct clocksource *cs) { }
> 
> Right?

Whups.  Of course ... new patch ...

-8<

If I explicitly disable the clocksource watchdog in the x86 Kconfig,
the x86 kernel will not compile unless this is properly defined.

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Thomas Gleixner 
Cc: x...@kernel.org
---
 kernel/time/clocksource.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c958338..3d81a76 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -450,6 +450,7 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
 static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
 static inline void clocksource_resume_watchdog(void) { }
 static inline int clocksource_watchdog_kthread(void *data) { return 0; }
+static inline void clocksource_mark_unstable(struct clocksource *cs) { }
 
 #endif /* CONFIG_CLOCKSOURCE_WATCHDOG */
 
-- 
1.7.9.3


Re: [RFE PATCH 1/2] x86, rtc, ntp, Do full rtc synchronization with ntp

2013-02-24 Thread Prarit Bhargava

On 02/22/2013 03:42 PM, John Stultz wrote:
> 
> 
> This looks reasonable to me.
> 
> Though I want to make sure we get this thoroughly tested by the various distros
> so we don't surprise anyone, since it has the potential to cause problems where
> folks are dualbooting windows (using a localtime RTC) and do not have their OS
> setup to trigger warp_clock to adjust for the localtime rtc (instead getting a
> time correction later via NTP).
> 
> I'll queue it and see about getting it merged to -tip & -next. Then we'll have
> to decide if 3.10 or 3.11 is the right time frame to land it.
> 

cc'ing Alessandro as well.

John, I've been testing this across various systems (including those known to
have some wonkiness with the RTC in BIOS ... see comment below) to see if this
code impacts anything.

I've tested mainly using Fedora 18 (with the latest kernel -tip obviously), but
I also installed Ubuntu on a system to see if there was any noticeable impact
there too.  I have not seen any unusual testing failures on AMD or Intel systems.

On the one system which I know to have a "weak" battery, such that the RTC
doesn't "stick" across a system shutdown, the clock resets itself to "Jan 1 1970"
in BIOS on reboot.  When I tested previously I could not get the RTC written to
the current date; after my changes, the RTC does at least reflect the current
date through a reboot.  It should be noted that if I do replace the battery on
this system I can get the RTC to properly "stick" through a reboot.

Given the test results I think this should go in earlier rather than later;  I'd
like to target 3.10 for the full sync, and possibly 3.11 for the HCTOSYS stuff.

... unless anyone has a strenuous objection ;)

P.

Re: [PATCH] x86, clocksource, fix !CONFIG_CLOCKSOURCE_WATCHDOG compile

2013-02-25 Thread Prarit Bhargava



On 02/24/2013 12:04 PM, Prarit Bhargava wrote:
> On 02/22/2013 03:14 PM, Thomas Gleixner wrote:
>>> +void clocksource_mark_unstable(struct clocksource *cs) { }
>>
>> Unless this is defined as
>>
>>> +static inline void clocksource_mark_unstable(struct clocksource *cs) { }
>>
>> Right?
> 
> Whups.  Of course ... new patch ...
> 
> -8<
> 
> If I explicitly disable the clocksource watchdog in the x86 Kconfig,
> the x86 kernel will not compile unless this is properly defined.
> 
> Signed-off-by: Prarit Bhargava 
> Cc: John Stultz 
> Cc: Thomas Gleixner 
> Cc: x...@kernel.org
> ---
>  kernel/time/clocksource.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c958338..3d81a76 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -450,6 +450,7 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
>  static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
>  static inline void clocksource_resume_watchdog(void) { }
>  static inline int clocksource_watchdog_kthread(void *data) { return 0; }
> +static inline void clocksource_mark_unstable(struct clocksource *cs) { }
>  
>  #endif /* CONFIG_CLOCKSOURCE_WATCHDOG */

Yuck.  Self NACK this one ... sorry Thomas.  I'm going to have to do more work
on this patch ...

P.

>  

Re: [PATCH] x86, clocksource, fix !CONFIG_CLOCKSOURCE_WATCHDOG compile

2013-02-25 Thread Prarit Bhargava



On 02/22/2013 03:14 PM, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Prarit Bhargava wrote:
> 
>> If I explicitly disable the clocksource watchdog in the x86 Kconfig,
>> the x86 kernel will not compile unless this is properly defined.
> 
> You shouldn't do that. :)
>  
>> Signed-off-by: Prarit Bhargava 
>> Cc: John Stultz 
>> Cc: Thomas Gleixner 
>> Cc: x...@kernel.org
>> ---
>>  kernel/time/clocksource.c |1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index c958338..e04821f 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -450,6 +450,7 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
>>  static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
>>  static inline void clocksource_resume_watchdog(void) { }
>>  static inline int clocksource_watchdog_kthread(void *data) { return 0; }
>> +void clocksource_mark_unstable(struct clocksource *cs) { }
> 
> Unless this is defined as
> 
>> +static inline void clocksource_mark_unstable(struct clocksource *cs) { }
> 
> Right?

Thomas,

Actually that needs to be "void clocksource_mark_unstable()" as it is exported
in include/linux/clocksource.h as such.

The other static inline functions above it are only used in clocksource.c.

So I think my patch is correct ...

P.

> 
>   tglx
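
The linkage point made above can be sketched in a single file. This is only an illustration under stated assumptions: SKETCH_WATCHDOG is a hypothetical stand-in for CONFIG_CLOCKSOURCE_WATCHDOG, the struct is cut down to two fields, and clocksource_is_unstable() is a made-up file-local helper.

```c
/* A trimmed-down struct; the real one lives in include/linux/clocksource.h. */
struct clocksource {
	const char *name;
	int unstable;
};

/*
 * What the shared header provides: a declaration with external linkage,
 * visible to every file that includes it.
 */
void clocksource_mark_unstable(struct clocksource *cs);

/* SKETCH_WATCHDOG is a hypothetical stand-in for CONFIG_CLOCKSOURCE_WATCHDOG. */
#ifdef SKETCH_WATCHDOG
void clocksource_mark_unstable(struct clocksource *cs)
{
	cs->unstable = 1;
}
#else
/*
 * Stub for the disabled case.  It cannot be "static inline" here: a static
 * definition after the non-static declaration above is rejected by the
 * compiler, and other files would have nothing to link against.
 */
void clocksource_mark_unstable(struct clocksource *cs)
{
	(void)cs;
}
#endif

/* A helper used only inside this file can be static inline. */
static inline int clocksource_is_unstable(const struct clocksource *cs)
{
	return cs->unstable;
}
```

Built without SKETCH_WATCHDOG, the stub is still a real symbol, so callers in other files keep linking; it just does nothing.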

Re: [RFE PATCH 2/2] rtc, add write functionality to sysfs

2013-02-28 Thread Prarit Bhargava



On 02/25/2013 09:58 AM, Alessandro Zummo wrote:
> On Sun, 24 Feb 2013 12:03:01 -0500
> Prarit Bhargava  wrote:
> 
>>
>> AFAICT there is no way for me to "test" or use the write from userspace.
>> hwclock uses the SET_TIME ioctl, which is a different code path AFAICT.
>>
>> I'd like to be at least able to test this stuff when we make changes to it so I
>> think having write functionality for date & time is worthwhile.
>>
>> For me, I'm using these to heavily test ntp and ntpdate over system reboots.
> 
>  the point is: who will benefit from this patch? users? distributions?
>  embedded distributions? if it's useful, then just go for it.

This can be dropped IMO.

P.

> 
> 
>> OOC, Alessandro, why is the date & time split into two fields?
> 
>  because date and time are two different things and we expect
>  sysfs to preferably have one value for each entry.
> 

[PATCH] x86, clocksource, fix !CONFIG_CLOCKSOURCE_WATCHDOG compile

2013-02-22 Thread Prarit Bhargava

If I explicitly disable the clocksource watchdog in the x86 Kconfig,
the x86 kernel will not compile unless this is properly defined.

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Thomas Gleixner 
Cc: x...@kernel.org
---
 kernel/time/clocksource.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c958338..e04821f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -450,6 +450,7 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
 static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
 static inline void clocksource_resume_watchdog(void) { }
 static inline int clocksource_watchdog_kthread(void *data) { return 0; }
+void clocksource_mark_unstable(struct clocksource *cs) { }
 
 #endif /* CONFIG_CLOCKSOURCE_WATCHDOG */
 
-- 
1.7.9.3


Re: [RFE PATCH 2/2] rtc, add write functionality to sysfs

2013-02-22 Thread Prarit Bhargava



On 02/22/2013 03:43 PM, John Stultz wrote:
> On 02/14/2013 09:02 AM, Prarit Bhargava wrote:
>> /sys/class/rtc/rtcX/date and /sys/class/rtc/rtcX/time currently have
>> read-only access.  This patch introduces write functionality which will
>> set the rtc time.
>>
>> Usage: echo YYYY-MM-DD > /sys/class/rtc/rtcX/date
>> echo HH:MM:SS > /sys/class/rtc/rtcX/time
> 
> Why do we want to add a new interface here?

John,

I'm not adding a new interface.  The current date/time interface only handles
read and I'm introducing write.

Sorry -- maybe my description was too short ...

P.

> 
> thanks
> -john
> 

[PATCH] hpet, allow user controlled mmap for user processes

2013-03-15 Thread Prarit Bhargava

The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
registers to userspace.  The Kconfig help points out that in some cases this
can be a security risk as some systems may erroneously configure the map such
that additional data is exposed to userspace.

This is a problem for distributions -- some users want the MMAP functionality
and can verify that their systems are secure, but it comes with a significant
security risk for those who do not want the functionality.  In an effort
to mitigate this risk, and due to the low number of users of the MMAP
functionality, I've introduced a kernel parameter, hpet_mmap_enable, that
is required in order to actually have the HPET MMAP exposed.

Signed-off-by: Prarit Bhargava 
Cc: Clemens Ladisch 
---
 Documentation/kernel-parameters.txt |3 +++
 drivers/char/hpet.c |   20 ++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e567af3..dbf0d81 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -962,6 +962,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
VIA, nVidia)
verbose: show contents of HPET registers during setup
 
+   hpet_mmap_enable [X86, HPET_MMAP] option to expose HPET MMAP to
+userspace.  By default this is disabled.
+
hugepages=  [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
On x86-64 and powerpc, this option can be specified
diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index e3f9a99..de770ab 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -367,12 +367,25 @@ static unsigned int hpet_poll(struct file *file, poll_table * wait)
return 0;
 }
 
+#ifdef CONFIG_HPET_MMAP
+static int hpet_mmap_enabled;
+
+static __init int hpet_mmap_enable(char *str)
+{
+   pr_info("HPET MMAP enabled\n");
+   hpet_mmap_enabled = 1;
+   return 1;
+}
+__setup("hpet_mmap_enable", hpet_mmap_enable);
+
 static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
 {
-#ifdef CONFIG_HPET_MMAP
struct hpet_dev *devp;
unsigned long addr;
 
+   if (!hpet_mmap_enabled)
+   return -EACCES;
+
if (((vma->vm_end - vma->vm_start) != PAGE_SIZE) || vma->vm_pgoff)
return -EINVAL;
 
@@ -393,10 +406,13 @@ static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
}
 
return 0;
+}
 #else
+static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
+{
return -ENOSYS;
-#endif
 }
+#endif
 
 static int hpet_fasync(int fd, struct file *file, int on)
 {
-- 
1.7.9.3
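
The gating in the patch above can be modelled in userspace roughly as follows. This is a sketch, not the driver: sketch_parse_cmdline() is a hypothetical stand-in for the __setup() hook and matches whole space-separated tokens only, SKETCH_PAGE_SIZE is an assumed 4096, and sketch_hpet_mmap() keeps just the permission and size checks.

```c
#include <errno.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

static int hpet_mmap_enabled;	/* mirrors the flag added by the patch */

/* Hypothetical stand-in for __setup("hpet_mmap_enable", ...). */
static void sketch_parse_cmdline(const char *cmdline)
{
	char buf[256];
	char *tok;

	strncpy(buf, cmdline, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " "))
		if (strcmp(tok, "hpet_mmap_enable") == 0)
			hpet_mmap_enabled = 1;
}

/* Only the checks from hpet_mmap(); the real remapping is left out. */
static int sketch_hpet_mmap(unsigned long len, unsigned long pgoff)
{
	if (!hpet_mmap_enabled)
		return -EACCES;		/* parameter was not given at boot */
	if (len != SKETCH_PAGE_SIZE || pgoff)
		return -EINVAL;		/* exactly one page at offset 0 */
	return 0;
}
```

The order of the checks matters for callers: without the boot parameter every request fails with -EACCES, even a malformed one, so userspace can tell "disabled" apart from "bad arguments".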


Re: [PATCH] time: Fix timeekeping_get_ns overflow on 32bit systems

2012-09-12 Thread Prarit Bhargava



On 09/11/2012 07:26 PM, John Stultz wrote:
> Thomas: Please queue this in tip/timers/urgent for 3.6.
> 
> Daniel Lezcano reported seeing multi-second stalls from
> keyboard input on his T61 laptop when NOHZ and CPU_IDLE
> were enabled on a 32bit kernel.
> 
> He bisected the problem down to
> 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 (time: Condense
> timekeeper.xtime into xtime_sec).
> 
> After reproducing this issue, I narrowed the problem down
> to the fact that timekeeping_get_ns() returns a 64bit
> nsec value that hasn't been accumulated. In some cases
> this value was being then stored in timespec.tv_nsec
> (which is a long).
> 
> On 32bit systems, with idle times larger than 4 seconds
> (or less, depending on the value of xtime_nsec), the
> returned nsec value would overflow 32bits. This
> kept time from increasing, causing timers to not expire.
> 
> The fix is to make sure we don't directly store the
> result of timekeeping_get_ns() into a tv_nsec field,
> instead using a 64bit nsec value which can then be
> added into the timespec via timespec_add_ns().
> 
> Cc: Ingo Molnar 
> Cc: Richard Cochran 
> Cc: Prarit Bhargava 
> Cc: Thomas Gleixner 
> Cc: Daniel Lezcano 
> Reported-and-bisected-by: Daniel Lezcano 
> Tested-by: Daniel Lezcano 
> Signed-off-by: John Stultz 

Acked-by: Prarit Bhargava 

P.
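
The overflow John describes can be reproduced with plain arithmetic. The sketch below is a userspace approximation: sketch_timespec32 models a timespec whose tv_nsec is 32 bits wide (as long is on 32-bit systems), and sketch_add_ns() imitates what timespec_add_ns() is described as doing; neither is kernel code.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Models a timespec on a 32-bit system, where tv_nsec is a 32-bit long. */
struct sketch_timespec32 {
	int64_t tv_sec;
	int32_t tv_nsec;
};

/* Buggy pattern: store an unaccumulated 64-bit ns value straight into tv_nsec. */
static struct sketch_timespec32 sketch_store_direct(int64_t sec, uint64_t ns)
{
	struct sketch_timespec32 ts;

	ts.tv_sec = sec;
	ts.tv_nsec = (int32_t)(uint32_t)ns;	/* upper 32 bits are lost */
	return ts;
}

/* Fixed pattern: keep ns in 64 bits and fold whole seconds into tv_sec. */
static struct sketch_timespec32 sketch_add_ns(int64_t sec, int32_t nsec,
					      uint64_t ns)
{
	struct sketch_timespec32 ts;
	uint64_t total = (uint64_t)nsec + ns;

	ts.tv_sec = sec + (int64_t)(total / SKETCH_NSEC_PER_SEC);
	ts.tv_nsec = (int32_t)(total % SKETCH_NSEC_PER_SEC);
	return ts;
}
```

For a 5 second idle period (5000000000 ns), the direct store keeps only 705032704 ns and silently drops about 4.29 seconds, while the 64-bit path yields five whole seconds and a zero remainder.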

[PATCH] ntp, add debugfs entries for time_status and time_state

2012-10-04 Thread Prarit Bhargava

Add debugfs entries for ntp time_status and time_state.  These are useful
for debugging ntp issues.

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Thomas Gleixner 
---
 kernel/time/ntp.c |   40 
 1 file changed, 40 insertions(+)

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 24174b4..e1ba393 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "tick-internal.h"
 
@@ -965,3 +966,42 @@ void __init ntp_init(void)
 {
ntp_clear();
 }
+
+static int time_status_get(void *data, u64 *val)
+{
+   *val = time_status;
+   return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(time_status_fops, time_status_get, NULL, "0x%0llx\n");
+
+static int time_state_get(void *data, u64 *val)
+{
+   *val = time_state;
+   return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(time_state_fops, time_state_get, NULL, "0x%llx\n");
+
+static int __init ntp_debugfs_init(void)
+{
+   struct dentry *ntp_dentry, *time_status_dentry, *time_state_dentry;
+
+   ntp_dentry = debugfs_create_dir("ntp", NULL);
+   if (!ntp_dentry)
+   return -ENOMEM;
+
+   time_status_dentry = debugfs_create_file("time_status", 0444,
+ntp_dentry, NULL,
+&time_status_fops);
+   if (!time_status_dentry)
+   return -ENOMEM;
+
+   time_state_dentry = debugfs_create_file("time_state", 0444,
+   ntp_dentry, NULL,
+   &time_state_fops);
+   if (!time_state_dentry)
+   return -ENOMEM;
+
+   return 0;
+}
+/* debugfs init is core_initcall */
+postcore_initcall(ntp_debugfs_init);
-- 
1.7.9.3
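
A sketch of how a userspace tool might consume the attribute text these entries would produce. Assumptions: the text has the "0x...\n" shape given by the format strings in the patch, and the SKETCH_STA_* values are copied from the kernel's timex definitions (STA_PLL, STA_UNSYNC, STA_NANO) under a hypothetical prefix. Reading the actual file (for example under a debugfs mount point) is left out on purpose.

```c
#include <stdlib.h>

/* Status bits as defined for adjtimex(); SKETCH_ prefix is hypothetical. */
#define SKETCH_STA_PLL		0x0001
#define SKETCH_STA_UNSYNC	0x0040
#define SKETCH_STA_NANO		0x2000

/* Parse the "0x<hex>\n" text that a simple debugfs attribute prints. */
static unsigned long long sketch_parse_attr(const char *text)
{
	/* Base 16 accepts an optional "0x" prefix and stops at the newline. */
	return strtoull(text, NULL, 16);
}

/* The clock counts as synchronized when STA_UNSYNC is clear. */
static int sketch_clock_synced(unsigned long long status)
{
	return !(status & SKETCH_STA_UNSYNC);
}
```

A status of 0x2001 (PLL plus nanosecond resolution) would report as synchronized, while 0x40 would not.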


Re: [PATCH] x86, add hypervisor name to dump_stack()

2012-10-24 Thread Prarit Bhargava



On 10/24/2012 05:49 AM, Ingo Molnar wrote:
> 
> * Prarit Bhargava  wrote:
> 
>> Debugging crash, panics, stack trace WARN_ONs, etc., from both virtual and
>> bare-metal boots can get difficult very quickly.  While there are ways to
>> decipher the output and determine if the output is from a virtual guest,
>> the in-kernel hypervisors now have a single registration point and set
>> x86_hyper.  We can use this to output a single extra line on virtual
>> machines that indicates the hypervisor type.
>>
>> Signed-off-by: Prarit Bhargava 
>> Cc: Avi Kivity 
>> Cc: Gleb Natapov 
>> Cc: Alex Williamson 
>> Cc: Marcelo Tosatti 
>> Cc: Ingo Molnar 
>> Cc: k...@vger.kernel.org
>> Cc: x...@kernel.org
>> ---
>>  arch/x86/kernel/dumpstack.c |3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
>> index ae42418b..75a635e 100644
>> --- a/arch/x86/kernel/dumpstack.c
>> +++ b/arch/x86/kernel/dumpstack.c
>> @@ -17,6 +17,7 @@
>>  #include 
>>  
>>  #include 
>> +#include 
>>  
>>  
>>  int panic_on_unrecovered_nmi;
>> @@ -193,6 +194,8 @@ void dump_stack(void)
>>  init_utsname()->release,
>>  (int)strcspn(init_utsname()->version, " "),
>>  init_utsname()->version);
>> +if (x86_hyper && x86_hyper->name)
>> +printk("Hypervisor: %s\n",  x86_hyper->name);
>>  show_trace(NULL, NULL, &stack, bp);
> 
> Looks useful, but please don't waste a full new line on it but
> embed it in the already existing status line that prints
> details like release and version.

Ingo, I thought about doing that but since x86_hyper can be NULL (... maybe it
should be initialized to "Bare-metal" or "No Hypervisor"?) I didn't want to break
up the printk line.  I'll look into doing it a different way...

P.

> 
> Thanks,
> 
>   Ingo

[PATCH] x86, add hypervisor name to dump_stack() [v2]

2012-10-26 Thread Prarit Bhargava

Debugging crashes, panics, stack trace WARN_ONs, etc., from both virtual and
bare-metal boots can get difficult very quickly.  While there are ways to
decipher the output and determine if the output is from a virtual guest,
the in-kernel hypervisors now have a single registration point
and set x86_hyper.  We can use this to output additional debug
information during a panic/oops/stack trace.

Signed-off-by: Prarit Bhargava 
Cc: Avi Kivity 
Cc: Gleb Natapov 
Cc: Alex Williamson 
Cc: Marcelo Tosatti 
Cc: Ingo Molnar 
Cc: k...@vger.kernel.org
Cc: x...@kernel.org

[v2]: Modifications suggested by Ingo and added changes for similar output
  from process.c
---
 arch/x86/kernel/dumpstack.c |   11 ++-
 arch/x86/kernel/process.c   |   12 +++-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ae42418b..5dd680f 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 
@@ -186,9 +187,17 @@ void dump_stack(void)
 {
unsigned long bp;
unsigned long stack;
+   const char *machine_name = "x86";
+   const char *kernel_type = "native";
+
+   if (x86_hyper) {
+   machine_name = x86_hyper->name;
+   kernel_type = "guest";
+   }
 
bp = stack_frame(current, NULL);
-   printk("Pid: %d, comm: %.20s %s %s %.*s\n",
+   printk("[%s %s kernel] Pid: %d, comm: %.20s %s %s %.*s\n",
+   machine_name, kernel_type,
current->pid, current->comm, print_tainted(),
init_utsname()->release,
(int)strcspn(init_utsname()->version, " "),
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b644e1c..14bd064 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -124,6 +125,13 @@ void exit_thread(void)
 void show_regs_common(void)
 {
const char *vendor, *product, *board;
+   const char *machine_name = "x86";
+   const char *kernel_type = "native";
+
+   if (x86_hyper) {
+   machine_name = x86_hyper->name;
+   kernel_type = "guest";
+   }
 
vendor = dmi_get_system_info(DMI_SYS_VENDOR);
if (!vendor)
@@ -135,7 +143,9 @@ void show_regs_common(void)
/* Board Name is optional */
board = dmi_get_system_info(DMI_BOARD_NAME);
 
-   printk(KERN_DEFAULT "Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
+   printk(KERN_DEFAULT
+  "[%s %s kernel] Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
+  machine_name, kernel_type,
   current->pid, current->comm, print_tainted(),
   init_utsname()->release,
   (int)strcspn(init_utsname()->version, " "),
-- 
1.7.9.3
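
The prefix selection in the v2 patch can be tried out in userspace with a sketch like the one below. sketch_hypervisor and sketch_format_banner() are hypothetical names, only the first part of the status line is formatted, and the extra NULL check on the name follows the v1 patch rather than v2.

```c
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-in for the structure that x86_hyper points to. */
struct sketch_hypervisor {
	const char *name;
};

/*
 * Build the start of the status line.  A NULL hypervisor pointer means
 * a bare-metal boot, so the defaults "x86" and "native" are kept.
 */
static int sketch_format_banner(char *buf, size_t len,
				const struct sketch_hypervisor *hyper,
				int pid, const char *comm)
{
	const char *machine_name = "x86";
	const char *kernel_type = "native";

	if (hyper && hyper->name) {
		machine_name = hyper->name;
		kernel_type = "guest";
	}

	return snprintf(buf, len, "[%s %s kernel] Pid: %d, comm: %.20s",
			machine_name, kernel_type, pid, comm);
}
```

Because the defaults are set before the pointer is tested, the format string never needs a conditional, which addresses the concern about breaking up the printk line.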


[PATCH] tty, add kref to sysrq handlers

2012-07-27 Thread Prarit Bhargava

3rd try on this one ...

8<-

On a large system with a large number of tasks, the output of

echo t > /proc/sysrq-trigger

can take a long period of time.  If this period is greater than the period
of the current clocksource, the clocksource watchdog will mark the
clocksource as unstable and fail the clocksource over.

The problem with sysrq is that __handle_sysrq() takes a spin_lock with
interrupts disabled and disables interrupts for the duration of the
handler.  If this happens during sysrq-t on a large system with a large
number of tasks, the result is a "brown-out" of the system.

The spin_lock in question, sysrq_key_table_lock, is in place to prevent
the removal of a sysrq handler while it is being executed in
__handle_sysrq().

A kref is added to each sysrq handler and is incremented and decremented
in __handle_sysrq().  This, while more complicated than a lock, minimizes
the time that the sysrq_key_table_lock is held and results in a functional
sysrq-t.

I've tested both options and I no longer see the clocksource watchdog
marking the TSC clocksource as unstable.

Acked-by: Don Zickus 
Cc: gre...@linuxfoundation.org
Cc: John Stultz 
Cc: Thomas Gleixner 
Cc: lwood...@redhat.com
Cc: jba...@redhat.com
Cc: a...@linux.intel.com
---
 drivers/tty/sysrq.c   |   42 +++---
 include/linux/sysrq.h |2 ++
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 05728894..38c6ae6 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -458,6 +458,20 @@ static struct sysrq_key_op *sysrq_key_table[36] = {
&sysrq_ftrace_dump_op,  /* z */
 };
 
+void sysrq_release(struct kref *kref)
+{
+   struct sysrq_key_op *release_op;
+   int i;
+   unsigned long flags;
+
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
+   release_op = container_of(kref, struct sysrq_key_op, kref);
+   for (i = 0; i < ARRAY_SIZE(sysrq_key_table); i++)
+   if (sysrq_key_table[i] == release_op)
+   sysrq_key_table[i] = NULL;
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
+}
+
 /* key2index calculation, -1 on invalid index */
 static int sysrq_key_table_key2index(int key)
 {
@@ -502,7 +516,6 @@ void __handle_sysrq(int key, bool check_mask)
int i;
unsigned long flags;
 
-   spin_lock_irqsave(&sysrq_key_table_lock, flags);
/*
 * Raise the apparent loglevel to maximum so that the sysrq header
 * is shown to provide the user with positive feedback.  We do not
@@ -513,7 +526,12 @@ void __handle_sysrq(int key, bool check_mask)
console_loglevel = 7;
printk(KERN_INFO "SysRq : ");
 
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
 op_p = __sysrq_get_key_op(key);
+   if (op_p)
+   kref_get(&op_p->kref);
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
+
 if (op_p) {
/*
 * Should we check for enabled operations (/proc/sysrq-trigger
@@ -526,9 +544,14 @@ void __handle_sysrq(int key, bool check_mask)
} else {
printk("This sysrq operation is disabled.\n");
}
+
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
+   kref_put(&op_p->kref, sysrq_release);
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
} else {
printk("HELP : ");
/* Only print the help msg once per handler */
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
for (i = 0; i < ARRAY_SIZE(sysrq_key_table); i++) {
if (sysrq_key_table[i]) {
int j;
@@ -541,10 +564,10 @@ void __handle_sysrq(int key, bool check_mask)
printk("%s ", sysrq_key_table[i]->help_msg);
}
}
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
printk("\n");
console_loglevel = orig_log_level;
}
-   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
 }
 
 void handle_sysrq(int key)
@@ -837,7 +860,12 @@ static int __sysrq_swap_key_ops(int key, struct sysrq_key_op *insert_op_p,
 
spin_lock_irqsave(&sysrq_key_table_lock, flags);
if (__sysrq_get_key_op(key) == remove_op_p) {
-   __sysrq_put_key_op(key, insert_op_p);
+   if (!remove_op_p) { /* register */
+   __sysrq_put_key_op(key, insert_op_p);
+   kref_init(&insert_op_p->kref);
+   }
+   if (!insert_op_p) /* unregister */
+   kref_put(&remove_op_p->kref, sysrq_release);
retval = 0;
} else {
retval = -1;
@@ -898,6 +926,14 @@ static inline void sysrq_init_procfs(void)
 
 static int __init sysrq_init(void)
 {
+

[PATCH] tty, add kref to sysrq handlers

2012-07-27 Thread Prarit Bhargava

On a large system with a large number of tasks, the output of

echo t > /proc/sysrq-trigger

can take a long period of time.  If this period is greater than the period
of the current clocksource, the clocksource watchdog will mark the
clocksource as unstable and fail the clocksource over.

The problem with sysrq is that __handle_sysrq() takes a spin_lock with
spin_lock_irqsave() and so keeps interrupts disabled for the duration of
the handler.  If this happens during sysrq-t on a large system with a large
number of tasks, the result is a "brown-out" of the system.

The spin_lock in question, sysrq_key_table_lock, is in place to prevent
the removal of a sysrq handler while it is being executed in
__handle_sysrq().

A kref is added to each sysrq handler and is incremented and decremented
in __handle_sysrq().  This, while more complicated than a lock, minimizes
the time that the sysrq_key_table_lock is held and results in a functional
sysrq-t.

I've tested both options and I no longer see the clocksource watchdog
marking the TSC clocksource as unstable.
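
The idea described above can be illustrated outside the kernel.  What follows
is only a userspace sketch under stated assumptions, not the code from this
patch: a pthread mutex stands in for sysrq_key_table_lock, a plain integer
stands in for the kref, and every name (struct handler, register_handler(),
run_handler(), count_call(), ...) is hypothetical.  It shows the handler being
looked up and pinned while the lock is held, and the slow callback running
with the lock released.

```c
/* Userspace sketch of "pin under the lock, run unlocked".
 * All names are hypothetical; this is not the kernel's sysrq code. */
#include <pthread.h>
#include <stddef.h>

struct handler {
	int refcount;          /* protected by table_lock */
	void (*fn)(void);      /* potentially slow callback */
};

#define TABLE_SIZE 36
static struct handler *table[TABLE_SIZE];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Example callback used to observe that the handler ran. */
static int calls;
static void count_call(void)
{
	calls++;
}

/* Install a handler; the table holds the initial reference. */
static int register_handler(int idx, struct handler *h)
{
	int ret = -1;

	pthread_mutex_lock(&table_lock);
	if (idx >= 0 && idx < TABLE_SIZE && !table[idx]) {
		h->refcount = 1;
		table[idx] = h;
		ret = 0;
	}
	pthread_mutex_unlock(&table_lock);
	return ret;
}

/* Remove a handler from the table and drop the table's reference. */
static int unregister_handler(int idx)
{
	int ret = -1;

	pthread_mutex_lock(&table_lock);
	if (idx >= 0 && idx < TABLE_SIZE && table[idx]) {
		table[idx]->refcount--;
		table[idx] = NULL;
		ret = 0;
	}
	pthread_mutex_unlock(&table_lock);
	return ret;
}

/* Look up and pin the handler under the lock, then call it unlocked. */
static int run_handler(int idx)
{
	struct handler *h = NULL;

	pthread_mutex_lock(&table_lock);
	if (idx >= 0 && idx < TABLE_SIZE)
		h = table[idx];
	if (h)
		h->refcount++;     /* pin: owner must not free while > 0 */
	pthread_mutex_unlock(&table_lock);

	if (!h)
		return -1;

	h->fn();                   /* long-running work, lock not held */

	pthread_mutex_lock(&table_lock);
	h->refcount--;             /* unpin */
	pthread_mutex_unlock(&table_lock);
	return 0;
}
```

In this sketch the lock is held only for the table lookup and the counter
updates, so the time spent in the callback no longer counts against the
lock hold time.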

Signed-off-by: Prarit Bhargava 
Acked-by: Don Zickus 
Cc: gre...@linuxfoundation.org
Cc: John Stultz 
Cc: Thomas Gleixner 
Cc: lwood...@redhat.com
Cc: jba...@redhat.com
Cc: a...@linux.intel.com
---
 drivers/tty/sysrq.c   |   42 +++---
 include/linux/sysrq.h |2 ++
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 05728894..38c6ae6 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -458,6 +458,20 @@ static struct sysrq_key_op *sysrq_key_table[36] = {
&sysrq_ftrace_dump_op,  /* z */
 };
 
+void sysrq_release(struct kref *kref)
+{
+   struct sysrq_key_op *release_op;
+   int i;
+   unsigned long flags;
+
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
+   release_op = container_of(kref, struct sysrq_key_op, kref);
+   for (i = 0; i < ARRAY_SIZE(sysrq_key_table); i++)
+   if (sysrq_key_table[i] == release_op)
+   sysrq_key_table[i] = NULL;
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
+}
+
 /* key2index calculation, -1 on invalid index */
 static int sysrq_key_table_key2index(int key)
 {
@@ -502,7 +516,6 @@ void __handle_sysrq(int key, bool check_mask)
int i;
unsigned long flags;
 
-   spin_lock_irqsave(&sysrq_key_table_lock, flags);
/*
 * Raise the apparent loglevel to maximum so that the sysrq header
 * is shown to provide the user with positive feedback.  We do not
@@ -513,7 +526,12 @@ void __handle_sysrq(int key, bool check_mask)
console_loglevel = 7;
printk(KERN_INFO "SysRq : ");
 
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
 op_p = __sysrq_get_key_op(key);
+   if (op_p)
+   kref_get(&op_p->kref);
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
+
 if (op_p) {
/*
 * Should we check for enabled operations (/proc/sysrq-trigger
@@ -526,9 +544,14 @@ void __handle_sysrq(int key, bool check_mask)
} else {
printk("This sysrq operation is disabled.\n");
}
+
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
+   kref_put(&op_p->kref, sysrq_release);
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
} else {
printk("HELP : ");
/* Only print the help msg once per handler */
+   spin_lock_irqsave(&sysrq_key_table_lock, flags);
for (i = 0; i < ARRAY_SIZE(sysrq_key_table); i++) {
if (sysrq_key_table[i]) {
int j;
@@ -541,10 +564,10 @@ void __handle_sysrq(int key, bool check_mask)
printk("%s ", sysrq_key_table[i]->help_msg);
}
}
+   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
printk("\n");
console_loglevel = orig_log_level;
}
-   spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
 }
 
 void handle_sysrq(int key)
@@ -837,7 +860,12 @@ static int __sysrq_swap_key_ops(int key, struct sysrq_key_op *insert_op_p,
 
spin_lock_irqsave(&sysrq_key_table_lock, flags);
if (__sysrq_get_key_op(key) == remove_op_p) {
-   __sysrq_put_key_op(key, insert_op_p);
+   if (!remove_op_p) { /* register */
+   __sysrq_put_key_op(key, insert_op_p);
+   kref_init(&insert_op_p->kref);
+   }
+   if (!insert_op_p) /* unregister */
+   kref_put(&remove_op_p->kref, sysrq_release);
retval = 0;
}

Re: [PATCH 0/2][RFC] Better handling of insane CMOS values

2012-07-31 Thread Prarit Bhargava



On 07/31/2012 02:35 AM, John Stultz wrote:
> So CAI Qian noticed recent boot trouble on a machine that had its CMOS
> clock configured for the year 8200. 
> See: http://lkml.org/lkml/2012/7/29/188

In case anyone was wondering, the system's date was very much screwed up:

│ System Time .. 13:39:40  │
│ System Date .. Tue Jul 31, 8212

After testing these patches I set the year to 2012.

P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] x86, add hypervisor name to dump_stack() [v3]

2012-10-30 Thread Prarit Bhargava

Debugging crashes, panics, stack trace WARN_ONs, etc., from both virtual and
bare-metal boots can get difficult very quickly.  While there are ways to
decipher the output and determine if the output is from a virtual guest,
the in-kernel hypervisors now have a single registration point
and set x86_hyper.  We can use this to output additional debug
information during a panic/oops/stack trace.

Signed-off-by: Prarit Bhargava 
Cc: Avi Kivity 
Cc: Gleb Natapov 
Cc: Alex Williamson 
Cc: Marcelo Tosatti 
Cc: Ingo Molnar 
Cc: k...@vger.kernel.org
Cc: x...@kernel.org

[v2]: Modifications suggested by Ingo and added changes for similar output
  from process.c

[v3]: Unify common code and move output to end of line
---
 arch/x86/kernel/dumpstack.c |6 +-
 arch/x86/kernel/process.c   |   12 +++-
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ae42418b..96d40ed 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -188,11 +188,7 @@ void dump_stack(void)
unsigned long stack;
 
bp = stack_frame(current, NULL);
-   printk("Pid: %d, comm: %.20s %s %s %.*s\n",
-   current->pid, current->comm, print_tainted(),
-   init_utsname()->release,
-   (int)strcspn(init_utsname()->version, " "),
-   init_utsname()->version);
+   show_regs_common();
show_trace(NULL, NULL, &stack, bp);
 }
 EXPORT_SYMBOL(dump_stack);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b644e1c..14bd064 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -124,6 +125,13 @@ void exit_thread(void)
 void show_regs_common(void)
 {
const char *vendor, *product, *board;
+   const char *machine_name = "x86";
+   const char *kernel_type = "native";
+
+   if (x86_hyper) {
+   machine_name = x86_hyper->name;
+   kernel_type = "guest";
+   }
 
vendor = dmi_get_system_info(DMI_SYS_VENDOR);
if (!vendor)
@@ -135,7 +143,9 @@ void show_regs_common(void)
/* Board Name is optional */
board = dmi_get_system_info(DMI_BOARD_NAME);
 
-   printk(KERN_DEFAULT "Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
+   printk(KERN_DEFAULT
+  "[%s %s kernel] Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
+  machine_name, kernel_type,
   current->pid, current->comm, print_tainted(),
   init_utsname()->release,
   (int)strcspn(init_utsname()->version, " "),
-- 
1.7.9.3


Re: [PATCH] x86, add hypervisor name to dump_stack() [v3]

2012-10-30 Thread Prarit Bhargava



On 10/30/2012 03:14 PM, Prarit Bhargava wrote:
> Debugging crash, panics, stack trace WARN_ONs, etc., from both virtual and
> bare-metal boots can get difficult very quickly.  While there are ways to
> decipher the output and determine if the output is from a virtual guest,
> the in-kernel hypervisors now have a single registration point
> and set x86_hyper.  We can use this to output additional debug
> information during a panic/oops/stack trace.
> 
> Signed-off-by: Prarit Bhargava 
> Cc: Avi Kivity 
> Cc: Gleb Natapov 
> Cc: Alex Williamson 
> Cc: Marcelo Tosatti 
> Cc: Ingo Molnar 
> Cc: k...@vger.kernel.org
> Cc: x...@kernel.org
> 
> [v2]: Modifications suggested by Ingo and added changes for similar output
>   from process.c
> 
> [v3]: Unify common code and move output to end of line
> ---
>  arch/x86/kernel/dumpstack.c |6 +-
>  arch/x86/kernel/process.c   |   12 +++-
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index ae42418b..96d40ed 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -188,11 +188,7 @@ void dump_stack(void)
>   unsigned long stack;
>  
>   bp = stack_frame(current, NULL);
> - printk("Pid: %d, comm: %.20s %s %s %.*s\n",
> - current->pid, current->comm, print_tainted(),
> - init_utsname()->release,
> - (int)strcspn(init_utsname()->version, " "),
> - init_utsname()->version);
> + show_regs_common();
>   show_trace(NULL, NULL, &stack, bp);
>  }
>  EXPORT_SYMBOL(dump_stack);
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b644e1c..14bd064 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
> @@ -124,6 +125,13 @@ void exit_thread(void)
>  void show_regs_common(void)
>  {
>   const char *vendor, *product, *board;
> + const char *machine_name = "x86";
> + const char *kernel_type = "native";
> +
> + if (x86_hyper) {
> + machine_name = x86_hyper->name;
> + kernel_type = "guest";
> + }
>  
>   vendor = dmi_get_system_info(DMI_SYS_VENDOR);
>   if (!vendor)
> @@ -135,7 +143,9 @@ void show_regs_common(void)
>   /* Board Name is optional */
>   board = dmi_get_system_info(DMI_BOARD_NAME);
>  
> - printk(KERN_DEFAULT "Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
> + printk(KERN_DEFAULT
> +"[%s %s kernel] Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
> +machine_name, kernel_type,

Ugh ... self-nak.  I sent the wrong version of this patch.  Sorry Ingo :(

P.

>  current->pid, current->comm, print_tainted(),
>  init_utsname()->release,
>  (int)strcspn(init_utsname()->version, " "),

[PATCH] x86, add hypervisor name to dump_stack() [v4]

2012-10-30 Thread Prarit Bhargava

Debugging crashes, panics, stack trace WARN_ONs, etc., from both virtual and
bare-metal boots can get difficult very quickly.  While there are ways to
decipher the output and determine if the output is from a virtual guest,
the in-kernel hypervisors now have a single registration point
and set x86_hyper.  We can use this to output additional debug
information during a panic/oops/stack trace.
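
As a rough illustration of the tag this adds to the output, here is a
userspace sketch.  It assumes a hypothetical struct hypervisor_info standing
in for whatever x86_hyper points to, and a hypothetical helper
format_platform_tag(); only the "x86"/"native"/"guest" strings and the
"[%s %s kernel]" format are taken from the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's hypervisor descriptor. */
struct hypervisor_info {
	const char *name;
};

/*
 * Write "[<machine> <type> kernel]" into buf.  A NULL hyper means no
 * hypervisor was detected, i.e. a bare-metal boot.
 * Returns the value returned by snprintf().
 */
static int format_platform_tag(char *buf, size_t len,
			       const struct hypervisor_info *hyper)
{
	const char *machine_name = "x86";
	const char *kernel_type = "native";

	if (hyper) {
		machine_name = hyper->name;
		kernel_type = "guest";
	}
	return snprintf(buf, len, "[%s %s kernel]", machine_name, kernel_type);
}
```

With a NULL pointer the helper produces "[x86 native kernel]"; with a
descriptor whose name is "KVM" it produces "[KVM guest kernel]".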

Signed-off-by: Prarit Bhargava 
Cc: Avi Kivity 
Cc: Gleb Natapov 
Cc: Alex Williamson 
Cc: Marcelo Tosatti 
Cc: Ingo Molnar 
Cc: k...@vger.kernel.org
Cc: x...@kernel.org

[v2]: Modifications suggested by Ingo and added changes for similar output
  from process.c

[v3]: Unify common code and move output to end of line
---
 arch/x86/kernel/dumpstack.c |6 +-
 arch/x86/kernel/process.c   |   14 --
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ae42418b..96d40ed 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -188,11 +188,7 @@ void dump_stack(void)
unsigned long stack;
 
bp = stack_frame(current, NULL);
-   printk("Pid: %d, comm: %.20s %s %s %.*s\n",
-   current->pid, current->comm, print_tainted(),
-   init_utsname()->release,
-   (int)strcspn(init_utsname()->version, " "),
-   init_utsname()->version);
+   show_regs_common();
show_trace(NULL, NULL, &stack, bp);
 }
 EXPORT_SYMBOL(dump_stack);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b644e1c..7ea4692 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -124,6 +125,13 @@ void exit_thread(void)
 void show_regs_common(void)
 {
const char *vendor, *product, *board;
+   const char *machine_name = "x86";
+   const char *kernel_type = "native";
+
+   if (x86_hyper) {
+   machine_name = x86_hyper->name;
+   kernel_type = "guest";
+   }
 
vendor = dmi_get_system_info(DMI_SYS_VENDOR);
if (!vendor)
@@ -135,14 +143,16 @@ void show_regs_common(void)
/* Board Name is optional */
board = dmi_get_system_info(DMI_BOARD_NAME);
 
-   printk(KERN_DEFAULT "Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
+   printk(KERN_DEFAULT
+  "Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s [%s %s kernel]\n",
   current->pid, current->comm, print_tainted(),
   init_utsname()->release,
   (int)strcspn(init_utsname()->version, " "),
   init_utsname()->version,
   vendor, product,
   board ? "/" : "",
-  board ? board : "");
+  board ? board : "",
+  machine_name, kernel_type);
 }
 
 void flush_thread(void)
-- 
1.7.9.3


Re: BUG: tick device NULL pointer during system initialization and shutdown

2013-07-08 Thread Prarit Bhargava



On 07/01/2013 09:30 AM, Thomas Gleixner wrote:
> On Mon, 1 Jul 2013, Prarit Bhargava wrote:
>> On 06/28/2013 06:52 AM, Thomas Gleixner wrote:
>>> Huch. Did the warning in the broadcast code trigger before that?
>>
>> tglx,
>>
>> AFAICT it does not.  Log below on the system I'm testing on.  The test on the
>> system is system boots, sleeps for 30 seconds and then reboots.
> 
>> [  270.563197] INFO: rcu_sched detected stalls on CPUs/tasks: { 51} (detected by 63, t=217205 jiffies, g=3583, c=3582, q=578)
> 
> So the stall is on CPU51, but we do not get a backtrace for CPU51. 
> 
> The backtrace trigger is only sent to online cpus. So CPU51 is offline
> already. Which makes sense as we are in the process of bringing CPUs
> down and the CPUs with backtrace are 0 and 53-63.
> 
> I'm pretty sure, that the patch which clears the stale flag is
> unrelated to this and it cures the NULL pointer dereference (the
> reason why this can happen is clear).
> 
> So now you no longer trip over the NULL pointer dereference, but
> you see a weird RCU stall on an already DEAD cpu. Note, it's dead
> because we already took CPU52 offline as well.
> 
> Paul???

I hit this a few times ... but the frequency of hitting this is MUCH less than
that of the original bug.  So Thomas, can you add

Tested-by: Prarit Bhargava 

to the "tick: Make oneshot broadcast robust vs. CPU offlining" patch?

IMO that problem seems to be solved and we're just peeling the proverbial onion
and finding deeper bugs.

P.

[RESEND PATCH] x86 powertop, replace numa based core ID with physical ID

2013-09-19 Thread Prarit Bhargava

   8   8   0.02 1.26 2.69   0   0.10   0.00  99.88   0.00   31
 0   9   9   0.02 1.30 2.69   0   0.08   0.00  99.90   0.00   21
 0  20  20   0.04 1.23 2.69   0  99.96
 0  21  21   0.02 1.30 2.69   0  99.98
 0  22  22   0.02 1.34 2.69   0  99.98
 0  23  23   0.02 1.33 2.69   0  99.98
 0  24  24   0.02 1.28 2.69   0  99.98
 0  25  25   0.02 1.27 2.69   0  99.98
 0  26  26   0.02 1.34 2.69   0  99.98
 0  27  27   0.02 1.33 2.69   0  99.98
 0  28  28   0.02 1.29 2.69   0  99.98
 0  29  29   0.02 1.31 2.69   0  99.98
 1   0  30   0.02 1.20 2.69   0  99.98
 1   1  31   0.03 1.20 2.69   0  99.97
 1   2  32   0.02 1.20 2.69   0  99.98
 1   3  33   0.03 1.20 2.69   0  99.97
 1   4  34   0.02 1.20 2.69   0  99.98
 1   5  35   0.02 1.20 2.69   0  99.98
 1   6  36   0.02 1.20 2.69   0  99.98
 1   7  37   0.02 1.20 2.69   0  99.98
 1   8  38   0.02 1.20 2.69   0  99.98
 1   9  39   0.02 1.20 2.69   0  99.98
 1  10  10   0.05 1.20 2.69   0   0.13   0.00  99.82   0.00   29   32  12.16   0.00  86.74   0.00   5.50   1.21  4.43  0.00  0.00
 1  11  11   0.03 1.20 2.69   0   0.14   0.00  99.83   0.00   29
 1  12  12   0.40 1.20 2.69   0   0.11   0.00  99.49   0.00   30
 1  13  13   0.03 1.20 2.69   0   0.12   0.00  99.85   0.00   29
 1  14  14   0.03 1.20 2.69   0   0.09   0.00  99.88   0.00   32
 1  15  15   0.03 1.20 2.69   0   0.10   0.00  99.87   0.00   27
 1  16  16   0.03 1.20 2.69   0   0.10   0.00  99.86   0.00   29
 1  17  17   0.03 1.20 2.69   0   0.11   0.00  99.86   0.00   28
 1  18  18   0.03 1.20 2.69   0   0.09   0.00  99.88   0.00   26
 1  19  19   0.04 1.20 2.69   0   0.10   0.00  99.86   0.00   30

which AFAICT is correct.

P.

-8<-

x86 powertop, replace numa based core ID with physical ID

On a 2-socket AMD 6276 processor system, where each socket has 8 2-thread
cores for a total of 16, turbostat only reports 8 cores for each socket
and drops data.

This happens because the sysfs file
/sys/devices/system/cpu/cpu%d/topology/core_id which is used to fetch the
"core_id" of each core is numa-centric and not physically based.

This results in fewer cores being allocated than are present and data gets
dropped.

For example, on the system above "turbostat -vvv" reports

max_core_id 7, sizing for 8 cores per package
max_package_id 1, sizing for 2 packages

when it should report

max_core_id 31, sizing for 16 cores per package
max_package_id 1, sizing for 2 packages

This patch swaps the numa based core_id for the physical core_id, which is
what we really want.  The numa core_id is now only used for debug output.

Successfully tested on the system above and also verified on an Intel
dual-socket E5-26XX system.

Signed-off-by: Prarit Bhargava 
Cc: Len Brown 
Cc: Kristen Carlson Accardi 
---
 tools/power/x86/turbostat/turbostat.c |   20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index fe70207..f7c91e0 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -2009,6 +2009,7 @@ void topology_probe()
 {
int i;
int max_core_id = 0;
+   int min_core_id = 0;
int max_package_id = 0;
int max_siblings = 0;
struct cpu_topology {
@@ -2058,7 +2059,7 @@ void topology_probe()
 
/*
 * For online cpus
-* find max_core_id, max_package_id
+* find min_core_id, max_core_id, max_package_id
 */
for (i = 0; i <= topo.max_cpu_num; ++i) {
int siblings;
@@ -2068,22 +2069,27 @@ void topology_probe()
fprintf(stderr, "cpu%d NOT PRESENT\n", i);
continue;
}
-   cpus[i].core_id = get_core_id(i);
+   cpus[i].core_id = i;
if (cpus[i].core_id > max_core_id)
max_core_id = cpus[i].core_id;
 
cpus[i].physical_package_id = get_physical_package_id(i);
-   if (cpus[i].physical_package_id > max_package_id)
+   if (cpus[i].physical_package_id > max_package_id) {
max_package_id = cpus[i].physical_package_id;
+   min_core_id = i;
+   }
 
siblings = get_num_ht_siblings(i);
if (siblings > max_siblings)
max_siblings = siblings;
if (verbose > 1)
-   fprintf(stderr, "cpu %d pkg %d core %d\n",
-   i, cpus[i].physical_package_id, cpus[i].core_id);
+   fprintf(stderr,
+   "cpu %d pkg %d phys-core %d numa-core %d\n",
+   i, cpus[i].physical_package_id,
+   cpus[i].core_id, get_core_id(i));
}
-   topo.num_cores_per_pkg = max_core_id + 1;
+

[PATCH] random, Add user configurable get_bytes_random()

2013-09-05 Thread Prarit Bhargava

The current code has two exported functions, get_random_bytes() and
get_random_bytes_arch().  The first function only calls the entropy
store to get random data, and the second only calls the arch specific
hardware random number generator.

The problem is that no code is using get_random_bytes_arch(), and switching
over will require a significant code change.  Even if the change is
made, it will be static, forcing a recompile of code if/when a user has a
system with a trusted random HW source.  A better thing to do is allow
users to decide whether they trust their hardware random number generator.

This patch adds a kernel parameter, hw_random_bytes, and a kernel config
option, CONFIG_HW_RANDOM_BYTES, which allow the hardware random number
generator to be enabled or disabled at boot time and at compile time.
This will allow distributions to decide if they want to use the hardware
random number generator while allowing individual users to enable or
disable the generator.
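
The intended behaviour (a compile-time default that a boot-time option can
override) can be sketched in userspace as follows.  This is only an
illustration under assumptions, not the patch itself: the macro
DEFAULT_HW_RANDOM_BYTES stands in for the Kconfig default, and
parse_cmdline(), pick_source(), and enum rng_source are hypothetical names.
Falling back to the entropy pool when no hardware generator is available is
also an assumption of this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the Kconfig default; 0 means "do not use the HW RNG". */
#ifndef DEFAULT_HW_RANDOM_BYTES
#define DEFAULT_HW_RANDOM_BYTES 0
#endif

static int hw_random_bytes = DEFAULT_HW_RANDOM_BYTES;

/* Scan a space-separated command line for "hw_random_bytes=<n>". */
static void parse_cmdline(const char *cmdline)
{
	static const char key[] = "hw_random_bytes=";
	const size_t keylen = sizeof(key) - 1;
	const char *p = cmdline;

	while ((p = strstr(p, key)) != NULL) {
		/* Only accept the key at the start of a word. */
		if (p == cmdline || p[-1] == ' ')
			hw_random_bytes = (strtol(p + keylen, NULL, 10) != 0);
		p += keylen;
	}
}

/* Hypothetical labels for the two sources of random data. */
enum rng_source { RNG_ENTROPY_POOL, RNG_HARDWARE };

/* Choose the source a get_random_bytes()-style call would use. */
static enum rng_source pick_source(int hw_available)
{
	if (hw_random_bytes && hw_available)
		return RNG_HARDWARE;
	return RNG_ENTROPY_POOL;   /* assumed fallback */
}
```

With the default of 0, pick_source() selects the entropy pool until a
command line containing "hw_random_bytes=1" has been parsed.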

Signed-off-by: Prarit Bhargava 
Cc: Theodore Ts'o 
---
 Documentation/kernel-parameters.txt |5 +
 drivers/char/Kconfig|8 
 drivers/char/random.c   |   37 +++
 3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 31a9e51..310663c 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1029,6 +1029,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
   If specified, z/VM IUCV HVC accepts connections
   from listed z/VM user IDs only.
 
+   hw_random_bytes=  [HW] Enable/Disable use of arch specific hardware
+  random number generator in calls to
+  get_random_bytes()
+  Format: 0 (disable/default) | 1 (enable)
+
hwthread_map=   [METAG] Comma-separated list of Linux cpu id to
hardware thread id mappings.
Format: :
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 1421997..1de2a0d 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -235,6 +235,14 @@ config NWFLASH
  If you're not sure, say N.
 
 source "drivers/char/hw_random/Kconfig"
+config HW_RANDOM_BYTES
+   bool "Enable Hardware Random Number Generator for get_random_bytes()"
+   default "n"
+   help
+ Some architectures provide a default hardware random number
+ generator.  By default, get_random_bytes() does not use this
+ generator to provide data.  Setting this to "y" switches
+ get_random_bytes() to use the hardware random number generator.
 
 config NVRAM
tristate "/dev/nvram support"
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0d91fe5..44ab100 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1049,19 +1049,27 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
 }
 
 /*
- * This function is the exported kernel interface.  It returns some
- * number of good random numbers, suitable for key generation, seeding
- * TCP sequence numbers, etc.  It does not use the hw random number
- * generator, if available; use get_random_bytes_arch() for that.
+ * Setting of hw_random_bytes will force get_random_bytes() to use the
+ * arch-specific hardware random number generator.
  */
-void get_random_bytes(void *buf, int nbytes)
+#ifdef CONFIG_HW_RANDOM_BYTES
+static int hw_random_bytes = 1;
+#else
+static int hw_random_bytes = 0;
+#endif
+static __init int set_hw_random_bytes(char *s)
 {
-   extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+   get_option(&s, &hw_random_bytes);
+   if (hw_random_bytes)
+   pr_info("get_random_bytes() using HW RNG\n");
+   else
+   pr_info("get_random_bytes() not using HW RNG\n");
+   return 0;
 }
-EXPORT_SYMBOL(get_random_bytes);
+__setup("hw_random_bytes=", set_hw_random_bytes);
 
 /*
- * This function will use the architecture-specific hardware random
+ * This function will always use the architecture-specific hardware random
  * number generator if it is available.  The arch-specific hw RNG will
  * almost certainly be faster than what we can do in software, but it
  * is impossible to verify that it is implemented securely (as
@@ -1092,6 +1100,19 @@ void get_random_bytes_arch(void *buf, int nbytes)
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
+/*
+ * This function is the well-known exported kernel interface.  It returns some
+ * number of good random numbers, suitable for key generation, seeding
+ * TCP sequence numbers, etc.
+ */
+void get_random_bytes(void *buf, int nbytes)
+{
+   if (hw_random_b

Re: [PATCH] random, Add user configurable get_bytes_random()

2013-09-05 Thread Prarit Bhargava



On 09/05/2013 10:48 AM, Theodore Ts'o wrote:
> On Thu, Sep 05, 2013 at 08:18:44AM -0400, Prarit Bhargava wrote:
>> The current code has two exported functions, get_random_bytes() and
>> get_random_bytes_arch().  The first function only calls the entropy
>> store to get random data, and the second only calls the arch specific
>> hardware random number generator.
>>
>> The problem is that no code is using get_random_bytes_arch(), and switching
>> over will require a significant code change.  Even if the change is
>> made, it will be static, forcing a recompile of code if/when a user has a
>> system with a trusted random HW source.  A better thing to do is allow
>> users to decide whether they trust their hardware random number generator.
> 
> I fail to see the benefit of just using the hardware random number
> generator.  We are already mixing in the hardware random number
> generator into the /dev/random pool, and so the only thing that using

The issue isn't userspace /dev/random as much as it is the use of
get_random_bytes() throughout the kernel.  Switching to get_random_bytes_arch()
is a search'n'replace on the entire kernel.  If a user wants the faster HW
random number generator, why shouldn't they be able to use it by default?

P.


Re: [PATCH] random, Add user configurable get_bytes_random()

2013-09-06 Thread Prarit Bhargava



On 09/05/2013 03:49 PM, Theodore Ts'o wrote:
> BTW, note the following article, published today:
> 
> http://www.nytimes.com/2013/09/06/us/nsa-foils-much-internet-encryption.html?pagewanted=all
> 
> "By this year, the Sigint Enabling Project had found ways inside some
> of the encryption chips that scramble information for businesses and
> governments, either by working with chipmakers to insert back doors"
> 
> Relying solely and blindly on a magic hardware random number generator
> which is sealed inside a CPU chip and which is impossible to audit is
> a ***BAD*** idea.

Your argument seems to rest on the idea that putting stuff on the internet is
safe.  It isn't.  If you believe that, then you've had your head in the sand
and I've got a lot of land in Florida to sell you.

Either way ... it's obvious you're not willing to take this patch and I respect
that decision.

Thanks,

P.

[PATCH] x86 powertop, replace numa based core ID with physical ID

2013-09-11 Thread Prarit Bhargava

   0.00   31
 0   9   9   0.02 1.30 2.69   0   0.08   0.00  99.90   0.00   21
 0  20  20   0.04 1.23 2.69   0  99.96
 0  21  21   0.02 1.30 2.69   0  99.98
 0  22  22   0.02 1.34 2.69   0  99.98
 0  23  23   0.02 1.33 2.69   0  99.98
 0  24  24   0.02 1.28 2.69   0  99.98
 0  25  25   0.02 1.27 2.69   0  99.98
 0  26  26   0.02 1.34 2.69   0  99.98
 0  27  27   0.02 1.33 2.69   0  99.98
 0  28  28   0.02 1.29 2.69   0  99.98
 0  29  29   0.02 1.31 2.69   0  99.98
 1   0  30   0.02 1.20 2.69   0  99.98
 1   1  31   0.03 1.20 2.69   0  99.97
 1   2  32   0.02 1.20 2.69   0  99.98
 1   3  33   0.03 1.20 2.69   0  99.97
 1   4  34   0.02 1.20 2.69   0  99.98
 1   5  35   0.02 1.20 2.69   0  99.98
 1   6  36   0.02 1.20 2.69   0  99.98
 1   7  37   0.02 1.20 2.69   0  99.98
 1   8  38   0.02 1.20 2.69   0  99.98
 1   9  39   0.02 1.20 2.69   0  99.98
 1  10  10   0.05 1.20 2.69   0   0.13   0.00  99.82   0.00   29   32  12.16   0.00  86.74   0.00   5.50   1.21  4.43  0.00  0.00
 1  11  11   0.03 1.20 2.69   0   0.14   0.00  99.83   0.00   29
 1  12  12   0.40 1.20 2.69   0   0.11   0.00  99.49   0.00   30
 1  13  13   0.03 1.20 2.69   0   0.12   0.00  99.85   0.00   29
 1  14  14   0.03 1.20 2.69   0   0.09   0.00  99.88   0.00   32
 1  15  15   0.03 1.20 2.69   0   0.10   0.00  99.87   0.00   27
 1  16  16   0.03 1.20 2.69   0   0.10   0.00  99.86   0.00   29
 1  17  17   0.03 1.20 2.69   0   0.11   0.00  99.86   0.00   28
 1  18  18   0.03 1.20 2.69   0   0.09   0.00  99.88   0.00   26
 1  19  19   0.04 1.20 2.69   0   0.10   0.00  99.86   0.00   30

which AFAICT is correct.

P.

-8<-

x86 powertop, replace numa based core ID with physical ID

On a 2-socket AMD 6276 processor system, where each socket has 8 2-thread
cores for a total of 16, turbostat only reports 8 cores for each socket
and drops data.

This happens because the sysfs file
/sys/devices/system/cpu/cpu%d/topology/core_id which is used to fetch the
"core_id" of each core is numa-centric and not physically based.

This results in fewer cores being allocated than are present and data gets
dropped.

For example, on the system above "turbostat -vvv" reports

max_core_id 7, sizing for 8 cores per package
max_package_id 1, sizing for 2 packages

when it should report

max_core_id 31, sizing for 16 cores per package
max_package_id 1, sizing for 2 packages

This patch swaps the numa based core_id for the physical core_id, which is
what we really want.  The numa core_id is now only used for debug output.
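
The sizing arithmetic can be sketched in isolation.  The following is a
userspace illustration, not turbostat code: struct cpu_info and
cores_per_package() are hypothetical names, and the sketch assumes that
logical CPU numbers are contiguous within each package (CPUs 0-15 on
package 0 and 16-31 on package 1).  With an interleaved numbering the same
arithmetic would give a different answer.

```c
#include <stddef.h>

/* Hypothetical per-CPU record. */
struct cpu_info {
	int cpu;         /* logical CPU number, used as the core id */
	int package_id;  /* physical package (socket) id */
};

/*
 * Track the highest core id seen, and the core id at which the highest
 * package id first appears, then size a package as the span between them.
 */
static int cores_per_package(const struct cpu_info *cpus, size_t n)
{
	int max_core_id = 0;
	int min_core_id = 0;
	int max_package_id = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (cpus[i].cpu > max_core_id)
			max_core_id = cpus[i].cpu;
		if (cpus[i].package_id > max_package_id) {
			max_package_id = cpus[i].package_id;
			min_core_id = cpus[i].cpu;
		}
	}
	return (max_core_id - min_core_id) + 1;
}
```

For 32 CPUs split evenly across two packages under the assumption above,
max_core_id is 31 and min_core_id is 16, so the result is 16 cores per
package, matching the expected "sizing for 16 cores per package" line.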

Successfully tested on the system above and also verified on an Intel
dual-socket E5-26XX system.

Signed-off-by: Prarit Bhargava 
Cc: Len Brown 
Cc: Kristen Carlson Accardi 
---
 tools/power/x86/turbostat/turbostat.c |   20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index fe70207..f7c91e0 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -2009,6 +2009,7 @@ void topology_probe()
 {
int i;
int max_core_id = 0;
+   int min_core_id = 0;
int max_package_id = 0;
int max_siblings = 0;
struct cpu_topology {
@@ -2058,7 +2059,7 @@ void topology_probe()
 
/*
 * For online cpus
-* find max_core_id, max_package_id
+* find min_core_id, max_core_id, max_package_id
 */
for (i = 0; i <= topo.max_cpu_num; ++i) {
int siblings;
@@ -2068,22 +2069,27 @@ void topology_probe()
fprintf(stderr, "cpu%d NOT PRESENT\n", i);
continue;
}
-   cpus[i].core_id = get_core_id(i);
+   cpus[i].core_id = i;
if (cpus[i].core_id > max_core_id)
max_core_id = cpus[i].core_id;
 
cpus[i].physical_package_id = get_physical_package_id(i);
-   if (cpus[i].physical_package_id > max_package_id)
+   if (cpus[i].physical_package_id > max_package_id) {
max_package_id = cpus[i].physical_package_id;
+   min_core_id = i;
+   }
 
siblings = get_num_ht_siblings(i);
if (siblings > max_siblings)
max_siblings = siblings;
if (verbose > 1)
-   fprintf(stderr, "cpu %d pkg %d core %d\n",
-   i, cpus[i].physical_package_id, cpus[i].core_id);
+   fprintf(stderr,
+   "cpu %d pkg %d phys-core %d numa-core %d\n",
+   i, cpus[i].physical_package_id,
+   cpus[i].core_id, get_core_id(i));
}
-   topo.num_cores_per_pkg = max_core_id + 1;
+   topo.num_cores_per_pkg = (max_core_id - min_core_id) + 1;

Re: [PATCH] hpet, allow user controlled mmap for user processes

2013-09-12 Thread Prarit Bhargava



On 08/29/2013 02:01 AM, Matt Wilson wrote:
> On Fri, Mar 22, 2013 at 09:32:54AM -0400, Prarit Bhargava wrote:
>> The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
>> registers to userspace.  The Kconfig help points out that in some cases this
>> can be a security risk as some systems may erroneously configure the map such
>> that additional data is exposed to userspace.
>>
>> This is a problem for distributions -- some users want the MMAP functionality
>> but it comes with a significant security risk.  In an effort to mitigate this
>> risk, and due to the low number of users of the MMAP functionality, I've
>> introduced a kernel parameter, hpet_mmap_enable, that is required in order
>> to actually have the HPET MMAP exposed.
>>
>> [v2]: Clemens suggested modifying the Kconfig help text and making the
>> default setting configurable.
>> [v3]: Fixed up Documentation and Kconfig entries, default now "Y"
>> [v4]: After testing, found that I need to modify CONFIG_HPET_MMAP_DEFAULT usage
>>
>> Signed-off-by: Prarit Bhargava 
>> Cc: Clemens Ladisch 
>> ---
>>  Documentation/kernel-parameters.txt |4 
>>  drivers/char/Kconfig|9 +++--
>>  drivers/char/hpet.c |   25 +++--
>>  3 files changed, 34 insertions(+), 4 deletions(-)
> 
> It doesn't seem like this patch got picked up and seems like a good
> idea to me. Clemens, what do you think?
> 
> Acked-by: Matt Wilson 
> 

Clemens?  I didn't see a reply...

P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] RFC: Introduce FW_INFO* functions

2013-09-27 Thread Prarit Bhargava

Logging and tracking firmware bugs in the kernel has long been an issue
for system administrators.  The current kernel does not have a good
uniform method of reporting firmware bugs and the code in the kernel is a
mix of printk's and WARN_ONs.  This causes problems for both system
administrators and QA engineers who attempt to diagnose problems within
the kernel.

Using printk's is somewhat effective but lacks information useful
for reporting a bug such as the system vendor or model, BIOS revision,
etc.  Using WARN_ONs is also questionable because much of the data like the
backtrace and the list of modules is unnecessary for firmware issues as
the warning stems from one call path during system or driver
initialization.  We have heard many complaints from users about the excess
verbosity of these messages.

I'm proposing with this patch to do something similar to the WARN()
mechanism that is currently implemented in the kernel.  This
patchset introduces FW_INFO() and FW_INFO_DEV(), which log output such as

[  230.661137] [Firmware Info]: pci_bus :00: at
/home/prarit_modules/prarit.c:21 Your BIOS is broken because it is
-ENOWORKY.
[  230.671076] [Firmware Info]: Intel Corporation SandyBridge Platform/To
be filled by O.E.M., BIOS RMLCRB.86I.R3.27.D685.1305151733 05/15/2013

instead of the verbose back traces we are currently seeing.  These messages
can be easily gleaned from /var/log/messages, etc., by automatic bug
reporting tools and system administrators to properly report bugs to
hardware vendors.

For the RFC I've only implemented FW_INFO and FW_INFO_DEV as an example to
gauge interest and gather comments on the patch.  A full patchset would
also include FW_BUG and FW_WARN replacements.  Even in doing only this
simple FW_INFO modification I found an improperly classified FW_INFO in
arch/x86/kernel/cpu/amd.c that should be a FW_WARN.  For now, I am
leaving it as a FW_INFO and will fix it when I implement the FW_BUG for
the "full" patchset.

Comments are appreciated and welcomed.  Thanks in advance for reviewing.

Signed-off-by: Prarit Bhargava 
Cc: gre...@linuxfoundation.org
Cc: a...@linux-foundation.org
---
 arch/x86/kernel/cpu/amd.c  |3 +--
 arch/x86/pci/mmconfig-shared.c |   15 +++
 include/asm-generic/bug.h  |   24 
 include/linux/printk.h |   13 ++---
 kernel/panic.c |   24 
 kernel/printk/printk.c |   12 
 6 files changed, 70 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 903a264..a806ed4 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -620,8 +620,7 @@ static void init_amd(struct cpuinfo_x86 *c)
rdmsrl(0xc0011005, value);
if (value & (1ULL << 54)) {
set_cpu_cap(c, X86_FEATURE_TOPOEXT);
-   printk(KERN_INFO FW_INFO "CPU: Re-enabling "
- "disabled Topology Extensions Support\n");
+   FW_INFO(1, "CPU: Re-enabling disabled Topology Extensions Support\n");
}
}
}
diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 5596c7b..1cdf930 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -498,15 +498,9 @@ static int __ref pci_mmcfg_check_reserved(struct device *dev,
if (is_mmconf_reserved(is_acpi_reserved, cfg, dev, 0))
return 1;
 
-   if (dev)
-   dev_info(dev, FW_INFO
-"MMCONFIG at %pR not reserved in "
-"ACPI motherboard resources\n",
+   FW_INFO_DEV(dev, dev, "MMCONFIG at %pR not reserved in ACPI motherboard resources\n",
 &cfg->res);
-   else
-   pr_info(FW_INFO PREFIX
-  "MMCONFIG at %pR not reserved in "
-  "ACPI motherboard resources\n",
+   FW_INFO(!dev, PREFIX "MMCONFIG at %pR not reserved in ACPI motherboard resources\n",
   &cfg->res);
}
 
@@ -707,10 +701,7 @@ int pci_mmconfig_insert(struct device *dev, u16 seg, u8 start, u8 end,
cfg = pci_mmconfig_lookup(seg, start);
if (cfg) {
if (cfg->end_bus < end)
-   dev_info(dev, FW_INFO
-"MMCONFIG for "
-"domain %04x [bus %02x-%02x] "
-"only partially covers this bridge\n",
+

Re: [PATCH] RFC: Introduce FW_INFO* functions

2013-09-28 Thread Prarit Bhargava



On 09/27/2013 11:40 AM, Joe Perches wrote:
> On Fri, 2013-09-27 at 09:22 -0400, Prarit Bhargava wrote:
>> I'm proposing with this patch to do something similar to the WARN()
>> mechanism that is currently implemented in the kernel.  This
>> patchset introduces FW_INFO() and FW_INFO_DEV() which logs output
> 
> My first thought was "how ugly".
> There must be a better way than scraping dmesg output.

I am in no way married to this patch.  If anyone has a better idea I'd like to
hear it.  The dmesg log is the place that sysadmins are used to looking for it
-- it is the kernel that discovers and reports these issues.  AFAICT we've
always reported FW problems in the kernel log.

> 
>> diff --git a/kernel/panic.c b/kernel/panic.c
> []
>> @@ -445,6 +446,29 @@ void warn_slowpath_fmt_taint(const char *file, int line,
>>  }
>>  EXPORT_SYMBOL(warn_slowpath_fmt_taint);
>>  
>> +void warn_slowpath_fmt_dev(const char *file, int line,
>> +   struct device *dev, const char *fmt, ...)
>> +{
>> +struct slowpath_args args;
>> +
>> +pr_info("[Firmware Info]: ");
>> +if (dev)
>> +pr_cont("%s %s: ",
>> +dev_driver_string(dev), dev_name(dev));
>> +pr_cont("at %s:%d ", file, line);
>> +
>> +args.fmt = fmt;
>> +va_start(args.args, fmt);
>> +vprintk(args.fmt, args.args);
>> +va_end(args.args);
>> +if (dump_hardware_arch_desc())
>> +pr_info("[Firmware Info]: %s\n", dump_hardware_arch_desc());
>> +else
>> +pr_info("[Firmware Info]: Hardware Unidentified\n");
>> +}
>> +EXPORT_SYMBOL(warn_slowpath_fmt_dev);
> 
> This bit should just use %pV and a single printk to
> avoid any possible message interleaving.
> 

Ah ... of course.  I'll definitely do that in a future patch.

Thanks for looking at this Joe.

P.
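
Joe's point about using one printk can be illustrated outside the kernel.  The
sketch below is a userspace analogy only, not the kernel's %pV mechanism: it
builds the prefix, the source location, and the caller's formatted message in
one buffer so the whole record can be emitted with a single call and cannot be
interleaved with other output.  The name fw_info_format() and its signature
are hypothetical and do not appear in the patch.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel API): format the whole firmware-info
 * record into one buffer instead of emitting it piece by piece.
 * Returns the number of characters the full record needs, or a negative
 * value on a formatting error.
 */
static int fw_info_format(char *buf, size_t len, const char *dev_name,
                          const char *file, int line, const char *fmt, ...)
{
    va_list ap;
    int n, m;

    /* Prefix and location first; dev_name may be NULL. */
    n = snprintf(buf, len, "[Firmware Info]: %s%sat %s:%d ",
                 dev_name ? dev_name : "", dev_name ? ": " : "",
                 file, line);
    if (n < 0 || (size_t)n >= len)
        return n;

    /* Append the caller's message to the same buffer. */
    va_start(ap, fmt);
    m = vsnprintf(buf + n, len - (size_t)n, fmt, ap);
    va_end(ap);
    if (m < 0)
        return m;

    return n + m;
}
```

A caller would then hand the finished buffer to a single output call, for
example fputs(buf, stderr).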

Re: [PATCH] ntp, add debugfs entries for time_status and time_state

2012-10-09 Thread Prarit Bhargava



On 10/08/2012 09:47 PM, John Stultz wrote:
> On 10/04/2012 06:48 AM, Prarit Bhargava wrote:
>> Add debugfs entries for ntp time_status and time_state.  These are useful
>> for debugging ntp issues.
> Aren't these easily fetched from adjtimex()?  How does having them in debugfs help?
> 

They are.  However, there have been circumstances in the past when I was
monitoring things from the kernel side and found it useful to have them in
debugfs.

P.

> thanks
> -john
> 
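
For reference, the userspace route John mentions can be sketched as follows.
This is a minimal example, assuming glibc's adjtimex() wrapper: a call with
modes set to 0 changes nothing and only reports, the return value is the clock
state (TIME_OK, TIME_INS, and so on), and the status field holds the STA_*
flag bits.  The helper name read_ntp_state() is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/timex.h>

/*
 * Hypothetical helper: query the NTP clock state and status flags without
 * modifying anything.  Returns the clock state, or -1 on error.
 */
static int read_ntp_state(int *status_out)
{
    struct timex tx;
    int state;

    memset(&tx, 0, sizeof(tx));
    tx.modes = 0;               /* read-only query */

    state = adjtimex(&tx);      /* clock state, or -1 with errno set */
    if (state < 0)
        return -1;

    *status_out = tx.status;    /* STA_* flag bits */
    return state;
}
```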

[PATCH] x86, add hypervisor name to dump_stack()

2012-10-10 Thread Prarit Bhargava

Debugging crashes, panics, stack trace WARN_ONs, etc., from both virtual and
bare-metal boots can get difficult very quickly.  While there are ways to
decipher the output and determine if the output is from a virtual guest,
the in-kernel hypervisors now have a single registration point and set
x86_hyper.  We can use this to output a single extra line on virtual
machines that indicates the hypervisor type.

Signed-off-by: Prarit Bhargava 
Cc: Avi Kivity 
Cc: Gleb Natapov 
Cc: Alex Williamson 
Cc: Marcelo Tosatti 
Cc: Ingo Molnar 
Cc: k...@vger.kernel.org
Cc: x...@kernel.org
---
 arch/x86/kernel/dumpstack.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ae42418b..75a635e 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -17,6 +17,7 @@
 #include 
 
 #include 
+#include 
 
 
 int panic_on_unrecovered_nmi;
@@ -193,6 +194,8 @@ void dump_stack(void)
init_utsname()->release,
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);
+   if (x86_hyper && x86_hyper->name)
+   printk("Hypervisor: %s\n",  x86_hyper->name);
show_trace(NULL, NULL, &stack, bp);
 }
 EXPORT_SYMBOL(dump_stack);
-- 
1.7.9.3


Re: RCU NOHZ, tsc, and clock_gettime

2012-11-13 Thread Prarit Bhargava



On 11/12/2012 06:27 PM, John Stultz wrote:

> Hey Prarit,
> Just back from being on leave, and wanted to check in on this. Did you ever
> get to run with an increased sample size to see how that affected things?  It's
> exactly your point that the non-NOHZ case could align the execution of a short
> run in a way that you always see good results, whereas with NOHZ the alignment
> might not be the same, so you see periodic delays from timer interrupts, etc.
> 
> Anyway, let me know if this got resolved or not.

Hey John,

I never did narrow this down, although this "disappears" with the move to the
lower resolution patches in upstream.  I have not, however, done any testing to
see if the situation is actually resolved or if we just reduced the problem by a
factor of 1000 ;)

P.

BUG: tick device NULL pointer during system initialization and shutdown

2013-06-18 Thread Prarit Bhargava

Similar panics reported during bringup here:

http://lists.infradead.org/pipermail/linux-arm-kernel/2013-May/166205.html
http://lkml.org/lkml/2013/5/8/342

I've seen this a few times on 3.10 based kernels.

[  175.842027] Disabling non-boot CPUs ...
[  475.827017] BUG: unable to handle kernel NULL pointer dereference at
0048
[  475.835780] IP: [] tick_do_broadcast+0x67/0xa0
[  475.842499] PGD 0
[  475.844750] Oops:  [#1] SMP
[  475.848368] Modules linked in: lockd nf_conntrack_netbios_ns
nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sg
acpi_cpufreq mperf i7core_edac coretemp iTCO_wdt iTCO_vendor_support kvm_intel
edac_core kvm lpc_ich mfd_core serio_raw microcode pcspkr xfs libcrc32c sr_mod
cdrom sd_mod crc_t10dif mgag200 drm_kms_helper ttm ixgbe igb ahci dca mdio drm
libahci i2c_algo_bit ptp crc32c_intel libata hpsa i2c_core pps_core sunrpc
dm_mirror dm_region_hash dm_log dm_mod
[  475.917907] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G  I
--   3.10.0-0.rc5.61.el7.x86_64 #1
[  475.929071] Hardware name: HP ProLiant DL180 G6  , BIOS O20 10/01/2012
[  475.936355] task: 818ff440 ti: 818ec000 task.ti:
818ec000
[  475.944706] RIP: 0010:[]  []
tick_do_broadcast+0x67/0xa0
[  475.954135] RSP: 0018:88013bc03e60  EFLAGS: 00010006
[  475.960061] RAX:  RBX: 88013b843800 RCX: 00f8
[  475.968024] RDX:  RSI: 00f8 RDI: 88013b843800
[  475.975987] RBP: 88013bc03e70 R08: 88013b843800 R09: 004a
[  475.983950] R10:  R11: 0001 R12: e8e0
[  475.991914] R13: e8e0 R14:  R15: 8190e200
[  475.999878] FS:  () GS:88013bc0()
knlGS:
[  476.008908] CS:  0010 DS:  ES:  CR0: 8005003b
[  476.015318] CR2: 0048 CR3: 018f8000 CR4: 07f0
[  476.023281] DR0:  DR1:  DR2: 
[  476.031244] DR3:  DR6: 0ff0 DR7: 0400
[  476.039206] Stack:
[  476.041448]  7fff 006e86ffee75 88013bc03ea8 
810b847c
[  476.049741]  81902740   

[  476.058033]  8199dba0 88013bc03eb8 81013a75 
88013bc03f00
[  476.066326] Call Trace:
[  476.069054]  
[  476.071198]  [] tick_handle_oneshot_broadcast+0x14c/0x190
[  476.079185]  [] timer_interrupt+0x15/0x20
[  476.085404]  [] handle_irq_event_percpu+0x3e/0x1e0
[  476.092495]  [] handle_irq_event+0x37/0x60
[  476.098812]  [] handle_edge_irq+0x6f/0x120
[  476.105127]  [] handle_irq+0xbf/0x150
[  476.110959]  [] ? atomic_notifier_call_chain+0x1a/0x20
[  476.118439]  [] do_IRQ+0x4d/0xc0
[  476.123786]  [] common_interrupt+0x6d/0x6d
[  476.130099]  
[  476.132244]  [] ? cpuidle_enter_state+0x4f/0xc0
[  476.139262]  [] cpuidle_idle_call+0xc9/0x210
[  476.145773]  [] arch_cpu_idle+0xe/0x30
[  476.151704]  [] cpu_startup_entry+0x87/0x230
[  476.158206]  [] rest_init+0x77/0x80
[  476.163845]  [] start_kernel+0x415/0x421
[  476.169968]  [] ? repair_env_string+0x5c/0x5c
[  476.176575]  [] ? early_idt_handlers+0x120/0x120
[  476.183473]  [] x86_64_start_reservations+0x2a/0x2c
[  476.190661]  [] x86_64_start_kernel+0xf3/0x100
[  476.197363] Code: 00 00 00 00 48 63 35 b1 bc 94 00 48 89 df 49 c7 c4 e0 e8 00
00 e8 aa 11 24 00 89 c0 48 89 df 48 8b 04 c5 c0 5e 9f 81 4a 8b 04 20  50 48
5b 41 5c 5d c3 90 f0 0f b3 07 48 98 48 c7 c2 e0 e8 00
[  476.219005] RIP  [] tick_do_broadcast+0x67/0xa0
[  476.225816]  RSP 
[  476.229706] CR2: 0048
[  476.233402] ---[ end trace b7cdc1f0d37ce6df ]---
[  476.238552] Kernel panic - not syncing: Fatal exception in interrupt
[  477.305771] Shutting down cpus with NMI
[  477.310252] drm_kms_helper: panic occurred, switching back to text console

I'm debugging assuming a race between the downing of a cpu and the setting of
the cpu mask in the broadcast code -- tglx, what do you think?

P.

Re: [RESEND 2] [PATCH 0/2] Rewrite power limit notification interrupt handling

2013-05-29 Thread Prarit Bhargava

Len,

>The "Power Limit Notification" (X86_FEATURE_PLN) was added in Sandy Bridge
>to give the OS the option of knowing when the package has reached
>a configured power threshold.

>printk(KERN_CRIT "CPU%d: %s power limit notification (total events = %lu)
>printk(KERN_INFO "CPU%d: %s power limit normal\n"

I'm seeing this on a widening number of systems, mostly newer Intel systems.

>However, these events are quite routine on some systems under some conditions,
>alarming customers and provoking un-necessary customer support calls.

The idea that these are "routine" just doesn't make sense to me.  Either this
warning is firing for a valid reason or it isn't.  If it isn't then the question
remains -- why is it firing?  Is it because of buggy FW or is something
actually wrong with the hardware?

P.

Re: kvm_intel: Could not allocate 42 bytes percpu data

2013-06-24 Thread Prarit Bhargava



On 06/24/2013 03:01 PM, Chegu Vinod wrote:
> 
> Hello,
> 
> Lots (~700+) of the following messages are showing up in the dmesg of a
> 3.10-rc1 based kernel (Host OS is running on a large socket count box with
> HT-on).
> 
> [   82.270682] PERCPU: allocation failed, size=42 align=16, alloc from reserved chunk failed
> [   82.272633] kvm_intel: Could not allocate 42 bytes percpu data

On 3.10?  Geez.  I thought we had fixed this.  I'll grab a big machine and see
if I can debug.

Rusty -- any ideas off the top of your head?
> 
> ... also call traces like the following...
> 
> [  101.852136]  c901ad5aa090 88084675dd08 81633743 
> 88084675ddc8
> [  101.860889]  81145053 81f3fa78 88084809dd40 
> 8907d1cfd2e8
> [  101.869466]  8907d1cfd280 88087fffdb08 88084675c010 
> 88084675dfd8
> [  101.878190] Call Trace:
> [  101.880953]  [] dump_stack+0x19/0x1e
> [  101.886679]  [] pcpu_alloc+0x9a3/0xa40
> [  101.892754]  [] __alloc_reserved_percpu+0x13/0x20
> [  101.899733]  [] load_module+0x35f/0x1a70
> [  101.905835]  [] ? do_page_fault+0xe/0x10
> [  101.911953]  [] SyS_init_module+0xfb/0x140
> [  101.918287]  [] system_call_fastpath+0x16/0x1b
> [  101.924981] kvm_intel: Could not allocate 42 bytes percpu data
> 
> 
> Wondering if anyone else has seen this with the recent [3.10] based kernels
> esp. on larger boxes?
> 
> There was a similar issue that was reported earlier (where modules were being
> loaded per cpu without checking if an instance was already loaded/being-loaded).
> That issue seems to have been addressed in the recent past (e.g.
> https://lkml.org/lkml/2013/1/24/659 along with a couple of follow on cleanups)
> Is the above yet another variant of the original issue or perhaps some race
> condition that got exposed when there are lot more threads ?

Hmm ... not sure but yeah, that's the likely culprit.

P.

Re: BUG: tick device NULL pointer during system initialization and shutdown

2013-06-25 Thread Prarit Bhargava



On 06/24/2013 09:57 AM, Thomas Gleixner wrote:
> On Tue, 18 Jun 2013, Prarit Bhargava wrote:
> 
>> Similar panics reported during bringup here:
>>
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2013-May/166205.html
>> http://lkml.org/lkml/2013/5/8/342
>>
>> I've seen this a few times on 3.10 based kernels.
>>
>> [  175.842027] Disabling non-boot CPUs ...
>> [  475.827017] BUG: unable to handle kernel NULL pointer dereference at
>> 0048
> 
> That looks like a stale bit in tick_broadcast_force_mask.
> 
> Does the patch below fix it?
>

Thomas,

Thanks for the patch.

The reproducibility appears to be quite low.  I'm seeing this roughly 1 time
every six hours of continuous system reboots.  I'm testing right now with your
patch.  I'll update the thread in a couple of days...

P.

Re: hrtimer: one more expiry time overflow check in hrtimer_interrupt

2013-06-12 Thread Prarit Bhargava



On 06/11/2013 03:41 AM, Shinya Kuribayashi wrote:
> When executing a date command to set the system date and time to a few
> seconds before the 2038 problem expiration time, we got a WARN_ON_ONCE()
> like this:
> 
>   root@renesas:~# date -s "2038-1-19 3:14:00"
>   Tue Jan 19 03:14:00 GMT 2038
>   (then wait for 7-8 seconds)
>   root@renesas:~# [   27.662658] [ cut here ]
>   [   27.667297] WARNING: at kernel/time/clockevents.c:209 
> clockevents_program_event+0x3c/0x138()
>   [   27.675720] Modules linked in:
>   [   27.678802] [] (unwind_backtrace+0x0/0xe0) from [] 
> (warn_slowpath_common+0x4c/0x64)
>   [   27.688201] [] (warn_slowpath_common+0x4c/0x64) from 
> [] (warn_slowpath_null+0x18/0x1c)
>   [   27.697845] [] (warn_slowpath_null+0x18/0x1c) from 
> [] (clockevents_program_event+0x3c/0x138)
>   [   27.708007] [] (clockevents_program_event+0x3c/0x138) from 
> [] (tick_program_event+0x2c/0x34)
>   [   27.718170] [] (tick_program_event+0x2c/0x34) from 
> [] (hrtimer_interrupt+0x268/0x2a8)
>   [   27.727752] [] (hrtimer_interrupt+0x268/0x2a8) from 
> [] (cmt_timer_interrupt+0x2c/0x34)
>   [   27.737396] [] (cmt_timer_interrupt+0x2c/0x34) from 
> [] (handle_irq_event_percpu+0xb0/0x2a8)
>   [   27.747467] [] (handle_irq_event_percpu+0xb0/0x2a8) from 
> [] (handle_irq_event+0x58/0x74)
>   [   27.757293] [] (handle_irq_event+0x58/0x74) from [] 
> (handle_fasteoi_irq+0xc0/0x148)
>   [   27.72] [] (handle_fasteoi_irq+0xc0/0x148) from 
> [] (generic_handle_irq+0x20/0x30)
>   [   27.776245] [] (generic_handle_irq+0x20/0x30) from 
> [] (handle_IRQ+0x60/0x84)
>   [   27.785003] [] (handle_IRQ+0x60/0x84) from [] 
> (gic_handle_irq+0x34/0x4c)
>   [   27.793426] [] (gic_handle_irq+0x34/0x4c) from [] 
> (__irq_svc+0x40/0x70)
>   [   27.801788] Exception stack(0xc04aff68 to 0xc04affb0)
>   [   27.806823] ff60:    f010 0001  
> c04ae000 c04ec388
>   [   27.815002] ff80: c04b604c c0840d80 40004059 412fc093   
> c04ce140 c04affb0
>   [   27.823150] ffa0: c000f064 c000f068 6013 
>   [   27.828216] [] (__irq_svc+0x40/0x70) from [] 
> (default_idle+0x24/0x2c)
>   [   27.836395] [] (default_idle+0x24/0x2c) from [] 
> (cpu_idle+0x74/0xc8)
>   [   27.844451] [] (cpu_idle+0x74/0xc8) from [] 
> (start_kernel+0x248/0x288)
>   [   27.852722] ---[ end trace 9d8ad385bde80fd3 ]---


>   [   27.857330] hrtimer: interrupt took 0 ns

^^^ see below ...

> 
> This is triggered with our v3.4-based custom ARM kernel, but we
> confirmed that v3.10-rc can still have the same problem.
> 
> I found a similar issue fixed in v3.9 by Prarit Bhargava in commit
> 8f294b5a13 (hrtimer: Add expiry time overflow check in hrtimer_interrupt,
> 2013-04-08).  It tried to resolve a overflow issue detected around 1970
> + 100 seconds.
> 
> On the other hand, we have another call site of tick_program_event() at
> the bottom of hrtimer_interrupt().  The warning this time is triggered
> there, so we need to apply the same fix to it.
> 
> Reported-by: Hiroyuki Yokoyama 
> Signed-off-by: Shinya Kuribayashi 
> ---
> 
> Hi Prarit-san and John-san,
> 
> http://git.kernel.org/linus/8f294b5a139ee4b75e890ad5b443c93d1e558a8b
> hrtimer: Add expiry time overflow check in hrtimer_interrupt
> 
> I tried to fix the other case of overflow issues in hrtimer_interrupt(),
> but not sure it should be worked around like this in the first place.
> 
> Any comments are appreciated, thanks in advance.
> 
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index cdd5607..a42d712 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1368,6 +1370,8 @@ retry:
>   expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
>   else
>   expires_next = ktime_add(now, delta);
> + if (expires_next.tv64 < 0)
> + expires_next.tv64 = KTIME_MAX;

Even with this change you will still see the warning below if delta = 0.

>   tick_program_event(expires_next, 1);
>   printk_once(KERN_WARNING "hrtimer: interrupt took %llu ns\n",
>   ktime_to_ns(delta));

So I'm not sure that this is the correct thing to do.

Is this reproducible on any ARM system?  I'll grab an x86_64 box and give it a
shot there too.  Can you dump the values of now, delta, and expires_next when
the printk_once triggers?

P.
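
To make the intent of the proposed check concrete: the idea is to clamp the
next expiry to the largest representable time rather than let the addition go
negative.  Below is a minimal userspace sketch, assuming time is a signed
64-bit nanosecond count and both operands are non-negative.  The names
KTIME_MAX_SKETCH and ktime_add_saturating() are made up for illustration; this
is not the code in kernel/hrtimer.c.  Unlike the patch, it tests before adding,
because signed overflow is undefined in portable C.

```c
#include <stdint.h>

#define KTIME_MAX_SKETCH INT64_MAX  /* stand-in for the kernel's KTIME_MAX */

/*
 * Hypothetical helper: add a delta to a timestamp, saturating at the
 * maximum instead of wrapping.  Assumes now_ns >= 0 and delta_ns >= 0.
 */
static int64_t ktime_add_saturating(int64_t now_ns, int64_t delta_ns)
{
    if (delta_ns > KTIME_MAX_SKETCH - now_ns)
        return KTIME_MAX_SKETCH;    /* would overflow: clamp */

    return now_ns + delta_ns;
}
```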

Re: BUG: tick device NULL pointer during system initialization and shutdown

2013-06-27 Thread Prarit Bhargava



On 06/26/2013 07:05 AM, Thomas Gleixner wrote:
> On Tue, 25 Jun 2013, Prarit Bhargava wrote:
>> On 06/24/2013 09:57 AM, Thomas Gleixner wrote:
>>> Does the patch below fix it?
>>>
>>
>> Thomas,
>>
>> Thanks for the patch.
>>
>> The reproducibility appears to be quite low.  I'm seeing this roughly 1 time
>> every six hours of continuous system reboots.  I'm testing right now with your
>> patch.  I'll update the thread in a couple of days...
> 
> I have a proper version of that patch now along with an explanation of
> the failure.
> 
> >
> 
> Subject: tick: Make oneshot broadcast robust vs. CPU offlining
> From: Thomas Gleixner 
> Date: Wed, 26 Jun 2013 12:17:32 +0200
> 
> In periodic mode we remove offline cpus from the broadcast propagation
> mask. In oneshot mode we fail to do so. This was not a problem so far,
> but the recent changes to the broadcast propagation introduced a
> constellation which can result in a NULL pointer dereference.
> 

Unfortunately this patch triggers the NMI watchdog during system shutdown.  Most of
the CPUs are in start_secondary+0x254/0x256.

CPU 0, however, is

[  270.579581] NMI backtrace for cpu 0
[  270.583480] CPU: 0 PID: 595 Comm: kworker/0:2 Not tainted 3.10.0-rc4+ #2
[  270.590954] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.T030.072620111404 07/26/2011
[  270.601345] task: 880851c5 ti: 880851c72000 task.ti: 880851c72000
[  270.609691] RIP: 0010:[]  [] update_cfs_shares+0xf0/0xf0
[  270.619126] RSP: 0018:880851c73d78  EFLAGS: 0086
[  270.625049] RAX: 81626180 RBX: 880851c50048 RCX:
[  270.633007] RDX: 0001 RSI: 880851c50048 RDI: 88085f414670
[  270.640965] RBP: 880851c73dc0 R08: 003effcc9cfd R09:
[  270.648923] R10:  R11: 0005 R12: 88085f414670
[  270.656881] R13: 88085f414600 R14: 0001 R15: 0001
[  270.664841] FS:  () GS:88085f40() knlGS:
[  270.673865] CS:  0010 DS:  ES:  CR0: 8005003b
[  270.680272] CR2: 00b8 CR3: 018f8000 CR4: 07f0
[  270.688229] DR0:  DR1:  DR2:
[  270.696188] DR3:  DR6: 0ff0 DR7: 0400
[  270.704146] Stack:
[  270.706388]  8109b019 88085f414600 88085f414600
[  270.714684]  88085f414600 88085f414600 880851c5
[  270.722981]  8808521ec700 880851c73de8 8108ed39 000168d36c00
[  270.731276] Call Trace:
[  270.734007]  [] ? dequeue_task_fair+0x59/0x640
[  270.740713]  [] dequeue_task+0x79/0xa0
[  270.746638]  [] deactivate_task+0x23/0x30
[  270.752857]  [] __schedule+0x589/0x7d0
[  270.758782]  [] schedule+0x29/0x70
[  270.764323]  [] worker_thread+0x1c3/0x3a0
[  270.770541]  [] ? rescuer_thread+0x350/0x350
[  270.777041]  [] kthread+0xc0/0xd0
[  270.782474]  [] ? insert_kthread_work+0x40/0x40
[  270.789272]  [] ret_from_fork+0x7c/0xb0
[  270.795295]  [] ? insert_kthread_work+0x40/0x40

and CPU63 is doing the back trace:

[  272.655049] CPU: 63 PID: 0 Comm: swapper/63 Not tainted 3.10.0-rc4+ #2
[  272.662331] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.T030.072620111404 07/26/2011
[  272.672714] task: 880854df4de0 ti: 880854e02000 task.ti: 880854e02000
[  272.681062] RIP: 0010:[]  [] delay_tsc+0x32/0x80
[  272.689720] RSP: 0018:88106f3c3dd0  EFLAGS: 0083
[  272.695647] RAX: 009e RBX: cea08f3d RCX: 0001
[  272.703607] RDX: cea08fdb RSI: 0050 RDI: 001e7000
[  272.711569] RBP: 88106f3c3de8 R08: 81a02928 R09: 070e
[  272.719529] R10:  R11: 88106f3c3b46 R12: 001e7000
[  272.727491] R13: 003f R14: 88106f3cec80 R15: 81949480
[  272.735452] FS:  () GS:88106f3c() knlGS:
[  272.744470] CS:  0010 DS:  ES:  CR0: 8005003b
[  272.750879] CR2: 7f114a8f7920 CR3: 000c61b5f000 CR4: 07e0
[  272.758841] DR0:  DR1:  DR2:
[  272.766801] DR3:  DR6: 0ff0 DR7: 0400
[  272.774759] Stack:
[  272.777001]  2710 81949300 81949000 88106f3c3df8
[  272.785303]  812f3be8 88106f3c3e10 81036faa 81a02ba0
[  272.793605]  88106f3c3e70 810f8060 000354df4de0 0242
[  272.801908] Call Trace:
[  272.804634]
[  272.806782]  [] __const_ude

[PATCH] hpet, allow user controlled mmap for user processes

2013-03-22 Thread Prarit Bhargava

The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
registers to userspace.  The Kconfig help points out that in some cases this
can be a security risk as some systems may erroneously configure the map such
that additional data is exposed to userspace.

This is a problem for distributions -- some users want the MMAP functionality
but it comes with a significant security risk.  In an effort to mitigate this
risk, and due to the low number of users of the MMAP functionality, I've
introduced a kernel parameter, hpet_mmap, that controls whether the HPET
MMAP is actually exposed; CONFIG_HPET_MMAP_DEFAULT selects its default.

[v2]: Clemens suggested modifying the Kconfig help text and making the
default setting configurable.
[v3]: Fixed up Documentation and Kconfig entries, default now "Y"
[v4]: After testing, found that I need to modify CONFIG_HPET_MMAP_DEFAULT usage

Signed-off-by: Prarit Bhargava 
Cc: Clemens Ladisch 
---
 Documentation/kernel-parameters.txt |4 
 drivers/char/Kconfig|9 +++--
 drivers/char/hpet.c |   25 +++--
 3 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e567af3..191 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -962,6 +962,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
VIA, nVidia)
verbose: show contents of HPET registers during setup
 
+   hpet_mmap=  [X86, HPET_MMAP] option to expose HPET MMAP to
+   userspace.  The default is set by CONFIG_HPET_MMAP_DEFAULT.
+   Values are 0 (disabled) or 1 (enabled).
+
hugepages=  [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
On x86-64 and powerpc, this option can be specified
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 3bb6fa3..51b62a1 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -534,10 +534,15 @@ config HPET_MMAP
  If you say Y here, user applications will be able to mmap
  the HPET registers.
 
+config HPET_MMAP_DEFAULT
+   bool "Enable HPET MMAP access by default"
+   default y
+   depends on HPET_MMAP
+   help
  In some hardware implementations, the page containing HPET
  registers may also contain other things that shouldn't be
- exposed to the user.  If this applies to your hardware,
- say N here.
+ exposed to the user. This option selects the default user access
+ to the HPET registers for applications that require it.
 
 config HANGCHECK_TIMER
tristate "Hangcheck timer"
diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index e3f9a99..b3ba043 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -367,12 +367,30 @@ static unsigned int hpet_poll(struct file *file, poll_table * wait)
return 0;
 }
 
+#ifdef CONFIG_HPET_MMAP
+#ifdef CONFIG_HPET_MMAP_DEFAULT
+static int hpet_mmap_enabled = 1;
+#else
+static int hpet_mmap_enabled = 0;
+#endif
+
+static __init int hpet_mmap_enable(char *str)
+{
+   get_option(&str, &hpet_mmap_enabled);
+   pr_info("HPET MMAP %s\n",
+   hpet_mmap_enabled ? "enabled" : "disabled");
+   return 1;
+}
+__setup("hpet_mmap", hpet_mmap_enable);
+
 static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
 {
-#ifdef CONFIG_HPET_MMAP
struct hpet_dev *devp;
unsigned long addr;
 
+   if (!hpet_mmap_enabled)
+   return -EACCES;
+
if (((vma->vm_end - vma->vm_start) != PAGE_SIZE) || vma->vm_pgoff)
return -EINVAL;
 
@@ -393,10 +411,13 @@ static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
}
 
return 0;
+}
 #else
+static int hpet_mmap(struct file *file, struct vm_area_struct *vma)
+{
return -ENOSYS;
-#endif
 }
+#endif
 
 static int hpet_fasync(int fd, struct file *file, int on)
 {
-- 
1.7.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
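
For readers following the hpet_mmap= change above, here is a minimal userspace
sketch of the idea: a compile-time default that a boot-style "hpet_mmap=0/1"
token can override, plus the early -EACCES gate.  It is an illustration only.
HPET_MMAP_DEFAULT, parse_hpet_mmap() and hpet_mmap_gate() are hypothetical
names, and strtol() stands in for the kernel's get_option().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the Kconfig default (CONFIG_HPET_MMAP_DEFAULT). */
#ifndef HPET_MMAP_DEFAULT
#define HPET_MMAP_DEFAULT 0
#endif

static int hpet_mmap_enabled = HPET_MMAP_DEFAULT;

/*
 * Parse one command-line token such as "hpet_mmap=1".
 * Returns 1 if the token was recognized and the flag updated, else 0.
 */
int parse_hpet_mmap(const char *token)
{
	static const char key[] = "hpet_mmap=";
	const char *value;
	char *end;
	long val;

	assert(token != NULL);
	if (strncmp(token, key, sizeof(key) - 1) != 0)
		return 0;

	value = token + sizeof(key) - 1;
	val = strtol(value, &end, 10);
	if (end == value)
		return 0;	/* no digits after '=': keep the default */

	hpet_mmap_enabled = (val != 0);
	return 1;
}

/* Mirrors the check at the top of hpet_mmap(): refuse unless enabled. */
int hpet_mmap_gate(void)
{
	return hpet_mmap_enabled ? 0 : -EACCES;
}
```

With the default of 0, hpet_mmap_gate() returns -EACCES until a token such as
"hpet_mmap=1" has been parsed.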

[PATCH] x86, tsc add an initial read offset to __cycles_2_ns() calculations

2013-07-24 Thread Prarit Bhargava

Note that the E5 Sandybridge processor does not have IA32_TSC_ADJUST MSR
implemented so attempting to resynch the TSCs is not possible on the
problem hardware.

Thanks to John for the suggestion below.

P.

8<

The TSC can have non-zero values at boot time on Intel Xeon E5 (family 6,
model 45) aka "SandyBridge" processors.  This is documented in the Errata
for the E5 processors as BT81.

The __cycles_2_ns() calculation is known to overflow if a large value of
cycles is passed into the function.  This is done by design to improve
precision for smaller significant digits in the calculation.  Since the E5
processor can pass in a large value, we need to snapshot the TSC's
initial value to avoid calculation overflows in the conversions of cycles
to nanoseconds.
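
As a rough illustration of the overflow described above, the sketch below uses
a plain multiply-then-shift conversion, which is a simplification and not the
kernel's mult_frac()-based code.  SCALE_SHIFT is assumed to match
CYC2NS_SCALE_FACTOR (10), and the function names and the scale value used in
the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT 10	/* assumed to match CYC2NS_SCALE_FACTOR */

/* Naive conversion: the 64-bit product wraps once cyc * scale >= 2^64. */
uint64_t cycles_to_ns_naive(uint64_t cyc, uint64_t scale)
{
	return (cyc * scale) >> SCALE_SHIFT;
}

/* Same conversion, taken relative to a snapshot of the initial counter. */
uint64_t cycles_to_ns_rebased(uint64_t cyc, uint64_t initial, uint64_t scale)
{
	assert(cyc >= initial);
	return ((cyc - initial) * scale) >> SCALE_SHIFT;
}
```

With scale = 400 (about 2.56 GHz) and a counter that starts at 1 << 60, the
rebased form reports one second after 2560000000 cycles, while the naive form
wraps and can return a smaller value than it did for a much earlier reading.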

Tested successfully on various Sandybridge systems as well as a few older
and newer systems without any issues.

Also, remove the unused cycles_2_ns() function.

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Dave Hansen 
Cc: x...@kernel.org
---
 arch/x86/include/asm/timer.h |   15 +++
 arch/x86/kernel/tsc.c|   13 +
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index 34baa0e..f9d666b 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -12,6 +12,8 @@ extern int recalibrate_cpu_khz(void);
 
 extern int no_timer_check;
 
+extern unsigned long long tsc_initial_value;
+
 /* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
@@ -59,21 +61,10 @@ static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
int cpu = smp_processor_id();
unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
+   cyc -= tsc_initial_value;
ns += mult_frac(cyc, per_cpu(cyc2ns, cpu),
(1UL << CYC2NS_SCALE_FACTOR));
return ns;
 }
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-   unsigned long long ns;
-   unsigned long flags;
-
-   local_irq_save(flags);
-   ns = __cycles_2_ns(cyc);
-   local_irq_restore(flags);
-
-   return ns;
-}
-
 #endif /* _ASM_X86_TIMER_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6ff4924..63ed8cc 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,6 +38,16 @@ static int __read_mostly tsc_unstable;
 static int __read_mostly tsc_disabled = -1;
 
 int tsc_clocksource_reliable;
+
+/*
+ * TSC can have non-zero values at boot time on Intel Xeon E5 (family 6,
+ * model 45) aka "SandyBridge" processors.  This is documented in the
+ * Errata for the processors as BT81.  As a result, we need to snapshot
+ * the TSC's initial value to avoid calculation overflows in the conversions
+ * of cycles to nanoseconds.
+ */
+unsigned long long tsc_initial_value;
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
@@ -979,6 +989,9 @@ void __init tsc_init(void)
return;
}
 
+   tsc_initial_value = get_cycles();
+   pr_info("TSC: tsc initial value = %llu\n", tsc_initial_value);
+
pr_info("Detected %lu.%03lu MHz processor\n",
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);
-- 
1.7.9.3


Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

2013-03-02 Thread Prarit Bhargava

On 03/01/2013 12:17 PM, Neil Horman wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> 
> For the 5520 and 5500 chipsets which contained an errata (specifically errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result they recommend that interrupt remapping be disabled in bios.  While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem.  As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has been moved, leading to lost or spurious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> 
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have systems for which their BIOS level is such
> that this feature was not properly turned off.  As such, it would be good to
> give them a reminder that their systems are vulnerable to this problem.
> 
> Signed-off-by: Neil Horman 
> CC: Prarit Bhargava 
> CC: Don Zickus 
> CC: Don Dutile 
> CC: Bjorn Helgaas 
> CC: Asit Mallick 
> CC: linux-...@vger.kernel.org
> ---
>  drivers/iommu/intel_irq_remapping.c | 20 
>  include/linux/pci_ids.h |  2 ++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f3b8f23..9bfb6c2 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1113,3 +1113,23 @@ struct irq_remap_ops intel_irq_remap_ops = {
>   .msi_setup_irq  = intel_msi_setup_irq,
>   .setup_hpet_msi = intel_setup_hpet_msi,
>  };
> +
> +
> +static void intel_remapping_check(struct pci_dev *dev)
> +{
> + u8 revision;
> +
> + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> +
> + if ((revision == 0x13) && irq_remapping_enabled) {
> + pr_warn("WARNING WARNING WARNING WARNING WARNING WARNING\n"
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an errata making that\n"
> + "feature unstable.  Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");

Make this one line?  Might be too long but I believe the preferred policy is now
to keep the output on one line so that it is easy to find in the kernel source.

Also, IMO, remove the WARNING WARNING stuff.

You also should probably use HW_ERR here too.

P.

> + }
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);
> +
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 31717bd..54027a6 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2732,6 +2732,8 @@
>  #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2  0x2db2
>  #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV20x2db3
>  #define PCI_DEVICE_ID_INTEL_82855PM_HB   0x3340
> +#define PCI_DEVICE_ID_INTEL_5500_IOHUB   0x3403
> +#define PCI_DEVICE_ID_INTEL_5520_IOHUB   0x3406
>  #define PCI_DEVICE_ID_INTEL_IOAT_TBG40x3429
>  #define PCI_DEVICE_ID_INTEL_IOAT_TBG50x342a
>  #define PCI_DEVICE_ID_INTEL_IOAT_TBG60x342b


[PATCH] x86, clocksource, fix !CONFIG_CLOCKSOURCE_WATCHDOG compile [v2]

2013-03-03 Thread Prarit Bhargava

If I explicitly disable the clocksource watchdog in the x86 Kconfig,
the x86 kernel will not compile unless this is properly defined.
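
The usual way to keep callers compiling when an option is off is an empty
static inline stub.  A minimal standalone sketch of that pattern follows.
DEMO_CLOCKSOURCE_WATCHDOG is a hypothetical macro standing in for the Kconfig
symbol, and struct clocksource here is a reduced stand-in with made-up fields,
not the kernel structure.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the kernel structure; fields are illustrative. */
struct clocksource {
	const char *name;
	int unstable;
};

/* Hypothetical switch standing in for CONFIG_CLOCKSOURCE_WATCHDOG. */
#ifdef DEMO_CLOCKSOURCE_WATCHDOG
void clocksource_mark_unstable(struct clocksource *cs)
{
	cs->unstable = 1;
}
#else
/* Option disabled: callers still compile, the call becomes a no-op. */
static inline void clocksource_mark_unstable(struct clocksource *cs)
{
	(void)cs;
}
#endif

/* A caller that needs no #ifdef of its own. */
int report_tsc_problem(struct clocksource *cs)
{
	assert(cs != NULL);
	clocksource_mark_unstable(cs);
	return cs->unstable;
}
```

Built without -DDEMO_CLOCKSOURCE_WATCHDOG, report_tsc_problem() still links
and simply leaves the clocksource untouched.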

v2: Oops ... move this to clocksource.h where it belongs

Signed-off-by: Prarit Bhargava 
Cc: John Stultz 
Cc: Thomas Gleixner 
Cc: x...@kernel.org
---
 include/linux/clocksource.h |5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 27cfda4..8292fe6 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -285,7 +285,12 @@ extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_suspend(void);
 extern void clocksource_resume(void);
 extern struct clocksource * __init __weak clocksource_default_clock(void);
+#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
 extern void clocksource_mark_unstable(struct clocksource *cs);
+#else
+static inline void clocksource_mark_unstable(struct clocksource *cs) { }
+#endif
+
 
 extern void
 clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 minsec);
-- 
1.7.9.3


Re: [tip:timers/urgent] tick: Cleanup NOHZ per cpu data on cpu down

2013-05-13 Thread Prarit Bhargava



On 05/12/2013 06:27 AM, tip-bot for Thomas Gleixner wrote:
> Commit-ID:  4b0c0f294f60abcdd20994a8341a95c8ac5eeb96
> Gitweb: http://git.kernel.org/tip/4b0c0f294f60abcdd20994a8341a95c8ac5eeb96
> Author: Thomas Gleixner 
> AuthorDate: Fri, 3 May 2013 15:02:50 +0200
> Committer:  Thomas Gleixner 
> CommitDate: Sun, 12 May 2013 12:20:09 +0200
> 
> tick: Cleanup NOHZ per cpu data on cpu down
> 
> Prarit reported a crash on CPU offline/online. The reason is that on
> CPU down the NOHZ related per cpu data of the dead cpu is not cleaned
> up. If at cpu online an interrupt happens before the per cpu tick
> device is registered the irq_enter() check potentially sees stale data
> and dereferences a NULL pointer.
> 
> Cleanup the data after the cpu is dead.

Thomas, while this does fix up the NULL pointer issue, I think you've introduced
a new bug in the schedule timer code.

While doing up and downs on the same CPU, I now occasionally see long delays in
the up and down...

[   65.150073] smpboot: Booting Node 1 Processor 19 APIC 0x28
[   66.715339] smpboot: CPU 19 is now offline
[   67.752751] smpboot: Booting Node 1 Processor 19 APIC 0x28
[   68.758711] smpboot: CPU 19 is now offline

Everything is normal ...

[   69.711612] smpboot: Booting Node 1 Processor 19 APIC 0x28
[   70.731521] smpboot: CPU 19 is now offline

Long delay in bringing CPU "down"

[   81.744565] smpboot: Booting Node 1 Processor 19 APIC 0x28
[   82.848591] smpboot: CPU 19 is now offline

Long delay in bringing CPU "up"

[   89.826533] smpboot: Booting Node 1 Processor 19 APIC 0x28
[   84.905358] smpboot: CPU 19 is now offline
[   87.565274] smpboot: Booting Node 1 Processor 19 APIC 0x28

Also, if the system is in this state I cannot reboot -- the system appears to
hang while bringing down CPUs...

Oddly, if I do

+   memset(ts, 0, sizeof(*ts));
+   ts->tick_stopped = 1;

instead of your memset, everything works.  I'm looking at the tick-sched.c code
to see why setting tick_stopped = 1 seems to fix the problem.

P.

Re: [tip:timers/urgent] tick: Cleanup NOHZ per cpu data on cpu down

2013-05-14 Thread Prarit Bhargava



On 05/13/2013 03:10 PM, Thomas Gleixner wrote:
> On Mon, 13 May 2013, Prarit Bhargava wrote:
>> Thomas, while this does fix up the NULL pointer issue, I think you've introduced
>> a new bug in the schedule timer code.
> 
> I don't think that I introduced a new bug. I'm quite sure that change
> unearthed another issue which was papered over by the stale data.
> 
> That memset is putting the data structure into the same state as we
> have on boot. From tick-sched perspective cpu onlining is not
> different between boot and an offline/online cycle
> 
>> While doing up and downs on the same CPU, I now occasionally see long delays in
>> the up and down...
> 
>> [   81.744565] smpboot: Booting Node 1 Processor 19 APIC 0x28
>> [   82.848591] smpboot: CPU 19 is now offline
>>
>> Long delay in bringing CPU "up"
>>
>> [   89.826533] smpboot: Booting Node 1 Processor 19 APIC 0x28
>> [   84.905358] smpboot: CPU 19 is now offline
>> [   87.565274] smpboot: Booting Node 1 Processor 19 APIC 0x28
> 
> Errm, the timestamps are random. -ENOTUSEFUL
>  

I'm always saying my computer is full of lies ;)

Here's the bottom line.  The patch included in this thread plus the patch you
pointed me to here

http://marc.info/?l=linux-kernel&m=136847403809031&w=2

seem to resolve the cpu up/down + thermal interrupt issues that I've been 
seeing.

So thank you :)

Tested-by: Prarit Bhargava 

P.

[PATCH] hrtimer, add expiry time overflow check in hrtimer_interrupt

2013-04-08 Thread Prarit Bhargava

2nd submit: I did quite a bit of testing with this patch and I don't see any
ill effects.  I've tested across several large and small x86 systems, and a
ppc system for good measure.

P.

8<---
The settimeofday01 test in the LTP testsuite effectively does

gettimeofday(current time);
settimeofday(Jan 1, 1970 + 100 seconds);
settimeofday(current time);

This test causes a stack trace to be displayed on the console during the
setting of timeofday to Jan 1, 1970 + 100 seconds:

[  131.066751] ------------[ cut here ]------------
[  131.096448] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x135/0x140()
[  131.104935] Hardware name: Dinar
[  131.108150] Modules linked in: sg nfsv3 nfs_acl nfsv4 auth_rpcgss nfs 
dns_resolver fscache lockd sunrpc nf_conntrack_netbios_ns 
nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle 
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables 
kvm_amd kvm sp5100_tco bnx2 i2c_piix4 crc32c_intel k10temp fam15h_power 
ghash_clmulni_intel amd64_edac_mod pcspkr serio_raw edac_mce_amd edac_core 
microcode xfs libcrc32c sr_mod sd_mod cdrom ata_generic crc_t10dif pata_acpi 
radeon i2c_algo_bit drm_kms_helper ttm drm ahci pata_atiixp libahci libata 
usb_storage i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  131.176784] Pid: 0, comm: swapper/28 Not tainted 3.8.0+ #6
[  131.182248] Call Trace:
[  131.184684][] warn_slowpath_common+0x7f/0xc0
[  131.191312]  [] warn_slowpath_null+0x1a/0x20
[  131.197131]  [] clockevents_program_event+0x135/0x140
[  131.203721]  [] tick_program_event+0x24/0x30
[  131.209534]  [] hrtimer_interrupt+0x131/0x230
[  131.215437]  [] ? cpufreq_p4_target+0x130/0x130
[  131.221509]  [] smp_apic_timer_interrupt+0x69/0x99
[  131.227839]  [] apic_timer_interrupt+0x6d/0x80
[  131.233816][] ? sched_clock_cpu+0xc5/0x120
[  131.240267]  [] ? cpuidle_wrap_enter+0x50/0xa0
[  131.246252]  [] ? cpuidle_wrap_enter+0x49/0xa0
[  131.252238]  [] cpuidle_enter_tk+0x10/0x20
[  131.257877]  [] cpuidle_idle_call+0xa9/0x260
[  131.263692]  [] cpu_idle+0xaf/0x120
[  131.268727]  [] start_secondary+0x255/0x257
[  131.274449] ---[ end trace 1151a50552231615 ]---

When we change the system time to a low value like this, the value of
timekeeper->offs_real will be a negative value.

It seems that the WARN occurs because an hrtimer has been started in the time
between the releasing of the timekeeper lock and the IPI call (via a call to
on_each_cpu) in clock_was_set() in the do_settimeofday() code.  The end result
is that a CLOCK_REALTIME timer has been added with softexpires = expires =
KTIME_MAX.  The hrtimer_interrupt() fires/is called and the loop at
kernel/hrtimer.c:1289 is executed.  In this loop the code subtracts the
clock base's offset (which was set to timekeeper->offs_real in
do_settimeofday()) from the current hrtimer_cpu_base->expiry value (which
was KTIME_MAX):

KTIME_MAX - (a negative value) = overflow

A simple check for an overflow can resolve this problem.  Using KTIME_MAX
instead of the overflow value will result in the hrtimer function being run,
and the reprogramming of the timer after that.
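
The arithmetic can be reproduced outside the kernel.  The sketch below is an
illustration under stated assumptions: expiry_minus_offset() is a hypothetical
helper, ktime values are modelled as plain int64_t nanoseconds, and the cast
back to int64_t relies on the two's complement conversion that GCC and Clang
provide.

```c
#include <assert.h>
#include <stdint.h>

#define KTIME_MAX INT64_MAX	/* largest representable expiry */

/*
 * Subtract the clock base offset from an absolute expiry.  The subtraction
 * is done on unsigned values so that wraparound is well defined, then the
 * rule from the patch is applied: a negative result becomes KTIME_MAX.
 */
int64_t expiry_minus_offset(int64_t expires, int64_t offset)
{
	int64_t res;

	assert(expires >= 0);	/* expiry times are never negative here */
	res = (int64_t)((uint64_t)expires - (uint64_t)offset);
	if (res < 0)
		res = KTIME_MAX;
	return res;
}
```

An expiry of KTIME_MAX with a negative offset wraps to a negative value and is
clamped back to KTIME_MAX, while ordinary cases such as 1000 - 400 are
unaffected.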

Signed-off-by: Prarit Bhargava 
Cc: Thomas Gleixner 
Cc: John Stultz 
---
 kernel/hrtimer.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index cc47812..17eafd7 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1309,6 +1309,8 @@ retry:
 
expires = ktime_sub(hrtimer_get_expires(timer),
base->offset);
+   if (expires.tv64 < 0)
+   expires.tv64 = KTIME_MAX;
if (expires.tv64 < expires_next.tv64)
expires_next = expires;
break;
-- 
1.7.9.3


Re: [PATCH] hrtimer, add expiry time overflow check in hrtimer_interrupt

2013-04-08 Thread Prarit Bhargava



On 04/08/2013 04:19 PM, John Stultz wrote:
> On 04/08/2013 05:47 AM, Prarit Bhargava wrote:

>>
>> A simple check for an overflow can resolve this problem.  Using KTIME_MAX
>> instead of the overflow value will result in the hrtimer function being run,
>> and the reprogramming of the timer after that.
>>
>> Signed-off-by: Prarit Bhargava 
>> Cc: Thomas Gleixner 
>> Cc: John Stultz 
> 
> Prarit: Should this be tagged for -stable?

John,

Yes, this should go to -stable.  cc'd.

P.

> 
> thanks
> -john
> 

[PATCH] NOHZ, check to see if tick device is initialized in IRQ handling path

2013-04-30 Thread Prarit Bhargava

2nd try at this ... going with a more global cc.

I think the linux.git "system hang" isn't really a hang.  For some reason the
panic text wasn't displayed on the console.  I've seen this behaviour a few
times now ... maybe there's a bug in the panic output path?

It seems that the power interrupt signals an error: the CPU exceeded the
OS's currently requested frequency on the package.  If I disable on-demand
cpu frequency scaling, the problem goes away.

Anyhoo, here's a patch...

8<

When adding a CPU there is a small window in which interrupts are enabled and
the clock tick device has not been initialized.  If an interrupt occurs in
this window, irq_exit() will be called which calls tick_nohz_irq_exit() which
in turn calls __tick_nohz_idle_enter().

__tick_nohz_idle_enter() assumes that the tick has been initialized.  In the
above case, however, it has not and this leads to what appears to be a system
hang on latest linux.git or the following panic on RHEL6:

Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
RIP: 0010:[]  [] tick_nohz_stop_sched_tick+0x2a5/0x3e0
RSP: 0018:88089c503f38  EFLAGS: 00010046
RAX: 81c07520 RBX: 88089c5116a0 RCX: 02f04bb18cd8
RDX:  RSI: a1b5 RDI: 02f04bb0eb23
RBP: 88089c503f88 R08: 88089c50e060 R09: 
R10: 0001 R11:  R12: 0017
R13: 02f04bb17dd5 R14:  R15: 0092
FS:  () GS:88089c50() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 0078 CR3: 01a85000 CR4: 001406e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper (pid: 0, threadinfo 8810745c, task 8808740f2080)
Stack:
 000116a0 0087 88089c503f78 0046
 88089c503f98   
   88089c503f98 81076d86
Call Trace:
 
 [] irq_exit+0x76/0x90
 [] smp_thermal_interrupt+0x26/0x40
 [] thermal_interrupt+0x13/0x20
 
 [] ? start_secondary+0x127/0x2ef
 [] ? start_secondary+0x120/0x2ef

The code currently assumes that the tick device is initialized when
irq_enter() and irq_exit() are called.  This is not correct and a check must
be performed prior to entering the tick code through these code paths to
ensure that the tick device is initialized and running.
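
A minimal standalone model of such a check is sketched below.  The types are
reduced stand-ins for the kernel structures (only the fields the check needs,
and a shortened mode enum), and tick_device_ready() and tick_irq_exit() are
hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel types; only the fields used here. */
enum clock_event_mode { CLOCK_EVT_MODE_UNUSED = 0, CLOCK_EVT_MODE_ONESHOT };

struct clock_event_device { enum clock_event_mode mode; };
struct tick_device { struct clock_event_device *evtdev; };

/* Nonzero only when an event device is registered and in use. */
int tick_device_ready(const struct tick_device *td)
{
	assert(td != NULL);
	return td->evtdev != NULL && td->evtdev->mode != CLOCK_EVT_MODE_UNUSED;
}

/* Mock IRQ-exit hook: counts how often the tick code would be entered. */
int tick_irq_exit(const struct tick_device *td, int *entered)
{
	if (!tick_device_ready(td))
		return 0;	/* CPU bring-up window: skip the tick code */
	(*entered)++;
	return 1;
}
```

An interrupt that arrives before the event device is registered, or while it
is still unused, returns early instead of touching tick state.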

I've only seen this occur on a few systems.  I've tested with and without the
patch and as far as I can tell this patch resolves the problem on
linux.git top of tree.

Signed-off-by: Prarit Bhargava 
Cc: Thomas Gleixner 
Cc: John Stultz 
---
 kernel/time/tick-sched.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a19a399..5027187 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -567,6 +567,12 @@ EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
 void tick_nohz_irq_exit(void)
 {
struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+   struct clock_event_device *dev =
+__get_cpu_var(tick_cpu_device).evtdev;
+
+   /* Has the tick been initialized yet? */
+   if (unlikely(!dev || dev->mode == CLOCK_EVT_MODE_UNUSED))
+   return;
 
if (!ts->inidle)
return;
@@ -809,6 +815,12 @@ static inline void tick_check_nohz(int cpu) { }
  */
 void tick_check_idle(int cpu)
 {
+   struct clock_event_device *dev = per_cpu(tick_cpu_device, cpu).evtdev;
+
+   /* Has the tick been initialized yet? */
+   if (unlikely(!dev || dev->mode == CLOCK_EVT_MODE_UNUSED))
+   return;
+
tick_check_oneshot_broadcast(cpu);
tick_check_nohz(cpu);
 }
-- 
1.7.9.3


Re: [PATCH] NOHZ, check to see if tick device is initialized in IRQ handling path

2013-05-03 Thread Prarit Bhargava



On 05/03/2013 04:10 AM, Thomas Gleixner wrote:
> On Fri, 3 May 2013, Thomas Gleixner wrote:
> 
>> On Tue, 30 Apr 2013, Prarit Bhargava wrote:
>>
>>> 2nd try at this ... going with a more global cc.
>>>
>>> I think the linux.git "system hang" isn't really a hang.  For some reason the
>>> panic text wasn't displayed on the console.  I've seen this behaviour a few
>>> times now ... maybe there's a bug in the panic output path?
>>>
>>> It seems that the power interrupt is an error with the CPU exceeded the
>>> OSes current requested frequency on the package.  If I disable on demand
>>> cpu frequency, the problem goes away.
>>
>> Huch?
>>  
>>> Anyhoo, here's a patch...
>>>
>>> 8<
>>>
>>> When adding a CPU there is a small window in which interrupts are enabled and
>>> the clock tick device has not been initialized.  If an interrupt occurs in
>>
>> What's that small window and why does it exist?
>>
>>> this window, irq_exit() will be called which calls tick_nohz_irq_exit() which
>>> in turn calls __tick_nohz_idle_enter().
>>>
>>> __tick_nohz_idle_enter() assumes that the tick has been initialized.  In the
>>> above case, however, it has not and this leads to what appears to be a system
>>> hang on latest linux.git or the following panic on RHEL6:
>>>
>>> Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
>>> RIP: 0010:[]  [] 
>>> tick_nohz_stop_sched_tick+0x2a5/0x3e0
>>> RSP: 0018:88089c503f38  EFLAGS: 00010046
>>> RAX: 81c07520 RBX: 88089c5116a0 RCX: 02f04bb18cd8
>>> RDX:  RSI: a1b5 RDI: 02f04bb0eb23
>>> RBP: 88089c503f88 R08: 88089c50e060 R09: 
>>> R10: 0001 R11:  R12: 0017
>>> R13: 02f04bb17dd5 R14:  R15: 0092
>>> FS:  () GS:88089c50() knlGS:
>>> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
>>> CR2: 0078 CR3: 01a85000 CR4: 001406e0
>>> DR0:  DR1:  DR2: 
>>> DR3:  DR6: 0ff0 DR7: 0400
>>> Process swapper (pid: 0, threadinfo 8810745c, task 8808740f2080)
>>> Stack:
>>>  000116a0 0087 88089c503f78 0046
>>>  88089c503f98   
>>>    88089c503f98 81076d86
>>> Call Trace:
>>>  
>>>  [] irq_exit+0x76/0x90
>>>  [] smp_thermal_interrupt+0x26/0x40
>>>  [] thermal_interrupt+0x13/0x20
>>>  
>>>  [] ? start_secondary+0x127/0x2ef
>>>  [] ? start_secondary+0x120/0x2ef
>>>
>>> The code currently assumes that the tick device is initialized when
>>> irq_enter() and irq_exit() are called.  This is not correct and a check must
>>> be performed prior to entering the tick code through these code paths to
>>> ensure that the tick device is initialized and running.
>>>
>>> I've only seen this occur on a few systems.  I've tested with and without the
>>> patch and as far as I can tell this patch resolves the problem on
>>> linux.git top of tree.
>>>
>>> Signed-off-by: Prarit Bhargava 
>>> Cc: Thomas Gleixner 
>>> Cc: John Stultz 
>>> ---
>>>  kernel/time/tick-sched.c |   12 
>>>  1 file changed, 12 insertions(+)
>>>
>>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>>> index a19a399..5027187 100644
>>> --- a/kernel/time/tick-sched.c
>>> +++ b/kernel/time/tick-sched.c
>>> @@ -567,6 +567,12 @@ EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
>>>  void tick_nohz_irq_exit(void)
>>>  {
>>> struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
>>> +   struct clock_event_device *dev =
>>> +__get_cpu_var(tick_cpu_device).evtdev;
>>> +
>>> +   /* Has the tick been initialized yet? */
>>> +   if (unlikely(!dev || dev->mode == CLOCK_EVT_MODE_UNUSED))
>>> +   return;
>>>  
>>> if (!ts->inidle)
>>> return;
> 
> This does not make any sense at all.
> 
> If ts->i

Re: [PATCH] NOHZ, check to see if tick device is initialized in IRQ handling path

2013-05-03 Thread Prarit Bhargava



On 05/03/2013 09:02 AM, Thomas Gleixner wrote:
> On Fri, 3 May 2013, Prarit Bhargava wrote:
>> Down a cpu and then bring it back up.
> 
> A. So the issue is, that we do not clear the per cpu ts->inidle
> and friends when we bring the cpu down.
> 

:)  I'll give this a shot and will update with testing results.

Thanks tglx!

P.

Re: [tip:timers/urgent] tick: Cleanup NOHZ per cpu data on cpu down

2013-05-05 Thread Prarit Bhargava



On 05/05/2013 02:20 AM, tip-bot for Thomas Gleixner wrote:
> Commit-ID:  ae7868e241c015aadc8632d9fe633a102a5918f6
> Gitweb: http://git.kernel.org/tip/ae7868e241c015aadc8632d9fe633a102a5918f6
> Author: Thomas Gleixner 
> AuthorDate: Fri, 3 May 2013 15:02:50 +0200
> Committer:  Thomas Gleixner 
> CommitDate: Sun, 5 May 2013 08:15:11 +0200
> 
> tick: Cleanup NOHZ per cpu data on cpu down
> 
> Prarit reported a crash on CPU offline/online. The reason is that on
> CPU down the NOHZ related per cpu data of the dead cpu is not cleaned
> up. If at cpu online an interrupt happens before the per cpu tick
> device is registered the irq_enter() check potentially sees stale data
> and dereferences a NULL pointer.
> 
> Cleanup the data after the cpu is dead.
> 
> Reported-by: Prarit Bhargava 
> Cc: sta...@vger.kernel.org
> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionos
> Signed-off-by: Thomas Gleixner 
> ---
>  kernel/time/tick-common.c   | 1 +
>  kernel/time/tick-internal.h | 6 ++
>  kernel/time/tick-sched.c| 7 +++
>  3 files changed, 14 insertions(+)

Whoa -- I thought I said I'll test this first.  It doesn't work :( so that means
something else is wrong ... I'm in the middle of debug ATM.

P.

> 
> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> index 6176a3e..29b765d 100644
> --- a/kernel/time/tick-common.c
> +++ b/kernel/time/tick-common.c
> @@ -387,6 +387,7 @@ static int tick_notify(struct notifier_block *nb, unsigned long reason,
>   tick_shutdown_broadcast_oneshot(dev);
>   tick_shutdown_broadcast(dev);
>   tick_shutdown(dev);
> + tick_shutdown_nohz(dev);
>   break;
>  
>   case CLOCK_EVT_NOTIFY_SUSPEND:
> diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
> index f0299ea..9644b29 100644
> --- a/kernel/time/tick-internal.h
> +++ b/kernel/time/tick-internal.h
> @@ -144,3 +144,9 @@ static inline int tick_device_is_functional(struct clock_event_device *dev)
>  #endif
>  
>  extern void do_timer(unsigned long ticks);
> +
> +#ifdef CONFIG_NO_HZ
> +extern void tick_shutdown_nohz(unsigned int *cpup);
> +#else
> +static inline void tick_shutdown_nohz(unsigned int *cpup) { }
> +#endif
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 225f8bf..e985ccd 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -797,6 +797,13 @@ static inline void tick_check_nohz(int cpu)
>   }
>  }
>  
> +void tick_shutdown_nohz(unsigned int *cpup)
> +{
> + struct tick_sched *ts = &per_cpu(tick_cpu_sched, *cpup);
> +
> + memset(ts, 0, sizeof(*ts));
> +}
> +
>  #else
>  
>  static inline void tick_nohz_switch_to_nohz(void) { }

[PATCH] NOHZ, check to see if tick device is initialized in IRQ handling path

2013-04-15 Thread Prarit Bhargava

I think the linux.git "system hang" isn't really a hang.  For some reason the
panic text wasn't displayed on the console.  I've seen this behaviour a few
times now ... maybe there's a bug in the panic output path?

I also haven't determined why I'm seeing a thermal interrupt during a CPU
hotplug test.  It could be a problem with the hardware I'm testing on, but
the bottom line is that it is possible to take an interrupt before the
tick's evtdev is set.

I can also see via some simple printk debug that it certainly looks like
the thermal interrupt code is enabled prior to the tick:

echo 1 > /sys/.../cpu/cpu20/online

[  347.402647] thermal_throttle_cpu_callback: cpu 20 add dev  <<< my debug
[  347.408992] Booting Node 0 Processor 20 APIC 0x1
[  347.429219] cpu_notify: called with CPU_STARTING <<< my debug
[  347.431261] cpu_notify: called with CPU_ONLINE <<< my debug
[  347.429219] tick_check_new_device: new cpu 20 evtdev start  <<< my debug
[  347.462542] microcode: CPU20 sig=0x306e2, pf=0x1, revision=0x209
[  347.469276] platform microcode: firmware: requesting intel-ucode/06-3e-02

Here's a patch...

8<

When adding a CPU there is a small window in which interrupts are enabled and
the clock tick device has not been initialized.  If an interrupt occurs in
this window, irq_exit() will be called which calls tick_nohz_irq_exit() which
in turn calls __tick_nohz_idle_enter().

__tick_nohz_idle_enter() assumes that the tick has been initialized.  In the
above case, however, it has not and this leads to what appears to be a system
hang on latest linux.git or the following panic on RHEL6:

Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
RIP: 0010:[]  [] tick_nohz_stop_sched_tick+0x2a5/0x3e0
RSP: 0018:88089c503f38  EFLAGS: 00010046
RAX: 81c07520 RBX: 88089c5116a0 RCX: 02f04bb18cd8
RDX:  RSI: a1b5 RDI: 02f04bb0eb23
RBP: 88089c503f88 R08: 88089c50e060 R09: 
R10: 0001 R11:  R12: 0017
R13: 02f04bb17dd5 R14:  R15: 0092
FS:  () GS:88089c50() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 0078 CR3: 01a85000 CR4: 001406e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper (pid: 0, threadinfo 8810745c, task 8808740f2080)
Stack:
 000116a0 0087 88089c503f78 0046
 88089c503f98   
   88089c503f98 81076d86
Call Trace:
 
 [] irq_exit+0x76/0x90
 [] smp_thermal_interrupt+0x26/0x40
 [] thermal_interrupt+0x13/0x20
 
 [] ? start_secondary+0x127/0x2ef
 [] ? start_secondary+0x120/0x2ef

The code currently assumes that the tick device is initialized when
irq_enter() and irq_exit() are called.  This is not correct and a check must
be performed prior to entering the tick code through these code paths to
ensure that the tick device is initialized and running.
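
As a rough userspace illustration of that check (not the kernel code
itself), the guard can be modelled as below.  The struct layouts, the
enum values and the model_irq_exit() helper are simplified assumptions
made for this sketch; only the condition mirrors the one in the patch.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types named in the patch. */
enum clock_event_mode {
    CLOCK_EVT_MODE_UNUSED = 0,
    CLOCK_EVT_MODE_PERIODIC,
    CLOCK_EVT_MODE_ONESHOT,
};

struct clock_event_device {
    enum clock_event_mode mode;
};

struct tick_device {
    struct clock_event_device *evtdev;
};

/* Returns 1 only when a device is registered and has left UNUSED mode. */
static int tick_is_initialized(const struct tick_device *td)
{
    const struct clock_event_device *dev = td->evtdev;

    return dev != NULL && dev->mode != CLOCK_EVT_MODE_UNUSED;
}

/* Model of an irq-exit hook: bail out early while the tick is not ready. */
static int model_irq_exit(const struct tick_device *td, int *nohz_calls)
{
    if (!tick_is_initialized(td))
        return 0;       /* skipped: there is no tick to reprogram yet */
    (*nohz_calls)++;    /* stands in for the nohz idle-enter path */
    return 1;
}
```

An interrupt that arrives before the event device is registered, or while
it is still marked unused, returns without touching the tick code.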

I've only seen this occur on one system.  I've tested with and without the
patch and as far as I can tell this patch resolves the problem on
linux.git top of tree.

Signed-off-by: Prarit Bhargava 
Cc: Thomas Gleixner 
---
 kernel/time/tick-sched.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a19a399..5027187 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -567,6 +567,12 @@ EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
 void tick_nohz_irq_exit(void)
 {
struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+   struct clock_event_device *dev =
+__get_cpu_var(tick_cpu_device).evtdev;
+
+   /* Has the tick been initialized yet? */
+   if (unlikely(!dev || dev->mode == CLOCK_EVT_MODE_UNUSED))
+   return;
 
if (!ts->inidle)
return;
@@ -809,6 +815,12 @@ static inline void tick_check_nohz(int cpu) { }
  */
 void tick_check_idle(int cpu)
 {
+   struct clock_event_device *dev = per_cpu(tick_cpu_device, cpu).evtdev;
+
+   /* Has the tick been initialized yet? */
+   if (unlikely(!dev || dev->mode == CLOCK_EVT_MODE_UNUSED))
+   return;
+
tick_check_oneshot_broadcast(cpu);
tick_check_nohz(cpu);
 }
-- 
1.7.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86, clocksource, fix !CONFIG_CLOCKSOURCE_WATCHDOG compile

2013-05-27 Thread Prarit Bhargava



On 05/27/2013 06:00 AM, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Prarit Bhargava wrote:
> 
>> If I explicitly disable the clocksource watchdog in the x86 Kconfig,
> 
> And why do you want to do that?

Hey Thomas, I was debugging something and stumbled across this.

IIRC the issue was that there was some weirdness on a series of new AMD systems and
unfortunately the watchdog would fire and switch clocksources on me :(  That
resulted in me not being able to debug the HW because the clocksource I wanted
was no longer available.

P.

> 
>> the x86 kernel will not compile unless this is properly defined.
>>
>> Signed-off-by: Prarit Bhargava 
>> Cc: John Stultz 
>> Cc: Thomas Gleixner 
>> Cc: x...@kernel.org
>> ---
>>  kernel/time/clocksource.c |1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index c958338..e04821f 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -450,6 +450,7 @@ static void clocksource_enqueue_watchdog(struct 
>> clocksource *cs)
>>  static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
>>  static inline void clocksource_resume_watchdog(void) { }
>>  static inline int clocksource_watchdog_kthread(void *data) { return 0; }
>> +void clocksource_mark_unstable(struct clocksource *cs) { }
>>  
>>  #endif /* CONFIG_CLOCKSOURCE_WATCHDOG */
>>  
>> -- 
>> 1.7.9.3
>>
>>

Re: [PATCH 0/8] Move ntp state to be protected by timekeeping lock

2013-03-28 Thread Prarit Bhargava



On 03/25/2013 04:08 PM, John Stultz wrote:

> 
> The problem of course, is that the new asynchronous behavior the
> shadow updates breaks some of the assumptions on how the NTP state
> is used. Thus to correct this, we really need to serialize the ntp
> state updates along with the timekeeping state. With the added side
> benefit that of reducing lock acquisitions.
> 
> The downside is that the timekeeping state has been cleaned up to
> live nicely in the timekeeper struct, which nicely bounded what the
> timekeeping lock protected. Where as 99% of the NTP state was all
> static to ntp.c, and was protected by the ntp.c static ntp_lock, so
> it was all nicely encapsulated as well. 

> 
> This patchset makes the lock ownership lines less obvious, but I've
> been sure to keep the ntp state static to ntp.c and instead provided
> some accessors via ntp-internal.h that timekeeping code can use to
> make changes.  The only really ugly part is that do_adjtimex() has
> to split some of the logic between timekeeping.c and ntp.c in order
> to really get the locking done correctly.


John, I have no technical objection to the patch ... but after reviewing the
changes I think you've significantly changed the way the locking works in the
NTP code, and IMO, some note should be made in the code about the timekeeper
lock and its impact to ntp.  It's not trivial reading this code and I think the
dropping of the ntp lock will confuse the casual viewer.

IMO of course.

> 
> I may try to rework the code in the future so that the timekeeper
> holds the ntp state and passes it to the ntp.c functions to be
> modified, but that is a much deeper rework than I'd like to do right
> now, and causes further complexity to the shadow-state updates, since
> we'd end up unnecessarily copying the ntp state back and forth every
> time.
> 
> This applies on top of my fortglx/3.10/time queue here:
> git://git.linaro.org/people/jstultz/linux.git fortglx/3.10/time
> 
> If you want to see this entire set (along with Thomas' shadow-update
> work) it can be found here:
> git://git.linaro.org/people/jstultz/linux.git dev/tglx-shadowtime
> or 
> http://git.linaro.org/gitweb?p=people/jstultz/linux.git;a=shortlog;h=refs/heads/dev/tglx-shadowtime
> 

Beyond the above comment, a quick test shows that ntp does work AFAICT (at least
on F18 + your git tree).  I'll try and do a heavier test next week.

So for now ...

Acked-by: Prarit Bhargava 

P.

[RFE PATCH] hrtimer, add expiry time overflow check in hrtimer_interrupt

2013-03-28 Thread Prarit Bhargava

The settimeofday01 test in the LTP testsuite effectively does

gettimeofday(current time);
settimeofday(Jan 1, 1970 + 100 seconds);
settimeofday(current time);

This test causes a stack trace to be displayed on the console during the
setting of timeofday to Jan 1, 1970 + 100 seconds:

[  131.066751] ------------[ cut here ]------------
[  131.096448] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x135/0x140()
[  131.104935] Hardware name: Dinar
[  131.108150] Modules linked in: sg nfsv3 nfs_acl nfsv4 auth_rpcgss nfs 
dns_resolver fscache lockd sunrpc nf_conntrack_netbios_ns 
nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle 
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables 
kvm_amd kvm sp5100_tco bnx2 i2c_piix4 crc32c_intel k10temp fam15h_power 
ghash_clmulni_intel amd64_edac_mod pcspkr serio_raw edac_mce_amd edac_core 
microcode xfs libcrc32c sr_mod sd_mod cdrom ata_generic crc_t10dif pata_acpi 
radeon i2c_algo_bit drm_kms_helper ttm drm ahci pata_atiixp libahci libata 
usb_storage i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  131.176784] Pid: 0, comm: swapper/28 Not tainted 3.8.0+ #6
[  131.182248] Call Trace:
[  131.184684][] warn_slowpath_common+0x7f/0xc0
[  131.191312]  [] warn_slowpath_null+0x1a/0x20
[  131.197131]  [] clockevents_program_event+0x135/0x140
[  131.203721]  [] tick_program_event+0x24/0x30
[  131.209534]  [] hrtimer_interrupt+0x131/0x230
[  131.215437]  [] ? cpufreq_p4_target+0x130/0x130
[  131.221509]  [] smp_apic_timer_interrupt+0x69/0x99
[  131.227839]  [] apic_timer_interrupt+0x6d/0x80
[  131.233816][] ? sched_clock_cpu+0xc5/0x120
[  131.240267]  [] ? cpuidle_wrap_enter+0x50/0xa0
[  131.246252]  [] ? cpuidle_wrap_enter+0x49/0xa0
[  131.252238]  [] cpuidle_enter_tk+0x10/0x20
[  131.257877]  [] cpuidle_idle_call+0xa9/0x260
[  131.263692]  [] cpu_idle+0xaf/0x120
[  131.268727]  [] start_secondary+0x255/0x257
[  131.274449] ---[ end trace 1151a50552231615 ]---

When we change the system time to a low value like this, the value of
timekeeper->offs_real will be a negative value.

It seems that the WARN occurs because an hrtimer has been started in the time
between the releasing of the timekeeper lock and the IPI call (via a call to
on_each_cpu) in clock_was_set() in the do_settimeofday() code.  The end result
is that a REALTIME_CLOCK timer has been added with softexpires = expires =
KTIME_MAX.  The hrtimer_interrupt() fires/is called and the loop at
kernel/hrtimer.c:1289 is executed.  In this loop the code subtracts the
clock base's offset (which was set to timekeeper->offs_real in
do_settimeofday()) from the current hrtimer_cpu_base->expiry value (which
was KTIME_MAX):

KTIME_MAX - (a negative value) = overflow

A simple check for an overflow can resolve this problem.  Using KTIME_MAX
instead of the overflow value will result in the hrtimer function being run,
and the reprogramming of the timer after that.
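
For illustration, the arithmetic can be reproduced in userspace with
64-bit integers.  The helper name below is hypothetical, and unlike the
patch (which tests the result after the subtraction) this sketch tests
before subtracting, because signed overflow is undefined in standard C.

```c
#include <stdint.h>

#define KTIME_MAX INT64_MAX    /* largest value a 64-bit ktime can hold */

/*
 * Subtract a clock-base offset from an expiry time, saturating at the
 * limits instead of wrapping around.
 */
static int64_t expiry_minus_offset(int64_t expires, int64_t offset)
{
    if (offset < 0 && expires > KTIME_MAX + offset)
        return KTIME_MAX;    /* result would exceed KTIME_MAX */
    if (offset > 0 && expires < INT64_MIN + offset)
        return INT64_MIN;    /* symmetric case, not the one reported */
    return expires - offset;
}
```

With expires = KTIME_MAX and a negative offset the function returns
KTIME_MAX, which is the value the patch substitutes for the wrapped result.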

Signed-off-by: Prarit Bhargava 
Cc: Thomas Gleixner 
Cc: John Stultz 
---
 kernel/hrtimer.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index cc47812..17eafd7 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1309,6 +1309,8 @@ retry:
 
expires = ktime_sub(hrtimer_get_expires(timer),
base->offset);
+   if (expires.tv64 < 0)
+   expires.tv64 = KTIME_MAX;
if (expires.tv64 < expires_next.tv64)
expires_next = expires;
break;
-- 
1.7.9.3


Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

2013-03-09 Thread Prarit Bhargava

On 03/04/2013 08:24 AM, Don Dutile wrote:
> On 03/02/2013 10:59 AM, Andreas Mohr wrote:
>> Hi,
>>
>>> if ((revision == 0x13)&&  irq_remapping_enabled) {
>>> +pr_warn("WARNING WARNING WARNING WARNING WARNING
>>> WARNING\n"
>>> +"This system BIOS has enabled interrupt
>>> remapping\n"
>>> +"on a chipset that contains an errata making
>>> that\n"
>>> +"feature unstable.  Please reboot with
>>> nointremap\n"
>>> +"added to the kernel command line and contact\n"
>>> +"your BIOS vendor for an update");
>>> +}
>>
>> Forgive me, but ISTR that there's a special BIOS firmware quirk bug
>> annotating logger warning message mechanism (have I managed to hit all
>> keywords yet? ;) in the kernel which might be useful in this case.
>>
>>
>> OK, found something (but I don't think it was the mechanism
>> that ISTR - perhaps it got modernized?):
>>
>>
>> include/linux/printk.h:
>>
>> /*
>>   * FW_BUG
>>   * Add this to a message where you are sure the firmware is buggy or behaves
>>   * really stupid or out of spec. Be aware that the responsible BIOS developer
>>   * should be able to fix this issue or at least get a concrete idea of the
>>   * problem by reading your message without the need of looking at the kernel
>>   * code.
>>   *
>>   * Use it for definite and high priority BIOS bugs.
>>   *
>>   * FW_WARN
>>   * Use it for not that clear (e.g. could the kernel messed up things already?)
>>   * and medium priority BIOS bugs.
>>   *
>>   * FW_INFO
>>   * Use this one if you want to tell the user or vendor about something
>>   * suspicious, but generally harmless related to the firmware.
>>   *
>>   * Use it for information or very low priority BIOS bugs.
>>   */
>>
> 
> It is not a firmware/BIOS bug. 

Correct.  This is a hardware bug that *may be* resolved through a BIOS update.
But there is no guarantee that a BIOS update is available.  Labelling it a FW
bug would be a mistake.

> Prarit's comment to annotate it as
> a HW_ERR is more accurate.  A software patch is being tested now
> to see if it can do set-affinity in a manner that avoids this race
> and enables IR to stay on for all these systems.  It requires
> more testing to ensure the logic is valid.  This patch was
> recommended as a necessary short-term fix, and to highlight to
> others this possible state -- which Gerry mentioned he had.
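
To make the difference between the two annotations concrete, here is a
small userspace sketch.  The prefix strings are the ones defined in
include/linux/printk.h; the format_quirk_warning() helper and the message
text are hypothetical and are not taken from the patch under discussion.

```c
#include <stdio.h>

/* Prefix strings as defined in include/linux/printk.h. */
#define FW_BUG "[Firmware Bug]: "
#define HW_ERR "[Hardware Error]: "

/* Build a warning with the given prefix; returns snprintf()'s result. */
static int format_quirk_warning(char *out, size_t len, const char *prefix)
{
    /* Illustrative wording only. */
    return snprintf(out, len,
                    "%sinterrupt remapping is unreliable on this chipset",
                    prefix);
}
```

Passing HW_ERR rather than FW_BUG changes only the prefix, but it tells
the reader of the log which party is expected to fix the problem.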

Yup -- as mstowe asked ... should we even consider this patch then, or should we
wait for the possible real fix?

Having said that ... I'm nervous about playing around with the set-affinity path
for this HW problem.  We're basically changing good/reliable code for broken-ass
hardware.  :/  That doesn't seem like a good choice to me.

I can understand if we all feel that the code is broken, or it can be made
better -- but to change it because of bad HW just doesn't seem like the right
thing to do.

IMO.

P.

Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

2013-03-11 Thread Prarit Bhargava

On 03/11/2013 07:25 AM, Neil Horman wrote:
> On Sat, Mar 09, 2013 at 03:20:57PM -0700, Myron Stowe wrote:
>> On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman  wrote:
>>> On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote:
>>>> A few years back Intel published a spec update:
>>>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>>>
>>>> For the 5520 and 5500 chipsets which contained an errata (specifically
>>>> errata 53), which noted that these chipsets can't properly do interrupt
>>>> remapping, and as a result they recommend that interrupt remapping be
>>>> disabled in bios.  While many vendors have a bios update to do exactly
>>>> that, not all do, and of course not all users update their bios to a
>>>> level that corrects the problem.  As a result, occasionally interrupts
>>>> can arrive at a cpu even after affinity for that interrupt has been
>>>> moved, leading to lost or spurious interrupts (usually characterized by
>>>> the message:
>>>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>>>
>>>> There have been several incidents recently of people seeing this error,
>>>> and investigation has shown that they have systems for which their BIOS
>>>> level is such that this feature was not properly turned off.  As such, it
>>>> would be good to give them a reminder that their systems are vulnerable
>>>> to this problem.
>>>>
>>>> Signed-off-by: Neil Horman 
>>>> CC: Prarit Bhargava 
>>>> CC: Don Zickus 
>>>> CC: Don Dutile 
>>>> CC: Bjorn Helgaas 
>>>> CC: Asit Mallick 
>>>> CC: linux-...@vger.kernel.org
>>>>
>>> Ping, anyone want to Ack/Nack this?
>>
>> Don's comment earlier seems to imply that this is a short term fix and
>> that a more long term fix may be coming soon.  If that is the case
>> wouldn't we want to wait for the long term fix and just pull that in?
>>
>> Myron
>>
> As Don and Prarit have mentioned, an alternate change is being worked on and
> tested that may work around this issue, but we're not yet sure that it will,
> and we're not sure of the time frame for this fix.  Normally I would agree
> that it would be easier just to wait for the long term fix, but as Prarit
> noted, since this hardware is in fact broken, I would rather do a both
> approach.  It's fine if this gets reverted tomorrow with a longer term fix as
> far as I'm concerned, it's just caused enough problems already that I'd like
> to see it in place until the better solution arrives.

I agree with Neil on this.  While vendors are supposed to fix their BIOSes,
experience has shown that not all vendors will fix their BIOSes for a problem
like this.

Ack this quirk.

P.

> Neil
>  
> 


Re: [PATCH] time: Fix casting issue in tk_set_xtime and tk_xtime_add

2012-07-24 Thread Prarit Bhargava



On 07/23/2012 04:22 PM, John Stultz wrote:
> Fix missing casts that can cause boot problems on 32bit systems,
> most easily observed with Xen systems. This issue was introduced
> w/ 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1.
> 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Cc: Prarit Bhargava 
> Cc: Konrad Rzeszutek Wilk 
> Reported-by: Konrad Rzeszutek Wilk 
> Tested-by: Konrad Rzeszutek Wilk 
> Signed-off-by: John Stultz 

Acked-by: Prarit Bhargava 

P.

Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()

2012-07-11 Thread Prarit Bhargava



On 07/10/2012 06:43 PM, John Stultz wrote:
> clock_was_set() cannot be called from hard interrupt context because
> it calls on_each_cpu(). For fixing the widely reported leap seconds
> issue it's necessary to call it from the timer interrupt context.
> 
> Provide a new function which denotes it in the hrtimer cpu base
> structure of the cpu on which it is called and raising the timer
> softirq.
> 
> We then execute the clock_was_set() notification in the timer softirq
> context in hrtimer_run_pending().
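
The deferral described above can be sketched in userspace as a flag that
is set in one context and consumed in another.  Everything below is a
simplified model: the struct, its field and the function names are
assumptions made for the example, not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; the field name is an assumption. */
struct cpu_base_model {
    bool clock_was_set_pending;
};

/* Counts how often the (modelled) notification actually ran. */
static int notifications_sent;

/* Callable from "hard interrupt" context: it only records the request. */
static void clock_was_set_delayed_model(struct cpu_base_model *base)
{
    base->clock_was_set_pending = true;
    /* the kernel would additionally raise the timer softirq here */
}

/* Runs later, in a context where cross-CPU calls are permitted. */
static void run_pending_model(struct cpu_base_model *base)
{
    if (base->clock_was_set_pending) {
        base->clock_was_set_pending = false;
        notifications_sent++;    /* stands in for clock_was_set() */
    }
}
```

Several requests made before the deferred handler runs collapse into a
single notification, which is acceptable because the notification carries
no payload.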

I wish there was a nicer way to do this ... but looking at the code I can't
figure out a better way.  (no offense John, it's just the way the code is ;) )

P.

Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)

2012-07-11 Thread Prarit Bhargava



On 07/10/2012 06:43 PM, John Stultz wrote:
> Over the weekend, Thomas got a chance to review the leap second fix
> in more detail and had a few additional changes he wanted to make
> to improve performance as well as style.
> 
> So this iteration includes his modifications.
> 
> Once merged, I'll be working to get the backports finished as quickly
> as I can and sent to -stable.
> 
> thanks
> -john
> 

Add an Acked-by: Prarit Bhargava 

John -- I'll do some testing of this patchset today and tomorrow and let you
know the results.

P.

Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()

2012-07-11 Thread Prarit Bhargava


>> I wish there was a nicer way to do this ... but looking at the code I can't
>> figure out a better way.  (no offense John, it's just the way the code is ;) )
> 
> Yeah, I had the same discussion with Peter earlier today. There is
> only a rather limited set of options.
> 
> 1) Retrigger the timer interrupt vectors on all CPUs - except the one
>we are running on, but we have no interface for that at the moment
> 
> 2) Do the nasty __smp_call_function_single() hack
> 
>Preallocate call_single_data for all cpus and do a
>__smp_call_function_single() on all online cpus.
> 
>This can be called from hard interrupt context or irq disabled
>regions.
> 
>That would allow to get rid of the whole delay magic all
>together.
> 
> Thoughts?
> 

Both of those options seem like a lot of work for something that happens once
every 3-4 years, and may not happen ever again[1].  Based on that statement, if
we're going to modify code I would prefer that it be as lightweight as possible.
 So, in terms of the kernel, option 2 is likely the best way to go rather than
introducing new code that will be used once every 3-4 years.

I keep asking the question of why the mechanism of inserting a leap second isn't
moved into userspace ntpd (or some other appropriate daemon).  I suppose there
is a risk of ntpd being starved out on heavily loaded systems...

P.

[1] http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds

Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()

2012-07-12 Thread Prarit Bhargava

On 07/12/2012 03:44 AM, Jan Ceuleers wrote:
> On 07/11/2012 06:47 PM, John Stultz wrote:
>> I'll see if my worry is unfounded, but it might be a bit too clever for
>> rare events.
> 
> Full ACK.
> 
> There is an unfortunate history of critical-to-moderately-serious bugs in
> the leap second handling, so I submit that what is needed is a simple,
> obviously-correct and robust mechanism. Robust statically, but also in the
> face of code churn because these code paths are exercised so rarely out in
> the wild.
> 
> Just my opinion, FWIW.
> 

Ditto - and it's not just FWIW.

John (and everyone else), I think we're over-thinking this.  Would it be nice to
get an extremely elegant solution to this?  Yeah ... it would.  But the reality
is that we're not going to get there and IMO we're making things too complex for
this little piece of code.

Acked-by: Prarit Bhargava 

IMO, this is the simplest way to move forward with this code.

P.


Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)

2012-07-12 Thread Prarit Bhargava

On 07/11/2012 07:17 AM, Ingo Molnar wrote:
> 
> * John Stultz  wrote:
> 
>> Over the weekend, Thomas got a chance to review the leap 
>> second fix in more detail and had a few additional changes he 
>> wanted to make to improve performance as well as style.

John,

FYI -- Using a mix of AMD and Intel systems (big and small, large memory and
small memory footprint), current test runs are ~18 hours at this point without
any issues seen, using a slightly modified version of your leap-a-day.c .

P.

[RFE PATCH 0/2] x86, rtc, ntp, Enable full rtc synchronization

2013-02-14 Thread Prarit Bhargava

This patchset enables a full rtc synchronization via ntp on x86.  The current
codebase (plus http://marc.info/?l=linux-kernel&m=136036689219340&w=2, which is
queued for tip), will attempt to synchronize the rtc to the system time every
11 minutes if ntp is running.

The problem in the current code is that the synchronization will only occur
if the system time is within +/-15 minutes of the current rtc time, i.e.
we only do a "partial" synchronization of the rtc.  Other architectures do
full synchronizations and the partial sync appears to be a software limitation.
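
The window comes from the minutes-only comparison in the routine that
patch 1/2 removes.  Below is a userspace transcription of that
comparison; the function name is hypothetical, and the CMOS register
accesses and BCD conversion are left out.

```c
#include <stdlib.h>

/*
 * Returns 1 if the old mach_set_rtc_mmss() logic would have written the
 * RTC, 0 if it would have refused the update.
 */
static int old_mmss_would_update(unsigned long nowtime, int cmos_minutes)
{
    int real_minutes = (int)(nowtime / 60);

    /* correct for half hour time zone */
    if (((abs(real_minutes - cmos_minutes) + 15) / 30) & 1)
        real_minutes += 30;
    real_minutes %= 60;

    return abs(real_minutes - cmos_minutes) < 30;
}
```

For a system time of 10:20:00 (nowtime = 37200) the update is accepted
when the RTC minutes read 20 or 10, and refused when they read 0, i.e.
when the RTC is 20 minutes behind.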

This patchset introduces a full synchronization of the rtc, and allows the
writing of the rtc date and time via sysfs (read for date and time is already
implemented).

I tested this patch by using the write capability introduced in 2/2 to
write in older and newer dates into the rtc, and then rebooting with ntpdate,
and/or ntpdate enabled and verifying the correct setting of the hwclock (and
system time) via calls to date and hwclock (all on 64-bit x86)

I have not tested the mrst/vrtc.c code, however, code inspection indicates
that the only change required is the year offset of 1972.  I booted 32-bit
Fedora 18 on an UEFI system and confirmed that the system time and hwclock
were now correct at boot.

Signed-off-by: Prarit Bhargava 
Cc: Thomas Gleixner 
Cc: John Stultz 
Cc: x...@kernel.org
Cc: Matt Fleming 
Cc: David Vrabel 
Cc: Andrew Morton 
Cc: Andi Kleen 
Cc: linux-...@vger.kernel.org

Prarit Bhargava (2):
  x86, rtc, ntp, Do full rtc synchronization with ntp
  rtc, add write functionality to sysfs

 arch/x86/kernel/rtc.c | 69 ++
 arch/x86/platform/efi/efi.c   | 24 
 arch/x86/platform/mrst/vrtc.c | 41 -
 drivers/rtc/rtc-sysfs.c   | 86 ++-
 4 files changed, 136 insertions(+), 84 deletions(-)

-- 
1.8.1.2


[RFE PATCH 1/2] x86, rtc, ntp, Do full rtc synchronization with ntp

2013-02-14 Thread Prarit Bhargava

Every 11 minutes ntp attempts to update the x86 rtc with the current
system time.  Currently, the x86 code only updates the rtc if the system
time is within +/-15 minutes of the current value of the rtc.  Other
architectures do a full synchronization and there is no reason that x86
should be software limited to a 30 minute window.

This patch changes the behavior of the kernel to do a full synchronization
(year, month, day, hour, minute, and second) of the rtc when ntp requests
a synchronization between the system time and the rtc.
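
As a userspace analogue of the full conversion, the sketch below breaks
a seconds-since-epoch value into all six fields.  gmtime() stands in for
the kernel's rtc_time_to_tm(), and the struct and function names are
hypothetical.

```c
#include <time.h>

/* Hypothetical container for the six fields written in a full sync. */
struct full_rtc_fields {
    int year, month, day, hour, minute, second;
};

/* Convert seconds since the epoch (UTC) into calendar fields. */
static int seconds_to_rtc_fields(time_t now, struct full_rtc_fields *out)
{
    const struct tm *tm = gmtime(&now);

    if (tm == NULL)
        return -1;
    out->year = tm->tm_year + 1900;    /* struct tm counts from 1900 */
    out->month = tm->tm_mon + 1;       /* struct tm months are 0-based */
    out->day = tm->tm_mday;
    out->hour = tm->tm_hour;
    out->minute = tm->tm_min;
    out->second = tm->tm_sec;
    return 0;
}
```

Because every field is derived from the same timestamp, the result does
not depend on how far the previous RTC contents were from the system time.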

I've used the RTC library functions in this patchset as they do all the
required bounds checking.

Signed-off-by: Prarit Bhargava 
Cc: Thomas Gleixner 
Cc: John Stultz 
Cc: x...@kernel.org
Cc: Matt Fleming 
Cc: David Vrabel 
Cc: Andrew Morton 
Cc: Andi Kleen 
Cc: linux-...@vger.kernel.org
---
 arch/x86/kernel/rtc.c | 69 ---
 arch/x86/platform/efi/efi.c   | 24 ++-
 arch/x86/platform/mrst/vrtc.c | 41 ++---
 3 files changed, 52 insertions(+), 82 deletions(-)

diff --git a/arch/x86/kernel/rtc.c b/arch/x86/kernel/rtc.c
index 801602b..44e6c3a 100644
--- a/arch/x86/kernel/rtc.c
+++ b/arch/x86/kernel/rtc.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_X86_32
 /*
@@ -36,70 +37,24 @@ EXPORT_SYMBOL(rtc_lock);
  * nowtime is written into the registers of the CMOS clock, it will
  * jump to the next second precisely 500 ms later. Check the Motorola
  * MC146818A or Dallas DS12887 data sheet for details.
- *
- * BUG: This routine does not handle hour overflow properly; it just
- *  sets the minutes. Usually you'll only notice that after reboot!
  */
 int mach_set_rtc_mmss(unsigned long nowtime)
 {
-   int real_seconds, real_minutes, cmos_minutes;
-   unsigned char save_control, save_freq_select;
-   unsigned long flags;
+   struct rtc_time tm;
int retval = 0;
 
-   spin_lock_irqsave(&rtc_lock, flags);
-
-/* tell the clock it's being set */
-   save_control = CMOS_READ(RTC_CONTROL);
-   CMOS_WRITE((save_control|RTC_SET), RTC_CONTROL);
-
-   /* stop and reset prescaler */
-   save_freq_select = CMOS_READ(RTC_FREQ_SELECT);
-   CMOS_WRITE((save_freq_select|RTC_DIV_RESET2), RTC_FREQ_SELECT);
-
-   cmos_minutes = CMOS_READ(RTC_MINUTES);
-   if (!(save_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD)
-   cmos_minutes = bcd2bin(cmos_minutes);
-
-   /*
-* since we're only adjusting minutes and seconds,
-* don't interfere with hour overflow. This avoids
-* messing with unknown time zones but requires your
-* RTC not to be off by more than 15 minutes
-*/
-   real_seconds = nowtime % 60;
-   real_minutes = nowtime / 60;
-   /* correct for half hour time zone */
-   if (((abs(real_minutes - cmos_minutes) + 15)/30) & 1)
-   real_minutes += 30;
-   real_minutes %= 60;
-
-   if (abs(real_minutes - cmos_minutes) < 30) {
-   if (!(save_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
-   real_seconds = bin2bcd(real_seconds);
-   real_minutes = bin2bcd(real_minutes);
-   }
-   CMOS_WRITE(real_seconds, RTC_SECONDS);
-   CMOS_WRITE(real_minutes, RTC_MINUTES);
+   rtc_time_to_tm(nowtime, &tm);
+   if (!rtc_valid_tm(&tm)) {
+   retval = set_rtc_time(&tm);
+   if (retval)
+   printk(KERN_ERR "%s: RTC write failed with error %d\n",
+  __FUNCTION__, retval);
} else {
-   printk_once(KERN_NOTICE
-  "set_rtc_mmss: can't update from %d to %d\n",
-  cmos_minutes, real_minutes);
-   retval = -1;
+   printk(KERN_ERR
+  "%s: Invalid RTC value: write of %lx to RTC failed\n",
+   __FUNCTION__, nowtime);
+   retval = -EINVAL;
}
-
-   /* The following flags have to be released exactly in this order,
-* otherwise the DS12887 (popular MC146818A clone with integrated
-* battery and quartz) will not reset the oscillator and will not
-* update precisely 500 ms later. You won't find this mentioned in
-* the Dallas Semiconductor data sheets, but who believes data
-* sheets anyway ...   -- Markus Kuhn
-*/
-   CMOS_WRITE(save_control, RTC_CONTROL);
-   CMOS_WRITE(save_freq_select, RTC_FREQ_SELECT);
-
-   spin_unlock_irqrestore(&rtc_lock, flags);
-
return retval;
 }
 
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 77cf009..6ca930e 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -48,6 +48,7 @@
 #include 
 #include

[RFE PATCH 2/2] rtc, add write functionality to sysfs

2013-02-14 Thread Prarit Bhargava

/sys/class/rtc/rtcX/date and /sys/class/rtc/rtcX/time currently have
read-only access.  This patch introduces write functionality which will
set the rtc time.

Usage: echo YYYY-MM-DD > /sys/class/rtc/rtcX/date
   echo HH:MM:SS > /sys/class/rtc/rtcX/time
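
A minimal userspace sketch of parsing the date string follows, assuming
the four-digit-year form YYYY-MM-DD that the read side prints.  The
function name is hypothetical and this is not the parsing code from the
patch below.

```c
#include <stdio.h>
#include <time.h>

/*
 * Parse "YYYY-MM-DD" (one trailing newline is tolerated) into the
 * year/month/day fields of a struct tm.  Returns 0 on success and -1 on
 * malformed or out-of-range input.
 */
static int parse_rtc_date(const char *buf, struct tm *tm)
{
    int year, month, day;
    char trailing;
    int n = sscanf(buf, "%4d-%2d-%2d%c", &year, &month, &day, &trailing);

    if (n < 3)
        return -1;               /* fewer than three fields matched */
    if (n == 4 && trailing != '\n')
        return -1;               /* unexpected text after the date */
    if (month < 1 || month > 12 || day < 1 || day > 31)
        return -1;
    tm->tm_year = year - 1900;   /* struct tm counts years from 1900 */
    tm->tm_mon = month - 1;      /* struct tm months are 0-based */
    tm->tm_mday = day;
    return 0;
}
```

The sketch uses %d rather than %i because %i treats a leading 0 as an
octal prefix, so a zero-padded field such as 08 would not be read as
eight.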

Signed-off-by: Prarit Bhargava 
Cc: Thomas Gleixner 
Cc: John Stultz 
Cc: x...@kernel.org
Cc: Matt Fleming 
Cc: David Vrabel 
Cc: Andrew Morton 
Cc: Andi Kleen 
Cc: linux-...@vger.kernel.org
---
 drivers/rtc/rtc-sysfs.c | 86 +++--
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/drivers/rtc/rtc-sysfs.c b/drivers/rtc/rtc-sysfs.c
index b70e2bb..87d63d7 100644
--- a/drivers/rtc/rtc-sysfs.c
+++ b/drivers/rtc/rtc-sysfs.c
@@ -48,6 +48,46 @@ rtc_sysfs_show_date(struct device *dev, struct 
device_attribute *attr,
 }
 
 static ssize_t
+rtc_sysfs_store_date(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   ssize_t retval;
+   struct rtc_time tm;
+   char tmp[11] = "\0", *tmpptr = tmp;
+   char *cyear, *cmonth, *cday;
+   int year, month, day;
+
+   if (!capable(CAP_SYS_TIME))
+   return -EPERM;
+
+   /* from rtc_sysfs_show_date(): date string format is YYYY-MM-DD */
+   if (strlen(buf) != 11)
+   return -EINVAL;
+
+   strncpy(tmp, buf, 11);
+   cyear = strsep(&tmpptr, "-");
+   sscanf(cyear, "%i", &year);
+   cmonth = strsep(&tmpptr, "-");
+   sscanf(cmonth, "%i", &month);
+   cday = strsep(&tmpptr, "-");
+   sscanf(cday, "%i", &day);
+
+   retval = rtc_read_time(to_rtc_device(dev), &tm);
+   if (retval)
+   return -ENODEV;
+
+   tm.tm_year = year - 1900;
+   tm.tm_mon = month - 1;
+   tm.tm_mday = day;
+
+   retval = rtc_set_time(to_rtc_device(dev), &tm);
+   if (retval)
+   return retval;
+
+   return count;
+}
+
+static ssize_t
 rtc_sysfs_show_time(struct device *dev, struct device_attribute *attr,
char *buf)
 {
@@ -64,6 +104,46 @@ rtc_sysfs_show_time(struct device *dev, struct 
device_attribute *attr,
 }
 
 static ssize_t
+rtc_sysfs_store_time(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   ssize_t retval;
+   struct rtc_time tm;
+   char tmp[9] = "\0", *tmpptr = tmp;
+   char *chour, *cmin, *csec;
+   int hour, min, sec;
+
+   if (!capable(CAP_SYS_TIME))
+   return -EPERM;
+
+   /* from rtc_sysfs_show_time(): time string format is HH:MM:SS */
+   if (strlen(buf) != 9)
+   return -EINVAL;
+
+   strncpy(tmp, buf, 9);
+   chour = strsep(&tmpptr, ":");
+   sscanf(chour, "%i", &hour);
+   cmin = strsep(&tmpptr, ":");
+   sscanf(cmin, "%i", &min);
+   csec = strsep(&tmpptr, ":");
+   sscanf(csec, "%i", &sec);
+
+   retval = rtc_read_time(to_rtc_device(dev), &tm);
+   if (retval)
+   return -ENODEV;
+
+   tm.tm_hour = hour;
+   tm.tm_min = min;
+   tm.tm_sec = sec;
+
+   retval = rtc_set_time(to_rtc_device(dev), &tm);
+   if (retval)
+   return retval;
+
+   return count;
+}
+
+static ssize_t
 rtc_sysfs_show_since_epoch(struct device *dev, struct device_attribute *attr,
char *buf)
 {
@@ -124,8 +204,10 @@ rtc_sysfs_show_hctosys(struct device *dev, struct device_attribute *attr,
 
 static struct device_attribute rtc_attrs[] = {
__ATTR(name, S_IRUGO, rtc_sysfs_show_name, NULL),
-   __ATTR(date, S_IRUGO, rtc_sysfs_show_date, NULL),
-   __ATTR(time, S_IRUGO, rtc_sysfs_show_time, NULL),
+   __ATTR(date, S_IRUGO | S_IWUGO, rtc_sysfs_show_date,
+  rtc_sysfs_store_date),
+   __ATTR(time, S_IRUGO | S_IWUGO, rtc_sysfs_show_time,
+  rtc_sysfs_store_time),
__ATTR(since_epoch, S_IRUGO, rtc_sysfs_show_since_epoch, NULL),
__ATTR(max_user_freq, S_IRUGO | S_IWUSR, rtc_sysfs_show_max_user_freq,
rtc_sysfs_set_max_user_freq),
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
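
The store handlers in the patch above split the input with strsep() and convert
each field with sscanf("%i").  One thing worth checking in code of this shape is
that "%i" auto-detects the base in standard C, so a zero-padded field such as
"08" is read as octal and conversion stops at the 8.  A minimal userspace sketch
of the same parse using "%d" follows.  parse_date() is a hypothetical helper
written for illustration; it is not part of the patch or of the RTC API.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the patch): parse "YYYY-MM-DD", optionally
 * followed by one newline.  Returns 0 on success, -1 on malformed input.
 */
int parse_date(const char *buf, int *year, int *month, int *day)
{
	int y, m, d, consumed = 0;

	assert(buf && year && month && day);

	/* %d is always decimal; %i would read "08" as octal and stop at '8'. */
	if (sscanf(buf, "%4d-%2d-%2d%n", &y, &m, &d, &consumed) != 3)
		return -1;
	if (buf[consumed] == '\n')
		consumed++;
	if (buf[consumed] != '\0')
		return -1;
	/* Coarse range check only; days per month are not validated. */
	if (y < 1900 || m < 1 || m > 12 || d < 1 || d > 31)
		return -1;

	*year = y;
	*month = m;
	*day = d;
	return 0;
}
```

The parsed values would then be converted the same way the patch does it
(tm_year = year - 1900, tm_mon = month - 1).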

RCU NOHZ, tsc, and clock_gettime

2012-10-11 Thread Prarit Bhargava

I've been tracking an odd bug that may involve the RCU NOHZ code and
just want to know if you have any ideas on debugging and/or what might be
wrong.  Note the bug happens on *BOTH* upstream and the current RHEL6 tree.
The data in this email is from running on RHEL6 because that's what I happen
to be running ATM.  The result, however, is _identical_ to that of linux.git
latest.

The attached program compares userspace TSC reads to the time returned from
the REALTIME_CLOCK[1].  The test does the following

read tsc1
get REALTIME_CLOCK value
read tsc2

and then does a comparison between the tsc read and the REALTIME_CLOCK value
to see if they are in sync with each other.

[I'm leaving out the guts of the analysis here.  It is sufficient to show
examples of "good" data and "bad" data IMO.]

On a good run, we see little variance in between the values:

 0   144       0.1
 1   138       1.8
 2   147      -2.9
 3   138       0.6
 4   141       0.2
 5   138      -1.2
 6   147      -0.4
 7   144       0.4
 8   144      -0.6
 9   144      -0.9
10   144       2.1
11   138      -0.7
12   144      -0.4
13   144       0.5
14   144       0.6
15   144      -0.8
16   141       1.5
17   147       1.9
18   141      -0.1
19   141       0.7
20   144       0.3
21   144      -0.4
22   138       0.2
23   141      -2.1
24   144      -1.0
25   141       1.2
26   141      -0.5
27   144      -0.2
28   138       0.7
29   144      -0.6
n: 30, slope: 0.50 (1.99 GHz), dev: 1.1 ns, max: 2.9 ns


On a bad run, there is a lot of variance between the values:

 0   144    -346.0
 1   138    1410.8
 2   138    -806.9
 3   141    4006.6
 4   147   -3996.1
 5   138    -255.8
 6   144     -22.2
 7   138     -22.4
 8   144     102.7
 9   141     218.0
10   138     -11.6
11   138      -4.9
12   138     -26.2
13   138     -51.2
14   141    -280.0
15   144   -1120.5
16   144     -13.8
17   138     -15.9
18   144     -46.2
19   138     -46.7
20   138    -453.4
21   138    2062.7
22   141    -125.4
23   138    -453.4
24   138    1050.1
25   138    -643.6
26   138      14.3
27   138       7.7
28   138     -80.6
29   141     -50.3
n: 30, slope: 0.50 (1.99 GHz), dev: 1231.4 ns, max: 4006.6 ns
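
The analysis behind these summary lines is left out of the mail.  One plausible
reading, offered purely as an assumption, is a least-squares fit of the
clock_gettime() nanoseconds against TSC cycles: the slope is then in ns per
cycle (0.50 ns/cycle is roughly 2 GHz) and the per-sample values are residuals
from the fitted line.  The sketch below implements that reading.  struct
fit_stats and fit_tsc_to_ns() are hypothetical names, and "dev" here is the mean
absolute residual, which may well differ from what tstsc.c actually reports.

```c
#include <assert.h>
#include <stddef.h>

struct fit_stats {
	double slope;	/* fitted ns per TSC cycle */
	double dev;	/* mean absolute residual, ns */
	double max;	/* largest absolute residual, ns */
};

/* Least-squares line through (tsc[i], ns[i]).  Returns 0 on success. */
int fit_tsc_to_ns(const double *tsc, const double *ns, size_t n,
		  struct fit_stats *out)
{
	double mean_x = 0.0, mean_y = 0.0, sxx = 0.0, sxy = 0.0;
	double sum_abs = 0.0, max_abs = 0.0;
	size_t i;

	assert(tsc && ns && out);
	if (n < 2)
		return -1;

	for (i = 0; i < n; i++) {
		mean_x += tsc[i];
		mean_y += ns[i];
	}
	mean_x /= (double)n;
	mean_y /= (double)n;

	for (i = 0; i < n; i++) {
		double dx = tsc[i] - mean_x;

		sxx += dx * dx;
		sxy += dx * (ns[i] - mean_y);
	}
	if (sxx == 0.0)
		return -1;	/* all TSC samples identical: no slope */
	out->slope = sxy / sxx;

	/* Residual of each sample from the fitted line. */
	for (i = 0; i < n; i++) {
		double r = (ns[i] - mean_y) - out->slope * (tsc[i] - mean_x);

		if (r < 0.0)
			r = -r;
		sum_abs += r;
		if (r > max_abs)
			max_abs = r;
	}
	out->dev = sum_abs / (double)n;
	out->max = max_abs;
	return 0;
}
```

Under this reading, a good run is one where every sample sits within a few ns of
the line, and a bad run is one where individual samples land microseconds away
from it.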


It was noted by the bug reporter that specifying "nohz=off" resolved the
problem.  I tested with "nohz=off" and AFAICT it fixes the issue.  I started
out debugging by assuming that delays in the c-state transitions were not being
properly accounted for in the timing calculations.

I ran a baseline test on an unmodified kernel (with no extra boot options) and
confirmed that powertop shows the CPUs entering deep c-states while the test was
running for 300 runs.

I then instrumented the PM QoS and the power management code (specifically
cpuidle).  I put in a large # of printk's to monitor the CPU transitions, and
monitored the power states via powertop in order to verify that the system was
behaving correctly wrt PM QoS.

If you modify the tstsc script to run 300 times with this modified kernel, and
run powertop in the middle of the script, you will see that the processors do
NOT enter deep c-states.  **This means that PM QoS is doing its job correctly**.

After this I decided to use "idle=poll" as a boot parameter.  This kernel
argument forces a kernel polling loop on each cpu that should prevent the
cpu from entering a deep c-state.

When running with idle=poll, powertop indicates that the processors NEVER
enter a deep c-state (or at least AFAICS).  Running the tstsc test again
results in the *SAME* failure as with the regular kernel.

So this means, IMO, that the problem lies within some other aspect of the
NOHZ code.  I started backing out some pieces of the NOHZ code trying to
see what caused the problems with the test, and finally got to the RCU NOHZ
code.  (Yeah ... this wasn't the best thing to do ... and it results in an
almost nonfunctional kernel, especially with filesystems ...)

When I do that, surprisingly the problem goes away.  That is, the test
functions like it should.  I do not see any problems in the calculations and
at the same time I can confirm that I'm seeing c-state transitions via
powertop.

I've narrowed down a brute-force code removal to code in
dyntick_save_progress_counter() and rcu_implicit_dynticks_qs() but don't have
enough knowledge of NOHZ RCU to get my hands around the problem.  Admittedly I'm
still trying to wrap my head around dynticks and its usage by reading the code.
I was wondering if anyone might have an idea of what could be wrong?  I'm
certainly willing to continue to debug.

P.

[1]  Please note that switching to the MONOTONIC or MONOTONIC_RAW clocks
also results in the same problem.



/* gcc tstsc.c -o tstsc -lrt -lm */

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define SAMPLES 30
#define SAMPLES

Re: RCU NOHZ, tsc, and clock_gettime

2012-10-12 Thread Prarit Bhargava



On 10/11/2012 04:21 PM, Paul E. McKenney wrote:
> On Thu, Oct 11, 2012 at 12:51:44PM -0700, John Stultz wrote:
>> On 10/11/2012 11:52 AM, Prarit Bhargava wrote:
>>> I've been tracking an odd bug that may involve the RCU NOHZ code and
>>> just want to know if you have any ideas on debugging and/or what might be
>>> wrong.  Note the bug happens on *BOTH* upstream and the current RHEL6 tree.
>>> The data in this email is from running on RHEL6 because that's what I happen
>>> to be running ATM.  The result, however, is _identical_ to that of linux.git
>>> latest.
>>>
>>> The attached program compares userspace TSC reads to the time returned from
>>> the REALTIME_CLOCK[1].  The test does the following
>>>
>>> read tsc1
>>> get REALTIME_CLOCK value
>>> read tsc2
>>>
>>> and then does a comparison between the tsc read and the REALTIME_CLOCK value
>>> to see if they are in sync with each other.
>>>
>>> [I'm leaving out the guts of the analysis here.  It is sufficient to show
>>> examples of "good" data and "bad" data IMO.]
>>>
>>> On a good run, we see little variance in between the values:
>>>
>>> 0   144   0.1
>>> 1   138   1.8
>>> 2   147  -2.9
>> [snip]
>>>29   144  -0.6
>>> n: 30, slope: 0.50 (1.99 GHz), dev: 1.1 ns, max: 2.9 ns
>>>
>>>
>>> On a bad run, there is a lot of variance between the values:
>>>
>>> 0   144  -346.0
>>> 1   138  1410.8
>>> 2   138  -806.9
>>> 3   141  4006.6
>>> 4   147 -3996.1
>> [snip]
>>>29   141 -50.3
>>> n: 30, slope: 0.50 (1.99 GHz), dev: 1231.4 ns, max: 4006.6 ns
>>
>>
>> Do you see the same noisy variance when instead of doing:
>> rdtsc()
>> clock_gettime()
>> rdtsc()
>>
>> you do:
>> clock_gettime()
>> clock_gettime()
>>
>> And calculate the delta of the timestamp results?

I do not see the noisy variance when comparing clock_gettime() to 
clock_gettime().
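
For reference, the back-to-back clock_gettime() comparison John suggests could
look roughly like the sketch below.  timespec_delta_ns() and
max_gettime_delta_ns() are hypothetical helpers, and the use of CLOCK_MONOTONIC
is an assumption made here so that the delta cannot go negative; the original
test is described as using the realtime clock.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

/* Nanoseconds elapsed from *a to *b (hypothetical helper). */
long long timespec_delta_ns(const struct timespec *a, const struct timespec *b)
{
	return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL +
	       (long long)(b->tv_nsec - a->tv_nsec);
}

/*
 * Take `samples` back-to-back pairs of clock_gettime() readings and return
 * the largest delta seen, in nanoseconds, or -1 if a reading fails.
 */
long long max_gettime_delta_ns(int samples)
{
	struct timespec t1, t2;
	long long delta, max = 0;
	int i;

	assert(samples > 0);
	for (i = 0; i < samples; i++) {
		if (clock_gettime(CLOCK_MONOTONIC, &t1) ||
		    clock_gettime(CLOCK_MONOTONIC, &t2))
			return -1;
		delta = timespec_delta_ns(&t1, &t2);
		if (delta > max)
			max = delta;
	}
	return max;
}
```

Older C libraries may need -lrt at link time, as in the gcc line of the
original test program.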

>>
>> Also does this behavior change if you select different clocksources
>> on the system?

No, if the clocksource is the hpet, I still see the large variance. ie) the
behaviour does not change.

>>
>>> It was noted by the bug reporter that specifying "nohz=off" resolved the
>>> problem.  I tested with "nohz=off" and AFAICT it fixes the issue.  I started
>>> out debugging by assuming that delays in the c-state transitions were not
>>> being properly accounted for in the timing calculations.
>>>
>>> I ran a baseline test on an unmodified kernel (with no extra boot options)
>>> and confirmed that powertop shows the CPUs entering deep c-states while the
>>> test was running for 300 runs.
>>>
>>> I then instrumented the PM QoS and the power management code (specifically
>>> cpuidle).  I put in a large # of printk's to monitor the CPU transitions,
>>> and monitored the power states via powertop in order to verify that the
>>> system was behaving correctly wrt PM QoS.
>>>
>>> If you modify the tstsc script to run 300 times with this modified kernel,
>>> and run powertop in the middle of the script, you will see that the
>>> processors do NOT enter deep c-states.  **This means that PM QoS is doing
>>> its job correctly**.
>>
>> So its not clear here,  do you see the same noisier latencies when
>> using PM QoS to limit deep c-states?

I see the same noisier latencies when using PM QoS to limit the deep c-state
transitions.

>>
>> Finally, how many cpus are on the machine you see this with?  Does
>> the effect go away with maxcpus=1?


24 physical/48 logical, 2G/core RAM

The large variance is still there if maxcpus=1.

> 
> Also, what is the value of NR_CPUS?  And exactly which kernel.org kernel
> are you using?

NR_CPUS=4096

I'm using the "main" kernel.org tree,

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

as of commit 4f1cd91497774488ed16119ec3f54b3daf1561de

(I recompiled this morning and re-ran the tests and they still show a large
variance)

> 
> The effect of removing the two functions you noted (on 3.6 and earlier)
> is to prevent RCU from checking for dyntick-idle CPUs, likely incurring
> a cache miss for each CPU with interrupts disabled.  If you have a lot
>

Re: RCU NOHZ, tsc, and clock_gettime

2012-10-12 Thread Prarit Bhargava


> The effect of removing the two functions you noted (on 3.6 and earlier)
> is to prevent RCU from checking for dyntick-idle CPUs, likely incurring
> a cache miss for each CPU with interrupts disabled.  If you have a lot
> of CPUs (or even if NR_CPUS is large and you have a smaller number of
> CPUs), this can result in user-space-visible delays.
> 

Paul,

I built a kernel with NR_CPUS=48 and booted on a 48 cpu (logical) system.  I do
not see a difference in the test -- the variance is AFAICT just as large as if I
had run with NR_CPUS=4096.

P.

[PATCH] i7core_edac, Fix panic when accessing sysfs files

2012-10-16 Thread Prarit Bhargava

The i7core_edac addrmatch_dev and chancounts_dev have sysfs files
associated with them.  The sysfs files, however, are coded so that the
parent device is the mci device.  This is incorrect and the mci struct
should be obtained through the addrmatch_dev and chancounts_dev device's
private data field which is populated in i7core_create_sysfs_devices().

Signed-off-by: Prarit Bhargava 
Cc: Mauro Chehab 
---
 drivers/edac/i7core_edac.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 3672101..10c8c00 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -816,7 +816,7 @@ static ssize_t i7core_inject_store_##param( 
\
struct device_attribute *mattr, \
const char *data, size_t count) \
 {  \
-   struct mem_ctl_info *mci = to_mci(dev); \
+   struct mem_ctl_info *mci = dev_get_drvdata(dev);\
struct i7core_pvt *pvt; \
long value; \
int rc; \
@@ -845,7 +845,7 @@ static ssize_t i7core_inject_show_##param(  
\
struct device_attribute *mattr, \
char *data) \
 {  \
-   struct mem_ctl_info *mci = to_mci(dev); \
+   struct mem_ctl_info *mci = dev_get_drvdata(dev);\
struct i7core_pvt *pvt; \
\
pvt = mci->pvt_info;\
@@ -1052,7 +1052,7 @@ static ssize_t i7core_show_counter_##param(   
\
struct device_attribute *mattr, \
char *data) \
 {  \
-   struct mem_ctl_info *mci = to_mci(dev); \
+   struct mem_ctl_info *mci = dev_get_drvdata(dev);\
struct i7core_pvt *pvt = mci->pvt_info; \
\
edac_dbg(1, "\n");  \
-- 
1.7.12.1
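
To illustrate the distinction the patch relies on: a container_of()-style lookup
such as to_mci() is only valid when the device passed in is the one embedded in
the mci structure, whereas a driver-data pointer works for any device it was
stored on.  The userspace model below uses simplified stand-ins for struct
device, struct mem_ctl_info, dev_set_drvdata() and dev_get_drvdata().  They
mirror the kernel names for readability but are assumptions made for this
sketch, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, not the kernel's definitions. */
struct device {
	void *driver_data;
};

struct mem_ctl_info {
	int mc_idx;
	struct device dev;		/* embedded: container_of() is valid */
	struct device *addrmatch_dev;	/* separate object: it is not */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Only correct when dev is the mem_ctl_info's embedded device. */
struct mem_ctl_info *to_mci(struct device *dev)
{
	return container_of(dev, struct mem_ctl_info, dev);
}

void dev_set_drvdata(struct device *dev, void *data)
{
	assert(dev);
	dev->driver_data = data;
}

void *dev_get_drvdata(const struct device *dev)
{
	assert(dev);
	return dev->driver_data;
}
```

In this model, calling to_mci() on the separately allocated child device would
compute an address that has nothing to do with the mci, which is the kind of
mismatch that can end in a panic when the result is dereferenced.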


[PATCH RESEND] i7core_edac, Fix panic when accessing sysfs files

2012-10-16 Thread Prarit Bhargava

The i7core_edac addrmatch_dev and chancounts_dev have sysfs files
associated with them.  The sysfs files, however, are coded so that the
parent device is the mci device.  This is incorrect and the mci struct
should be obtained through the addrmatch_dev and chancounts_dev device's
private data field which is populated in i7core_create_sysfs_devices().

Signed-off-by: Prarit Bhargava 
Cc: Mauro Chehab 
Cc: linux-e...@vger.kernel.org
---
 drivers/edac/i7core_edac.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 3672101..10c8c00 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -816,7 +816,7 @@ static ssize_t i7core_inject_store_##param( 
\
struct device_attribute *mattr, \
const char *data, size_t count) \
 {  \
-   struct mem_ctl_info *mci = to_mci(dev); \
+   struct mem_ctl_info *mci = dev_get_drvdata(dev);\
struct i7core_pvt *pvt; \
long value; \
int rc; \
@@ -845,7 +845,7 @@ static ssize_t i7core_inject_show_##param(  
\
struct device_attribute *mattr, \
char *data) \
 {  \
-   struct mem_ctl_info *mci = to_mci(dev); \
+   struct mem_ctl_info *mci = dev_get_drvdata(dev);\
struct i7core_pvt *pvt; \
\
pvt = mci->pvt_info;\
@@ -1052,7 +1052,7 @@ static ssize_t i7core_show_counter_##param(   
\
struct device_attribute *mattr, \
char *data) \
 {  \
-   struct mem_ctl_info *mci = to_mci(dev); \
+   struct mem_ctl_info *mci = dev_get_drvdata(dev);\
struct i7core_pvt *pvt = mci->pvt_info; \
\
edac_dbg(1, "\n");  \
-- 
1.7.12.1


Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-11 Thread Prarit Bhargava

On 01/10/2013 10:48 PM, Rusty Russell wrote:
> Prarit Bhargava  writes:
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
>> from reserved chunk failed
> ...
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
> 
> Wow, what a cool bug!  Classic unforeseen side-effect.
> 
> I'd prefer not to do relocations with the module_lock held: it can be
> relatively slow.  Yet we can't do relocations before the per-cpu
> allocation, obviously.  Did you do boot timings before and after?

Heh ... I did! :)  I had a lot of concerns about moving the mutex around so I
put in print at the end of boot to see how long the boot time actually was.

From stock kernel:

[  22.893015] PRARIT: FINAL BOOT MESSAGE

From stock kernel + my patch:

[  22.673214] PRARIT: FINAL BOOT MESSAGE

Both kernel boots showed the problem with kvm loading.  A quick grep through my
bootlogs of stock kernel + my patch doesn't show anything greater than 23.539392
or less than 20.980321.  Those numbers are similar to the numbers from the
stock kernel (23.569450 - 20.898321).

ie) I don't think there's an increase due to calling the relocation under the
module mutex, and if there is it is definitely lost within the noise of boot.

The timings were similar.  I didn't see any huge delays, etc.  Can the
relocations really cause a long delay?  I thought we were pretty much writing
values to memory...

[I should point out that I'm booting a 32 physical/64 logical cpu system with
64GB of memory]

> 
> An alternative would be to put the module into the list even earlier
> (say, just after layout_and_allocate) so we could block on concurrent
> loads at that point.  But then we have to make sure no one looks in the
> module too early before it's completely set up, and that's complicated
> and error-prone too.  A separate list is kind of icky.

Yeah -- that was my first attempt actually, and it got very complex very
quickly.  I abandoned that approach in favor of moving the percpu allocations
under the lock.  I thought that was likely the easiest approach.

> 
> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
> allmodconfig build, there are only three modules with per-cpu data,
> totalling 328 bytes.  So it's not reasonable to increase that number to
> paper over this.

I've been thinking about that.  The problem is that at the same time the kvm
problem occurs I'm attempting to load a debug module that I've written to debug
some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
tomorrow if I have the need to increase the size of my log buffer.

... that is ;), I prefer your and my approach of fixing this problem.
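
As a back-of-the-envelope check on the numbers in this thread: with an 8k
reserve and 304 bytes per kvm instance, only 26 simultaneous allocations fit, so
on a 64-logical-cpu box a burst of 64 racing loads can exhaust the reserve on
its own if every loader allocates before the duplicate check.  The toy model
below makes that concrete.  It is a worst-case sketch under stated assumptions
(all loaders hold their allocation at once, alignment and other modules' usage
are ignored); struct reserve, reserve_alloc() and failed_allocs() are
hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define PERCPU_MODULE_RESERVE	((size_t)8 << 10)	/* 8k, as quoted in the thread */

struct reserve {
	size_t used;
};

/* Take size bytes from the reserve; -1 if it does not fit. */
int reserve_alloc(struct reserve *r, size_t size)
{
	assert(r);
	if (r->used + size > PERCPU_MODULE_RESERVE)
		return -1;
	r->used += size;
	return 0;
}

/*
 * Number of failed allocations when `loaders` racing loads of one module
 * each want `size` bytes.  With alloc_first set, every loader allocates
 * before the duplicate check; otherwise only the first loader allocates.
 */
unsigned int failed_allocs(unsigned int loaders, size_t size, int alloc_first)
{
	struct reserve r = { 0 };
	unsigned int i, failed = 0;

	for (i = 0; i < loaders; i++) {
		if (!alloc_first && i > 0)
			continue;	/* duplicate detected before allocating */
		if (reserve_alloc(&r, size))
			failed++;
	}
	return failed;
}
```

In the allocate-first case the failures in this model all come from kvm itself,
but in a real boot any other module asking for percpu space at that moment would
hit the same empty reserve.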

> 
> This is what a new boot state looks like (pains not to break ksplice).
> It's two patches, but I'll just post them back to back:
> 
> module: add new state MODULE_STATE_UNFORMED
> 
> You should never look at such a module, so it's excised from all paths
> which traverse the modules list.
> 
> We add the state at the end, to avoid gratuitous ABI break (ksplice).
> 
> Signed-off-by: Rusty Russell 
> 

Sure, but I'm always nervous about expanding any state machine ;).  That's just
me though :).

> 
> module: put modules in list much earlier.
> 
> Prarit's excellent bug report:
>> In recent Fedora releases (F17 & F18) some users have reported seeing
>> messages similar to
>>
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
>> reserved chunk failed
>>
>> during system boot.  In some cases, users have also reported seeing this
>> message along with a failed load of other modules.
>>
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu

Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-14 Thread Prarit Bhargava

On 01/11/2013 08:06 PM, Rusty Russell wrote:
> Prarit Bhargava  writes:
>> On 01/10/2013 10:48 PM, Rusty Russell wrote:
>> The timing were similar.  I didn't see any huge delays, etc.  Can the
>> relocations really cause a long delay?  I thought we were pretty much writing
>> values to memory...
> 
> For x86 that's true, but look at what ppc64 has to do for example.  I'm
> guessing you don't have a giant Nvidia proprietary driver module
> loading, either.

Ah -- I see.  I hadn't thought much about the other arches and I see what ppc64
does ...

> 
>> [I should point out that I'm booting a 32 physical/64 logical, with 64GB of 
>> memory]
> 
> I figured it had to be something big ;)

:)  Imagine what happens at 4096 cpus (SGI territory).  I'm wondering about that
kvm commit.  Maybe the systemd/udev rule needs to be rewritten to avoid a 'kvm
loading flood' during boot ... I'll talk with Kay Sievers about it to see if
there's a way around that.

> 
> OTOH, Tested-by: means it actually fixed someone's problem.

Got it.  For the record over-the-weekend testing didn't show any bizarre
results.  The boot times were all around 20-23 seconds.

Tested-by: Prarit Bhargava 

P.

Re: [PATCH 1/2] kvm, Add x86_hyper_kvm to complete detect_hypervisor_platform check [v2]

2012-07-06 Thread Prarit Bhargava



On 07/06/2012 07:27 AM, Marcelo Tosatti wrote:
> On Thu, Jul 05, 2012 at 09:37:00AM -0400, Prarit Bhargava wrote:
>>
>>
>> On 07/05/2012 09:26 AM, Avi Kivity wrote:
>>> Please copy at least k...@vger.kernel.org, and preferably Marcelo as well
>>> (the other kvm co-maintainer).
>>>
>>>
>>
>> While debugging I noticed that unlike all the other hypervisor code in the
>> kernel, kvm does not have an entry for x86_hyper which is used in
>> detect_hypervisor_platform() which results in a nice printk in the
>> syslog.  This is only really a stub function but it
>> does make kvm more consistent with the other hypervisors.
>>
>> [v2]: add detect and _GPL export
>>
>> Signed-off-by: Prarit Bhargava 
>> Cc: Avi Kivity 
>> Cc: Gleb Natapov 
>> Cc: Alex Williamson 
>> Cc: Konrad Rzeszutek Wilk 
> 
> Looks good, please regenerate:
> 
> Hunk #1 FAILED at 39.
> Hunk #2 succeeded at 484 (offset 51 lines).
> 1 out of 2 hunks FAILED -- saving rejects to file
> arch/x86/kernel/kvm.c.rej

Oops.  Sorry about that Marcelo.  I didn't know about kvm next :(  My bad.
8<-

[PATCH 1/2] kvm, Add x86_hyper_kvm to complete detect_hypervisor_platform check 
[v3]

While debugging I noticed that unlike all the other hypervisor code in the
kernel, kvm does not have an entry for x86_hyper which is used in
detect_hypervisor_platform() which results in a nice printk in the
syslog.  This is only really a stub function but it
does make kvm more consistent with the other hypervisors.

[v2]: add detect and _GPL export
[v3]: patch against kvm next

Signed-off-by: Prarit Bhargava 
Cc: Avi Kivity 
Cc: Gleb Natapov 
Cc: Alex Williamson 
Cc: Konrad Rzeszutek Wilk 
Cc: Marcelo Tosatti
Cc: k...@vger.kernel.org
---
 arch/x86/include/asm/hypervisor.h |1 +
 arch/x86/kernel/cpu/hypervisor.c  |1 +
 arch/x86/kernel/kvm.c |   14 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/hypervisor.h b/arch/x86/include/asm/hypervisor.h
index 7a15153..b518c75 100644
--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -49,6 +49,7 @@ extern const struct hypervisor_x86 *x86_hyper;
 extern const struct hypervisor_x86 x86_hyper_vmware;
 extern const struct hypervisor_x86 x86_hyper_ms_hyperv;
 extern const struct hypervisor_x86 x86_hyper_xen_hvm;
+extern const struct hypervisor_x86 x86_hyper_kvm;

 static inline bool hypervisor_x2apic_available(void)
 {
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 755f64f..6d6dd7a 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -37,6 +37,7 @@ static const __initconst struct hypervisor_x86 * const
hypervisors[] =
 #endif
&x86_hyper_vmware,
&x86_hyper_ms_hyperv,
+   &x86_hyper_kvm,
 };

 const struct hypervisor_x86 *x86_hyper;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 75ab94c..299cf14 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 

 static int kvmapf = 1;

@@ -483,6 +484,19 @@ void __init kvm_guest_init(void)
 #endif
 }

+static bool __init kvm_detect(void)
+{
+   if (!kvm_para_available())
+   return false;
+   return true;
+}
+
+const struct hypervisor_x86 x86_hyper_kvm __refconst = {
+   .name   = "KVM",
+   .detect = kvm_detect,
+};
+EXPORT_SYMBOL_GPL(x86_hyper_kvm);
+
 static __init int activate_jump_labels(void)
 {
if (has_steal_clock) {
-- 
1.7.10.2
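
The shape of the mechanism this patch plugs into can be modelled in userspace as
a table of descriptors, each with a detect() callback, that is walked until one
matches.  The sketch below is that simplified model only: the struct mirrors the
two fields used in the patch, while fake_kvm_present, detect_platform() and the
detect functions are hypothetical stand-ins for kvm_para_available() and the
kernel's own detection code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the descriptor; only the two fields the patch fills in. */
struct hypervisor_x86 {
	const char *name;
	bool (*detect)(void);
};

/* Stand-in for kvm_para_available(); set by the caller in this model. */
bool fake_kvm_present;

static bool vmware_detect(void)
{
	return false;
}

static bool kvm_detect(void)
{
	return fake_kvm_present;
}

const struct hypervisor_x86 x86_hyper_vmware = {
	.name	= "VMware",
	.detect	= vmware_detect,
};

const struct hypervisor_x86 x86_hyper_kvm = {
	.name	= "KVM",
	.detect	= kvm_detect,
};

static const struct hypervisor_x86 * const hypervisors[] = {
	&x86_hyper_vmware,
	&x86_hyper_kvm,
};

/* Return the first entry whose detect() succeeds, or NULL if none does. */
const struct hypervisor_x86 *detect_platform(void)
{
	size_t i;

	for (i = 0; i < sizeof(hypervisors) / sizeof(hypervisors[0]); i++) {
		assert(hypervisors[i]->detect);
		if (hypervisors[i]->detect())
			return hypervisors[i];
	}
	return NULL;
}
```

With an entry in the table, the caller can print the matched name, which is the
syslog message the patch description refers to.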


[PATCH] module, fix percpu reserved memory exhaustion

2013-01-09 Thread Prarit Bhargava

Rusty,

There is likely some subtlety of moving the module mutex that I'm unaware of.
What I can say is that this patch seems to resolve the problem for me, or at
least through 100+ reboots I have not seen the problem (I'm still testing
as I write this).

I'm more than willing to hear an alternative approach, or test an alternative
patch.

Thanks,

P.

8<

In recent Fedora releases (F17 & F18) some users have reported seeing
messages similar to

[   15.478121] Pid: 727, comm: systemd-udevd Tainted: GF 3.8.0-rc2+ #1
[   15.478121] Call Trace:
[   15.478131]  [] pcpu_alloc+0xa01/0xa60
[   15.478137]  [] ? printk+0x61/0x63
[   15.478140]  [] __alloc_reserved_percpu+0x13/0x20
[   15.478145]  [] load_module+0x1dc2/0x20b0
[   15.478150]  [] ? do_page_fault+0xe/0x10
[   15.478152]  [] ? page_fault+0x28/0x30
[   15.478155]  [] sys_init_module+0xd7/0x120
[   15.478159]  [] system_call_fastpath+0x16/0x1b
[   15.478160] kvm: Could not allocate 304 bytes percpu data
[   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from 
reserved chunk failed

during system boot.  In some cases, users have also reported seeing this
message along with a failed load of other modules.

As the message indicates, the reserved chunk of percpu memory (where
modules allocate their memory) is exhausted.  A debug printk inserted in
the code shows

[   15.478533] PRARIT size = 304 > chunk->contig_hint = 208

ie) the reserved chunk of percpu has only 208 bytes of available space.

What is happening is systemd is loading an instance of the kvm module for
each cpu found (see commit e9bda3b).  When the module load occurs the kernel
currently allocates the modules percpu data area prior to checking to see
if the module is already loaded or is in the process of being loaded.  If
the module is already loaded, or finishes load, the module loading code
releases the current instance's module's percpu data.

The problem is that these module loads race and it is possible that all of
the percpu reserved area is consumed by repeated loads of the same module
which results in the failure of other drivers to load.

This patch moves the module percpu allocation after the check for an
existing instance of the module.

Signed-off-by: Prarit Bhargava 
Cc: Rusty Russell 
Cc: Mike Galbraith 
---
 kernel/module.c |  124 ++-
 1 file changed, 85 insertions(+), 39 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 250092c..e7e9b57 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1929,6 +1929,27 @@ static int verify_export_symbols(struct module *mod)
return 0;
 }
 
+static void simplify_percpu_symbols(struct module *mod,
+   const struct load_info *info)
+{
+   Elf_Shdr *symsec = &info->sechdrs[info->index.sym];
+   Elf_Sym *sym = (void *)symsec->sh_addr;
+   unsigned long secbase;
+   unsigned int i;
+
+   /*
+* No need for error checking in this function because
+* simplify_symbols has already been called.
+*/
+   for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
+   /* Divert to percpu allocation if a percpu var. */
+   if (sym[i].st_shndx == info->index.pcpu) {
+   secbase = (unsigned long)mod_percpu(mod);
+   sym[i].st_value += secbase;
+   }
+   }
+}
+
 /* Change all symbols so that st_value encodes the pointer directly. */
 static int simplify_symbols(struct module *mod, const struct load_info *info)
 {
@@ -1976,12 +1997,11 @@ static int simplify_symbols(struct module *mod, const 
struct load_info *info)
break;
 
default:
-   /* Divert to percpu allocation if a percpu var. */
-   if (sym[i].st_shndx == info->index.pcpu)
-   secbase = (unsigned long)mod_percpu(mod);
-   else
+   /* percpu diverts handled in simplify_percpu_symbols */
+   if (sym[i].st_shndx != info->index.pcpu) {
secbase = 
info->sechdrs[sym[i].st_shndx].sh_addr;
-   sym[i].st_value += secbase;
+   sym[i].st_value += secbase;
+   }
break;
}
}
@@ -2899,11 +2919,29 @@ int __weak module_frob_arch_sections(Elf_Ehdr *hdr,
return 0;
 }
 
+static int allocate_percpu(struct module *mod, struct load_info *info)
+{
+   Elf_Shdr *pcpusec;
+   int err;
+
+   pcpusec = &info->sechdrs[info->index.pcpu];
+   if (pcpusec->sh_size) {
+   /* We have a special allocation for this section. */
+   pr_debug("module %s attempting to percpu with size %d\n",
+

Re: [5/6 PATCH] Kprobes : Prevent possible race conditions ia64 changes

2005-07-08 Thread Prarit Bhargava


Keshavamurthy Anil S wrote:

On Fri, Jul 08, 2005 at 04:40:45PM +0530, Prasanna S Panchamukhi wrote:


Hi Anil,

I have updated the patch as per your comments to move routines
from jprobes.S to .kprobes.text section.

Please let me know if you have any issues.


Looks fine and tested it too on IA64 Tiger4 box and works as intended.
Acked-by: Anil S Keshavamurthy <[EMAIL PROTECTED]>



I ran this on 16p and a 64p ia64 systems and didn't see any issues.

P.


PATCH: Remove endflag in NMI smp_call_function call

2007-12-03 Thread Prarit Bhargava

(Wim, I'm not sure if you're the right person to get this or not.  Nothing else
 came close in the MAINTAINERS file )

'endflag' is globally defined -- Passing endflag into smp_call_function is no
longer necessary.

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

diff --git a/arch/x86/kernel/nmi_32.c b/arch/x86/kernel/nmi_32.c
index f5cc47c..50ba4e6 100644
--- a/arch/x86/kernel/nmi_32.c
+++ b/arch/x86/kernel/nmi_32.c
@@ -89,7 +89,7 @@ static int __init check_nmi_watchdog(void)
printk(KERN_INFO "Testing NMI watchdog ... ");
 
if (nmi_watchdog == NMI_LOCAL_APIC)
-   smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
+   smp_call_function(nmi_cpu_busy, NULL, 0, 0);
 
for_each_possible_cpu(cpu)
prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count;
diff --git a/arch/x86/kernel/nmi_64.c b/arch/x86/kernel/nmi_64.c
index a576fd7..0443068 100644
--- a/arch/x86/kernel/nmi_64.c
+++ b/arch/x86/kernel/nmi_64.c
@@ -97,7 +97,7 @@ int __init check_nmi_watchdog (void)
 
 #ifdef CONFIG_SMP
if (nmi_watchdog == NMI_LOCAL_APIC)
-   smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
+   smp_call_function(nmi_cpu_busy, NULL, 0, 0);
 #endif
 
for (cpu = 0; cpu < NR_CPUS; cpu++)

[PATCH RFC]: DEBUG for PCI IO & MEM allocation

2005-03-14 Thread Prarit Bhargava

Colleagues,
Over the past few years I've been heavily involved in two projects that 
deal with PCI HotPlug.  While doing this work, one area of code always 
seems to require printk's to debug through -- the allocation & freeing 
of IO & MEM resources.

I've discovered many bugs surrounding Hotplug PCI IO & MEM allocations 
that would have been much easier to debug had there been printk's 
existing within the code, including one recently which impacts all of 
PCI Hotplug and appears to have been around since pre-2.6.9 (a patch to 
fix this issue has already been reviewed & accepted by Greg Kroah). 

As Hotplug continues to mature and more Hotplug drivers are introduced, 
I suspect that more and more bugs will be introduced in the resource space.

I propose the following patch to add a compile time DEBUG option to 
kernel/resource.c that would help in analyzing problems in this area.  
It's a few simple lines of output in __request_resource,
__release_resource, __request_region, and __release_region.
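
For readers who have not seen the pattern, here is a minimal userspace sketch of a compile-time debug switch wrapped around a sorted, non-overlapping range list. It is an illustration only, not the kernel/resource.c code: `struct range`, `request_range` and `RANGE_DEBUG` are hypothetical names, and `fprintf` stands in for `printk`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical switch: build with -DRANGE_DEBUG=1 to compile the traces in. */
#ifndef RANGE_DEBUG
#define RANGE_DEBUG 0
#endif

#if RANGE_DEBUG
#define DEBUGP(...) fprintf(stderr, __VA_ARGS__)
#else
#define DEBUGP(...) do { } while (0)	/* arguments are never evaluated */
#endif

/* A busy range [start, end], kept in a list sorted by start address. */
struct range {
	unsigned long start, end;
	struct range *sibling;
};

/*
 * Insert 'new' into the sorted list at *head.  Returns NULL on success,
 * or the existing range that overlaps 'new'.
 */
static struct range *request_range(struct range **head, struct range *new)
{
	struct range **p = head;

	assert(new->start <= new->end);
	DEBUGP("%s: request 0x%lx-0x%lx\n", __func__, new->start, new->end);
	for (;;) {
		struct range *tmp = *p;

		if (!tmp || tmp->start > new->end) {
			/* Free slot before 'tmp' (or at the tail): link it in. */
			new->sibling = tmp;
			*p = new;
			DEBUGP("%s: allocated\n", __func__);
			return NULL;
		}
		p = &tmp->sibling;
		if (tmp->end < new->start)
			continue;	/* 'tmp' lies entirely below 'new' */
		DEBUGP("%s: conflict with 0x%lx-0x%lx\n", __func__,
		       tmp->start, tmp->end);
		return tmp;
	}
}
```

With the switch off, the DEBUGP calls cost nothing at runtime, which is the property the proposal relies on; with it on, every request, allocation and conflict leaves a trace.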

Thanks,
P.

= kernel/resource.c 1.26 vs edited =
--- 1.26/kernel/resource.c	2005-01-08 00:44:13 -05:00
+++ edited/kernel/resource.c	2005-03-14 17:14:36 -05:00
@@ -20,6 +20,11 @@
 #include 
 #include 
 
+#if 0
+#define DEBUGP printk
+#else
+#define DEBUGP(fmt , a...)
+#endif
 
 struct resource ioport_resource = {
 	.name	= "PCI IO",
@@ -155,6 +160,8 @@
 	unsigned long end = new->end;
 	struct resource *tmp, **p;
 
+	DEBUGP("%s: resource request at 0x%lx-0x%lx\n", __FUNCTION__, new->start, new->end);
+
 	if (end < start)
 		return root;
 	if (start < root->start)
@@ -168,11 +175,13 @@
 			new->sibling = tmp;
 			*p = new;
 			new->parent = root;
+			DEBUGP("%s: resource allocated\n", __FUNCTION__);
 			return NULL;
 		}
 		p = &tmp->sibling;
 		if (tmp->end < start)
 			continue;
+		DEBUGP("%s: resource conflicted with 0x%lx-0x%lx\n", __FUNCTION__, tmp->start, tmp->end);
 		return tmp;
 	}
 }
@@ -181,6 +190,7 @@
 {
 	struct resource *tmp, **p;
 
+	DEBUGP("%s: resource release for 0x%lx-0x%lx\n", __FUNCTION__, old->start, old->end);
 	p = &old->parent->child;
 	for (;;) {
 		tmp = *p;
@@ -189,10 +199,12 @@
 		if (tmp == old) {
 			*p = tmp->sibling;
 			old->parent = NULL;
+			DEBUGP("%s: resource free'd\n", __FUNCTION__);
 			return 0;
 		}
 		p = &tmp->sibling;
 	}
+	DEBUGP("%s: resource cannot be released\n", __FUNCTION__);
 	return -EINVAL;
 }
 
@@ -432,6 +444,7 @@
 {
 	struct resource *res = kmalloc(sizeof(*res), GFP_KERNEL);
 
+	DEBUGP("%s: requesting region 0x%lx - 0x%lx\n", __FUNCTION__, start, start + n - 1);
 	if (res) {
 		memset(res, 0, sizeof(*res));
 		res->name = name;
@@ -445,8 +458,10 @@
 			struct resource *conflict;
 
 			conflict = __request_resource(parent, res);
-			if (!conflict)
+			if (!conflict) {
+				DEBUGP("%s: region assigned\n", __FUNCTION__);
 				break;
+			}
 			if (conflict != parent) {
 				parent = conflict;
 				if (!(conflict->flags & IORESOURCE_BUSY))
@@ -454,6 +469,8 @@
 			}
 
 			/* Uhhuh, that didn't work out.. */
+			DEBUGP("%s: request for region 0x%lx - 0x%lx failed\n", __FUNCTION__, res->start, res->end);
 			kfree(res);
 			res = NULL;
 			break;
@@ -504,6 +521,7 @@
 				break;
 			*p = res->sibling;
 			write_unlock(&resource_lock);
+			DEBUGP("%s: releasing region 0x%lx - 0x%lx\n", __FUNCTION__, res->start, res->end);
 			kfree(res);
 			return;
 		}
@@ -512,6 +530,7 @@
 
 	write_unlock(&resource_lock);
 
+	DEBUGP("%s: release regions  0x%lx - 0x%lx failed\n", __FUNCTION__, start, end);
 	printk(KERN_WARNING "Trying to free nonexistent resource <%08lx-%08lx>\n", start, end);
 }

Re: [PATCH RFC]: DEBUG for PCI IO & MEM allocation

2005-03-21 Thread Prarit Bhargava

Thanks Andrew.

> Shouldn't this also be printing the ->name of the new resource?

> A lot of the statements which you're adding will look screwy in an 80-col
> xterm.  Please wrap 'em.

-- new patch with Andrew's comments fixed.

P.
Index: io_debug/kernel/resource.c
===
RCS file: /usr/local/src/cvsroot/bk/linux-2.5/kernel/resource.c,v
retrieving revision 1.1.1.1
diff -u -1 -0 -p -r1.1.1.1 resource.c
--- io_debug/kernel/resource.c  16 Mar 2005 18:48:54 -  1.1.1.1
+++ io_debug/kernel/resource.c  21 Mar 2005 20:00:16 -
@@ -13,20 +13,25 @@
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 
+#if 0
+#define DEBUGP printk
+#else
+#define DEBUGP(fmt , a...)
+#endif
 
 struct resource ioport_resource = {
.name   = "PCI IO",
.start  = 0x,
.end= IO_SPACE_LIMIT,
.flags  = IORESOURCE_IO,
 };
 
 EXPORT_SYMBOL(ioport_resource);
 
@@ -148,58 +153,70 @@ __initcall(ioresources_init);
 
 #endif /* CONFIG_PROC_FS */
 
 /* Return the conflict entry if you can't request it */
 static struct resource * __request_resource(struct resource *root, struct 
resource *new)
 {
unsigned long start = new->start;
unsigned long end = new->end;
struct resource *tmp, **p;
 
+   DEBUGP("%s: %s resource request at 0x%lx-0x%lx\n", __FUNCTION__, 
+  new->name, new->start, new->end);
+
if (end < start)
return root;
if (start < root->start)
return root;
if (end > root->end)
return root;
p = &root->child;
for (;;) {
tmp = *p;
if (!tmp || tmp->start > end) {
new->sibling = tmp;
*p = new;
new->parent = root;
+   DEBUGP("%s: %s resource allocated\n", __FUNCTION__, 
+  new->name);
return NULL;
}
p = &tmp->sibling;
if (tmp->end < start)
continue;
+   DEBUGP("%s: %s resource conflicted with 0x%lx-0x%lx\n", 
+  __FUNCTION__, new->name, tmp->start, tmp->end);
return tmp;
}
 }
 
 static int __release_resource(struct resource *old)
 {
struct resource *tmp, **p;
 
+   DEBUGP("%s: %s resource release for 0x%lx-0x%lx\n", __FUNCTION__, 
+  old->name, old->start, old->end);
p = &old->parent->child;
for (;;) {
tmp = *p;
if (!tmp)
break;
if (tmp == old) {
*p = tmp->sibling;
old->parent = NULL;
+   DEBUGP("%s: %s resource released\n", __FUNCTION__, 
+  old->name);
return 0;
}
p = &tmp->sibling;
}
+   DEBUGP("%s: %s resource cannot be released\n", __FUNCTION__, old->name);
return -EINVAL;
 }
 
 int request_resource(struct resource *root, struct resource *new)
 {
struct resource *conflict;
 
write_lock(&resource_lock);
conflict = __request_resource(root, new);
write_unlock(&resource_lock);
@@ -425,42 +442,49 @@ EXPORT_SYMBOL(adjust_resource);
  * Request-region creates a new busy region.
  *
  * Check-region returns non-zero if the area is already busy
  *
  * Release-region releases a matching busy region.
  */
 struct resource * __request_region(struct resource *parent, unsigned long 
start, unsigned long n, const char *name)
 {
struct resource *res = kmalloc(sizeof(*res), GFP_KERNEL);
 
+   DEBUGP("%s: %s requesting region 0x%lx - 0x%lx\n", __FUNCTION__, 
+  name, start, start + n - 1);
if (res) {
memset(res, 0, sizeof(*res));
res->name = name;
res->start = start;
res->end = start + n - 1;
res->flags = IORESOURCE_BUSY;
 
write_lock(&resource_lock);
 
for (;;) {
struct resource *conflict;
 
conflict = __request_resource(parent, res);
-   if (!conflict)
+   if (!conflict) {
+   DEBUGP("%s: %s region assigned\n", __FUNCTION__,
+  name);
break;
+   }
if (conflict != parent) {
parent = conflict;
if (!(conflict->flags & IORESOURCE_BUSY))
continue;
}
 
/* Uhhuh, that didn't work out.. */
+   DEBUGP("%s: %s request for region 0x%lx - 0x%lx fail\n",
+

Re: [PATCH][RFC]: Clean up resource allocation in i8042 driver

2005-02-14 Thread Prarit Bhargava

I didn't see a final ACK on this patch -- just checking for one :)
P.
Prarit Bhargava wrote:
I've taken into account Dmitry's comments (thanks Dmitry!) and 
generated a new patch.

Thanks,
P.
Jesse Barnes wrote:
On Friday, January 21, 2005 8:35 am, Vojtech Pavlik wrote:
 

No. But vacant ports usually return 0xff. The problem here is that 0xff
is a valid value for the status register, too. Fortunately this patch
checks for 0xff only after the timeout failed.
  

On PCs you'll get all 1s, but on some ia64 platforms and others, 
you'll take a hard machine check exception if you try to access 
non-existent memory (mmio, port space, or otherwise).

Jesse

 


= i8042.c 1.71 vs edited =
--- 1.71/drivers/input/serio/i8042.c2005-01-03 08:11:49 -05:00
+++ edited/i8042.c  2005-01-21 11:50:11 -05:00
@@ -696,7 +696,10 @@
unsigned char param;
if (i8042_command(&param, I8042_CMD_CTL_TEST)) {
-   printk(KERN_ERR "i8042.c: i8042 controller self test timeout.\n");
+   if (i8042_read_status() != 0xFF)
+   printk(KERN_ERR "i8042.c: i8042 controller self test timeout.\n");
+   else
+   printk(KERN_ERR "i8042.c: no i8042 controller found.\n");
return -1;
}
@@ -1016,16 +1019,22 @@
i8042_aux_values.irq = I8042_AUX_IRQ;
i8042_kbd_values.irq = I8042_KBD_IRQ;
-   if (i8042_controller_init())
+   if (i8042_controller_init()) {
+   i8042_platform_exit();
return -ENODEV;
+   }
err = driver_register(&i8042_driver);
-   if (err)
+   if (err) {
+   i8042_platform_exit();
return err;
+   }
i8042_platform_device = platform_device_register_simple("i8042", -1, NULL, 0);
if (IS_ERR(i8042_platform_device)) {
driver_unregister(&i8042_driver);
+   i8042_platform_exit();
+   del_timer_sync(&i8042_timer);
return PTR_ERR(i8042_platform_device);
}
 


[PATCH][RFC]: Clean up resource allocation in i8042 driver

2005-01-21 Thread Prarit Bhargava

Hi,
The following patch cleans up resource allocations in the i8042 driver 
when initialization fails.
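
The patch below repeats the cleanup calls at each failure point. A common alternative for the same ordered unwinding is a goto ladder, sketched here in plain userspace C. This is an illustration and not what the patch does: every function and variable in it is a hypothetical stand-in, and the order of the stages is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the setup stages; each records its state. */
static bool timer_armed, platform_up, driver_registered;
static int fail_at;	/* test knob: stage (1..3) that should fail, 0 = none */

static void timer_start(void) { timer_armed = true; }
static void timer_stop(void) { timer_armed = false; }
static void platform_exit(void) { platform_up = false; }

static int platform_init(void)
{
	if (fail_at == 1)
		return -1;
	platform_up = true;
	return 0;
}

static int controller_init(void)
{
	return fail_at == 2 ? -1 : 0;
}

static int driver_register_stub(void)
{
	if (fail_at == 3)
		return -1;
	driver_registered = true;
	return 0;
}

/* Each failure jumps to the label that undoes everything set up so far. */
static int demo_init(void)
{
	int err;

	assert(fail_at >= 0 && fail_at <= 3);
	timer_start();

	err = platform_init();
	if (err)
		goto err_timer;

	err = controller_init();
	if (err)
		goto err_platform;

	err = driver_register_stub();
	if (err)
		goto err_platform;

	return 0;

err_platform:
	platform_exit();
err_timer:
	timer_stop();
	return err;
}
```

The labels run in reverse order of setup, so a new stage needs one new label rather than another copy of every earlier cleanup call.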

Please consider for tree application.  Patch is generated against 
current bk pull.

Thanks,
P.
Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>
= i8042.c 1.71 vs edited =
--- 1.71/drivers/input/serio/i8042.c2005-01-03 08:11:49 -05:00
+++ edited/i8042.c  2005-01-21 10:02:20 -05:00
@@ -696,7 +696,10 @@
   unsigned char param;
   if (i8042_command(&param, I8042_CMD_CTL_TEST)) {
-   printk(KERN_ERR "i8042.c: i8042 controller self test timeout.\n");
+   if (i8042_read_status() != 0xFF)
+   printk(KERN_ERR "i8042.c: i8042 controller self test timeout.\n");
+   else
+   printk(KERN_ERR "i8042.c: no i8042 controller found.\n");
   return -1;
   }
   }
@@ -1011,21 +1014,34 @@
   i8042_timer.function = i8042_timer_func;
   if (i8042_platform_init())
+   {
+   del_timer_sync(&i8042_timer);
   return -EBUSY;
+   }
   i8042_aux_values.irq = I8042_AUX_IRQ;
   i8042_kbd_values.irq = I8042_KBD_IRQ;
   if (i8042_controller_init())
+   {
+   i8042_platform_exit();
+   del_timer_sync(&i8042_timer);
   return -ENODEV;
+   }
   err = driver_register(&i8042_driver);
   if (err)
+   {
+   i8042_platform_exit();
+   del_timer_sync(&i8042_timer);
   return err;
+   }
   i8042_platform_device = platform_device_register_simple("i8042", -1, NULL, 0);
   if (IS_ERR(i8042_platform_device)) {
   driver_unregister(&i8042_driver);
+   i8042_platform_exit();
+   del_timer_sync(&i8042_timer);
   return PTR_ERR(i8042_platform_device);
   }


Re: [PATCH][RFC]: Clean up resource allocation in i8042 driver

2005-01-21 Thread Prarit Bhargava

I've taken into account Dmitry's comments (thanks Dmitry!) and generated 
a new patch.

Thanks,
P.
Jesse Barnes wrote:
On Friday, January 21, 2005 8:35 am, Vojtech Pavlik wrote:
 

No. But vacant ports usually return 0xff. The problem here is that 0xff
is a valid value for the status register, too. Fortunately this patch
checks for 0xff only after the timeout failed.
   

On PCs you'll get all 1s, but on some ia64 platforms and others, you'll take a 
hard machine check exception if you try to access non-existent memory (mmio, 
port space, or otherwise).

Jesse
 

= i8042.c 1.71 vs edited =
--- 1.71/drivers/input/serio/i8042.c2005-01-03 08:11:49 -05:00
+++ edited/i8042.c  2005-01-21 11:50:11 -05:00
@@ -696,7 +696,10 @@
unsigned char param;
 
if (i8042_command(&param, I8042_CMD_CTL_TEST)) {
-   printk(KERN_ERR "i8042.c: i8042 controller self test timeout.\n");
+   if (i8042_read_status() != 0xFF)
+   printk(KERN_ERR "i8042.c: i8042 controller self test timeout.\n");
+   else
+   printk(KERN_ERR "i8042.c: no i8042 controller found.\n");
return -1;
}
 
@@ -1016,16 +1019,22 @@
i8042_aux_values.irq = I8042_AUX_IRQ;
i8042_kbd_values.irq = I8042_KBD_IRQ;
 
-   if (i8042_controller_init())
+   if (i8042_controller_init()) {
+   i8042_platform_exit();
return -ENODEV;
+   }
 
err = driver_register(&i8042_driver);
-   if (err)
+   if (err) {
+   i8042_platform_exit();
return err;
+   }
 
i8042_platform_device = platform_device_register_simple("i8042", -1, NULL, 0);
if (IS_ERR(i8042_platform_device)) {
driver_unregister(&i8042_driver);
+   i8042_platform_exit();
+   del_timer_sync(&i8042_timer);
return PTR_ERR(i8042_platform_device);
}

[PATCH]: Stop bogus softlockup warnings in debug_show_all_locks (2nd try)

2007-09-05 Thread Prarit Bhargava

Trying this again with a wider audience.

Prevent bogus softlockup warnings when dumping lock debug info during a
sysrq-t.
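
As a rough illustration of why touching the watchdog helps, here is a toy model with one timestamp and a fixed threshold. It is a sketch under simplifying assumptions, not the kernel's softlockup code: `struct watchdog`, `long_dump` and the tick counts are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified watchdog: one timestamp and a fixed threshold. */
struct watchdog {
	unsigned long last_touch;	/* tick at which progress was last confirmed */
	unsigned long threshold;	/* ticks of silence tolerated before warning */
};

static void watchdog_touch(struct watchdog *wd, unsigned long now)
{
	wd->last_touch = now;
}

/* True if the watchdog would report a (possibly bogus) lockup at 'now'. */
static bool watchdog_expired(const struct watchdog *wd, unsigned long now)
{
	return now - wd->last_touch > wd->threshold;
}

/*
 * Simulate a long dump that takes 'ticks' ticks.  If 'touch_when_done' is
 * set, the dump resets the timestamp before returning, mirroring the touch
 * the patch adds just before dropping tasklist_lock.  Returns the end tick.
 */
static unsigned long long_dump(struct watchdog *wd, unsigned long start,
			       unsigned long ticks, bool touch_when_done)
{
	unsigned long end = start + ticks;

	assert(end >= start);	/* no wraparound in this toy model */
	if (touch_when_done)
		watchdog_touch(wd, end);
	return end;
}
```

In this model a dump longer than the threshold trips the check unless it touches the timestamp on the way out, even though the system was making progress the whole time.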

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 734da57..376b398 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3183,8 +3183,10 @@ retry:
printk("\n");
printk("=\n\n");
 
-   if (unlock)
+   if (unlock) {
+   touch_all_softlockup_watchdogs();
read_unlock(&tasklist_lock);
+   }
 }
 
 EXPORT_SYMBOL_GPL(debug_show_all_locks);

[PATCH] Fix race in efi variable delete code.

2007-01-22 Thread Prarit Bhargava

Fix race when deleting an EFI variable and issuing another EFI command on the
same variable.  The removal of the variable from the efivars_list should be
done in efivar_delete and not delayed until the kobject release.
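
The ordering can be illustrated with a small userspace sketch: the entry is unlinked from the list while the lock is held at delete time, and the release step only frees memory. All names here are hypothetical, and a C11 `atomic_flag` spinlock stands in for the kernel spinlock; it is a sketch of the idea, not the efivars code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct entry {
	char name[32];
	struct entry *next;
};

static struct entry *entries;			/* list head, guarded by list_lock */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;	/* toy spinlock */

static void lock(void) { while (atomic_flag_test_and_set(&list_lock)) { } }
static void unlock(void) { atomic_flag_clear(&list_lock); }

static struct entry *entry_add(const char *name)
{
	struct entry *e = calloc(1, sizeof(*e));

	assert(e && strlen(name) < sizeof(e->name));
	strcpy(e->name, name);
	lock();
	e->next = entries;
	entries = e;
	unlock();
	return e;
}

/* Lookup, as a second command on the same variable would perform. */
static struct entry *entry_find(const char *name)
{
	struct entry *e;

	lock();
	for (e = entries; e; e = e->next)
		if (strcmp(e->name, name) == 0)
			break;
	unlock();
	return e;
}

/* Delete: unlink while the lock is held, so no later lookup can see it. */
static struct entry *entry_delete(const char *name)
{
	struct entry **p, *e = NULL;

	lock();
	for (p = &entries; *p; p = &(*p)->next) {
		if (strcmp((*p)->name, name) == 0) {
			e = *p;
			*p = e->next;
			break;
		}
	}
	unlock();
	return e;	/* caller drops its reference later */
}

/* Release: runs when the last reference goes away; it only frees. */
static void entry_release(struct entry *e)
{
	free(e);
}
```

If the unlink were left to `entry_release`, a lookup arriving between delete and release would still find the entry, which is the window the patch closes.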

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

diff --git a/drivers/firmware/efivars.c b/drivers/firmware/efivars.c
index 5ab5e39..bf2ca97 100644
--- a/drivers/firmware/efivars.c
+++ b/drivers/firmware/efivars.c
@@ -385,10 +385,8 @@ static struct sysfs_ops efivar_attr_ops = {
 
 static void efivar_release(struct kobject *kobj)
 {
-   struct efivar_entry *var = container_of(kobj, struct efivar_entry, kobj);
-   spin_lock(&efivars_lock);
-   list_del(&var->list);
-   spin_unlock(&efivars_lock);
+   struct efivar_entry *var = container_of(kobj, struct efivar_entry,
+   kobj);
kfree(var);
 }
 
@@ -537,6 +535,9 @@ efivar_delete(struct subsystem *sub, const char *buf, size_t count)
spin_unlock(&efivars_lock);
return -EIO;
}
+
+   list_del(&search_efivar->list);
+
/* We need to release this lock before unregistering. */
spin_unlock(&efivars_lock);
 

[PATCH]: Remove __exit from mon_bin_exit & mon_text_exit.

2007-02-21 Thread Prarit Bhargava

Remove __exit from mon_bin_exit & mon_text_exit.  Both functions are used
in error code paths in __init functions.
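
A schematic view of the problem, assuming empty placeholder macros in place of the kernel's section annotations so that it builds as an ordinary program. The function names are hypothetical; the point is only that a teardown routine called from an init-time error path cannot live in a section that is discarded for built-in code.

```c
#include <assert.h>

/* Placeholders (an assumption of this sketch) for the kernel annotations. */
#define DEMO_INIT	/* kernel: __init, text freed after boot */
#define DEMO_EXIT	/* kernel: __exit, text discarded when built in */

static int text_ready, bin_ready;
static int bin_should_fail;	/* test knob */

static int DEMO_INIT text_init(void)
{
	text_ready = 1;
	return 0;
}

/* Deliberately not DEMO_EXIT: demo_init() calls this on its error path. */
static void text_exit(void)
{
	assert(text_ready);
	text_ready = 0;
}

static int DEMO_INIT bin_init(void)
{
	if (bin_should_fail)
		return -1;
	bin_ready = 1;
	return 0;
}

static int DEMO_INIT demo_init(void)
{
	int rc;

	rc = text_init();
	if (rc)
		return rc;

	rc = bin_init();
	if (rc) {
		text_exit();	/* init-time caller of a teardown function */
		return rc;
	}
	return 0;
}
```

Had `text_exit` carried the exit annotation in a built-in driver, the call inside `demo_init` would point at discarded text, which is what the MODPOST warnings below report.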

Resolves MODPOST warnings similar to:

`mon_bin_exit' referenced in section `.init.text' of drivers/built-in.o: 
defined in discarded section `.exit.text' of drivers/built-in.o
`mon_text_exit' referenced in section `.init.text' of drivers/built-in.o: 
defined in discarded section `.exit.text' of drivers/built-in.o
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

diff --git a/drivers/usb/mon/mon_bin.c b/drivers/usb/mon/mon_bin.c
index c01dfe6..b2bedd9 100644
--- a/drivers/usb/mon/mon_bin.c
+++ b/drivers/usb/mon/mon_bin.c
@@ -1165,7 +1165,7 @@ err_dev:
return rc;
 }
 
-void __exit mon_bin_exit(void)
+void mon_bin_exit(void)
 {
cdev_del(&mon_bin_cdev);
unregister_chrdev_region(mon_bin_dev0, MON_BIN_MAX_MINOR);
diff --git a/drivers/usb/mon/mon_text.c b/drivers/usb/mon/mon_text.c
index d38a127..494ee3b 100644
--- a/drivers/usb/mon/mon_text.c
+++ b/drivers/usb/mon/mon_text.c
@@ -520,7 +520,7 @@ int __init mon_text_init(void)
return 0;
 }
 
-void __exit mon_text_exit(void)
+void mon_text_exit(void)
 {
debugfs_remove(mon_dir);
 }
diff --git a/drivers/usb/mon/usb_mon.h b/drivers/usb/mon/usb_mon.h
index 4f949ce..efdfd89 100644
--- a/drivers/usb/mon/usb_mon.h
+++ b/drivers/usb/mon/usb_mon.h
@@ -57,9 +57,9 @@ void mon_text_del(struct mon_bus *mbus);
 // void mon_bin_add(struct mon_bus *);
 
 int __init mon_text_init(void);
-void __exit mon_text_exit(void);
+void mon_text_exit(void);
 int __init mon_bin_init(void);
-void __exit mon_bin_exit(void);
+void mon_bin_exit(void);
 
 /*
  * DMA interface.

[PATCH]: Fix devinit & devexit declarations in de2104x driver

2007-02-26 Thread Prarit Bhargava

Resending (originally sent 2007-02-14).

__devinit & __devexit cleanups for de2104x driver.

Fixes MODPOST warnings similar to:

WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to
.init.text:de_init_one from .data.rel.local after 'de_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to
.exit.text:de_remove_one from .data.rel.local after 'de_driver' (at offset 0x28)

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

--- linux-2.6.18.ia64.orig/drivers/net/tulip/de2104x.c  2006-09-19 
23:42:06.0 -0400
+++ linux-2.6.18.ia64/drivers/net/tulip/de2104x.c   2007-02-14 
13:43:55.0 -0500
@@ -1685,7 +1685,7 @@ static struct ethtool_ops de_ethtool_ops
.get_regs   = de_get_regs,
 };
 
-static void __init de21040_get_mac_address (struct de_private *de)
+static void __devinit de21040_get_mac_address (struct de_private *de)
 {
unsigned i;
 
@@ -1703,7 +1703,7 @@ static void __init de21040_get_mac_addre
}
 }
 
-static void __init de21040_get_media_info(struct de_private *de)
+static void __devinit de21040_get_media_info(struct de_private *de)
 {
unsigned int i;
 
@@ -1730,7 +1730,7 @@ static void __init de21040_get_media_inf
 }
 
 /* Note: this routine returns extra data bits for size detection. */
-static unsigned __init tulip_read_eeprom(void __iomem *regs, int location, int 
addr_len)
+static unsigned __devinit tulip_read_eeprom(void __iomem *regs, int location, 
int addr_len)
 {
int i;
unsigned retval = 0;
@@ -1765,7 +1765,7 @@ static unsigned __init tulip_read_eeprom
return retval;
 }
 
-static void __init de21041_get_srom_info (struct de_private *de)
+static void __devinit de21041_get_srom_info (struct de_private *de)
 {
unsigned i, sa_offset = 0, ofs;
u8 ee_data[DE_EEPROM_SIZE + 6] = {};
@@ -1926,7 +1926,7 @@ bad_srom:
goto fill_defaults;
 }
 
-static int __init de_init_one (struct pci_dev *pdev,
+static int __devinit de_init_one (struct pci_dev *pdev,
  const struct pci_device_id *ent)
 {
struct net_device *dev;
@@ -2082,7 +2082,7 @@ err_out_free:
return rc;
 }
 
-static void __exit de_remove_one (struct pci_dev *pdev)
+static void __devexit de_remove_one (struct pci_dev *pdev)
 {
struct net_device *dev = pci_get_drvdata(pdev);
struct de_private *de = dev->priv;
@@ -2160,7 +2160,7 @@ static struct pci_driver de_driver = {
.name   = DRV_NAME,
.id_table   = de_pci_tbl,
.probe  = de_init_one,
-   .remove = __exit_p(de_remove_one),
+   .remove = __devexit_p(de_remove_one),
 #ifdef CONFIG_PM
.suspend= de_suspend,
.resume = de_resume,

[PATCH]: Use stop_machine_run in the Intel RNG driver

2007-02-27 Thread Prarit Bhargava

Replace smp_call_function with stop_machine_run in the Intel RNG driver.

CPU A has done read_lock(&lock).
CPU B has done write_lock_irq(&lock) and is waiting for A to release the lock.

A third CPU, C, calls smp_call_function and issues the IPI.  CPU A takes CPU C's
IPI.  CPU B is waiting with interrupts disabled and does not see the IPI.
CPU C is stuck waiting for CPU B to respond to the IPI.

Deadlock.

The solution is to use stop_machine_run instead of smp_call_function
(smp_call_function should not be called in situations where the CPUs may
be suspended).
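
The scenario can be written down as a wait-for graph and checked for a cycle. The sketch below is one reading of the description, with hypothetical names throughout: B waits on A for the lock, C waits on B to acknowledge the IPI, and A is assumed to wait on C because the IPI handler removed by this patch (intel_init_wait) spins until the initiating CPU clears its flag.

```c
#include <assert.h>
#include <stdbool.h>

enum { CPU_A, CPU_B, CPU_C, NR_DEMO_CPUS, NOBODY = -1 };

/*
 * waits_for[x] names the CPU that x cannot make progress without,
 * or NOBODY if x is free to run.  Returns true if some CPU ends up
 * waiting, directly or indirectly, on itself.
 */
static bool has_deadlock(const int waits_for[NR_DEMO_CPUS])
{
	for (int start = 0; start < NR_DEMO_CPUS; start++) {
		int cur = start;

		/* A chain that returns to 'start' must do so within this many hops. */
		for (int hops = 0; hops <= NR_DEMO_CPUS; hops++) {
			cur = waits_for[cur];
			if (cur == NOBODY)
				break;
			assert(cur >= 0 && cur < NR_DEMO_CPUS);
			if (cur == start)
				return true;
		}
	}
	return false;
}
```

Filling the table with A waiting on C, B on A and C on B reports a deadlock; removing any single edge, for example by not having A spin inside an IPI handler, makes the check pass.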

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

--- linux-2.6.18.x86_64.orig/drivers/char/hw_random/intel-rng.c 2007-02-23 
06:55:34.0 -0500
+++ linux-2.6.18.x86_64/drivers/char/hw_random/intel-rng.c  2007-02-26 
04:59:36.0 -0500
@@ -24,10 +24,11 @@
  * warranty of any kind, whether express or implied.
  */
 
-#include 
+#include 
 #include 
+#include 
 #include 
-#include 
+#include 
 #include 
 
 
@@ -217,30 +218,117 @@ static struct hwrng intel_rng = {
.data_read  = intel_rng_data_read,
 };
 
+struct intel_rng_hw {
+   struct pci_dev *dev;
+   void __iomem *mem;
+   u8 bios_cntl_off;
+   u8 bios_cntl_val;
+   u8 fwh_dec_en1_off;
+   u8 fwh_dec_en1_val;
+};
+
+static int __init intel_rng_hw_init(void *_intel_rng_hw)
+{
+   struct intel_rng_hw *intel_rng_hw = _intel_rng_hw;
+   u8 mfc, dvc;
+
+   /* interrupts disabled in stop_machine_run call */
+
+   if (!(intel_rng_hw->fwh_dec_en1_val & FWH_F8_EN_MASK))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->fwh_dec_en1_off,
+ intel_rng_hw->fwh_dec_en1_val |
+ FWH_F8_EN_MASK);
+   if (!(intel_rng_hw->bios_cntl_val & BIOS_CNTL_WRITE_ENABLE_MASK))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->bios_cntl_off,
+ intel_rng_hw->bios_cntl_val |
+ BIOS_CNTL_WRITE_ENABLE_MASK);
+
+   writeb(INTEL_FWH_RESET_CMD, intel_rng_hw->mem);
+   writeb(INTEL_FWH_READ_ID_CMD, intel_rng_hw->mem);
+   mfc = readb(intel_rng_hw->mem + INTEL_FWH_MANUFACTURER_CODE_ADDRESS);
+   dvc = readb(intel_rng_hw->mem + INTEL_FWH_DEVICE_CODE_ADDRESS);
+   writeb(INTEL_FWH_RESET_CMD, intel_rng_hw->mem);
+
+   if (!(intel_rng_hw->bios_cntl_val &
+ (BIOS_CNTL_LOCK_ENABLE_MASK|BIOS_CNTL_WRITE_ENABLE_MASK)))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->bios_cntl_off,
+ intel_rng_hw->bios_cntl_val);
+   if (!(intel_rng_hw->fwh_dec_en1_val & FWH_F8_EN_MASK))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->fwh_dec_en1_off,
+ intel_rng_hw->fwh_dec_en1_val);
+
+   if (mfc != INTEL_FWH_MANUFACTURER_CODE ||
+   (dvc != INTEL_FWH_DEVICE_CODE_8M &&
+dvc != INTEL_FWH_DEVICE_CODE_4M)) {
+   printk(KERN_ERR PFX "FWH not detected\n");
+   return -ENODEV;
+   }
 
-#ifdef CONFIG_SMP
-static char __initdata waitflag;
+   return 0;
+}
 
-static void __init intel_init_wait(void *unused)
+static int __init intel_init_hw_struct(struct intel_rng_hw *intel_rng_hw,
+   struct pci_dev *dev)
 {
-   while (waitflag)
-   cpu_relax();
+   intel_rng_hw->bios_cntl_val = 0xff;
+   intel_rng_hw->fwh_dec_en1_val = 0xff;
+   intel_rng_hw->dev = dev;
+
+   /* Check for Intel 82802 */
+   if (dev->device < 0x2640) {
+   intel_rng_hw->fwh_dec_en1_off = FWH_DEC_EN1_REG_OLD;
+   intel_rng_hw->bios_cntl_off = BIOS_CNTL_REG_OLD;
+   } else {
+   intel_rng_hw->fwh_dec_en1_off = FWH_DEC_EN1_REG_NEW;
+   intel_rng_hw->bios_cntl_off = BIOS_CNTL_REG_NEW;
+   }
+
+   pci_read_config_byte(dev, intel_rng_hw->fwh_dec_en1_off,
+&intel_rng_hw->fwh_dec_en1_val);
+   pci_read_config_byte(dev, intel_rng_hw->bios_cntl_off,
+&intel_rng_hw->bios_cntl_val);
+
+   if ((intel_rng_hw->bios_cntl_val &
+(BIOS_CNTL_LOCK_ENABLE_MASK|BIOS_CNTL_WRITE_ENABLE_MASK))
+   == BIOS_CNTL_LOCK_ENABLE_MASK) {
+   static __initdata /*const*/ char warning[] =
+   KERN_WARNING PFX "Firmware space is locked read-only. "
+   KERN_WARNING PFX "If you can't or\n don't want to "
+   KERN_WA

Re: [PATCH]: Fix devinit & devexit declarations in de2104x driver

2007-02-27 Thread Prarit Bhargava




Jeff Garzik wrote:

Prarit Bhargava wrote:

Resending (originally sent 2007-02-14).

__devinit & __devexit cleanups for de2104x driver.

Fixes MODPOST warnings similar to:

WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to
.init.text:de_init_one from .data.rel.local after 'de_driver' (at 
offset 0x20)

WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to
.exit.text:de_remove_one from .data.rel.local after 'de_driver' (at 
offset 0x28)


Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>


doesn't apply to current kernel


Updated patch to latest git...

P.




__devinit & __devexit cleanups for de2104x driver.

Fixes MODPOST warnings similar to:

WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to
.init.text:de_init_one from .data.rel.local after 'de_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to
.exit.text:de_remove_one from .data.rel.local after 'de_driver' (at offset 0x28)

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

diff --git a/drivers/net/tulip/de2104x.c b/drivers/net/tulip/de2104x.c
index 9d67f11..55739c7 100644
--- a/drivers/net/tulip/de2104x.c
+++ b/drivers/net/tulip/de2104x.c
@@ -1685,7 +1685,7 @@ static const struct ethtool_ops de_ethtool_ops = {
.get_regs   = de_get_regs,
 };
 
-static void __init de21040_get_mac_address (struct de_private *de)
+static void __devinit de21040_get_mac_address (struct de_private *de)
 {
unsigned i;
 
@@ -1703,7 +1703,7 @@ static void __init de21040_get_mac_address (struct 
de_private *de)
}
 }
 
-static void __init de21040_get_media_info(struct de_private *de)
+static void __devinit de21040_get_media_info(struct de_private *de)
 {
unsigned int i;
 
@@ -1765,7 +1765,7 @@ static unsigned __devinit tulip_read_eeprom(void __iomem 
*regs, int location, in
return retval;
 }
 
-static void __init de21041_get_srom_info (struct de_private *de)
+static void __devinit de21041_get_srom_info (struct de_private *de)
 {
unsigned i, sa_offset = 0, ofs;
u8 ee_data[DE_EEPROM_SIZE + 6] = {};

Re: [PATCH]: Use stop_machine_run in the Intel RNG driver

2007-02-27 Thread Prarit Bhargava




Prarit Bhargava wrote:

Replace smp_call_function with stop_machine_run in the Intel RNG driver.

CPU A has done read_lock(&lock).
CPU B has done write_lock_irq(&lock) and is waiting for A to release the lock.

A third CPU, C, calls smp_call_function and issues the IPI.  CPU A takes CPU C's
IPI.  CPU B is waiting with interrupts disabled and does not see the IPI.
CPU C is stuck waiting for CPU B to respond to the IPI.

Deadlock.

The solution is to use stop_machine_run instead of smp_call_function
(smp_call_function should not be called in situations where the CPUs may
be suspended).


  


Updated patch to latest-and-greatest git ...

P.

Replace smp_call_function with stop_machine_run in the Intel RNG driver.

CPU A has done read_lock(&lock).
CPU B has done write_lock_irq(&lock) and is waiting for A to release the lock.

A third CPU, C, calls smp_call_function and issues the IPI.  CPU A takes CPU C's
IPI.  CPU B is waiting with interrupts disabled and does not see the IPI.
CPU C is stuck waiting for CPU B to respond to the IPI.

Deadlock.

The solution is to use stop_machine_run instead of smp_call_function
(smp_call_function should not be called in situations where the CPUs may
be suspended).

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>


diff --git a/drivers/char/hw_random/intel-rng.c 
b/drivers/char/hw_random/intel-rng.c
index cc1046e..ed1ef27 100644
--- a/drivers/char/hw_random/intel-rng.c
+++ b/drivers/char/hw_random/intel-rng.c
@@ -24,10 +24,11 @@
  * warranty of any kind, whether express or implied.
  */
 
-#include 
+#include 
 #include 
+#include 
 #include 
-#include 
+#include 
 #include 
 
 
@@ -217,30 +218,117 @@ static struct hwrng intel_rng = {
.data_read  = intel_rng_data_read,
 };
 
+struct intel_rng_hw {
+   struct pci_dev *dev;
+   void __iomem *mem;
+   u8 bios_cntl_off;
+   u8 bios_cntl_val;
+   u8 fwh_dec_en1_off;
+   u8 fwh_dec_en1_val;
+};
 
-#ifdef CONFIG_SMP
-static char __initdata waitflag;
+static int __init intel_rng_hw_init(void *_intel_rng_hw)
+{
+   struct intel_rng_hw *intel_rng_hw = _intel_rng_hw;
+   u8 mfc, dvc;
+
+   /* interrupts disabled in stop_machine_run call */
+
+   if (!(intel_rng_hw->fwh_dec_en1_val & FWH_F8_EN_MASK))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->fwh_dec_en1_off,
+ intel_rng_hw->fwh_dec_en1_val |
+ FWH_F8_EN_MASK);
+   if (!(intel_rng_hw->bios_cntl_val & BIOS_CNTL_WRITE_ENABLE_MASK))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->bios_cntl_off,
+ intel_rng_hw->bios_cntl_val |
+ BIOS_CNTL_WRITE_ENABLE_MASK);
+
+   writeb(INTEL_FWH_RESET_CMD, intel_rng_hw->mem);
+   writeb(INTEL_FWH_READ_ID_CMD, intel_rng_hw->mem);
+   mfc = readb(intel_rng_hw->mem + INTEL_FWH_MANUFACTURER_CODE_ADDRESS);
+   dvc = readb(intel_rng_hw->mem + INTEL_FWH_DEVICE_CODE_ADDRESS);
+   writeb(INTEL_FWH_RESET_CMD, intel_rng_hw->mem);
+
+   if (!(intel_rng_hw->bios_cntl_val &
+ (BIOS_CNTL_LOCK_ENABLE_MASK|BIOS_CNTL_WRITE_ENABLE_MASK)))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->bios_cntl_off,
+ intel_rng_hw->bios_cntl_val);
+   if (!(intel_rng_hw->fwh_dec_en1_val & FWH_F8_EN_MASK))
+   pci_write_config_byte(intel_rng_hw->dev,
+ intel_rng_hw->fwh_dec_en1_off,
+ intel_rng_hw->fwh_dec_en1_val);
 
-static void __init intel_init_wait(void *unused)
+   if (mfc != INTEL_FWH_MANUFACTURER_CODE ||
+   (dvc != INTEL_FWH_DEVICE_CODE_8M &&
+dvc != INTEL_FWH_DEVICE_CODE_4M)) {
+   printk(KERN_ERR PFX "FWH not detected\n");
+   return -ENODEV;
+   }
+
+   return 0;
+}
+
+static int __init intel_init_hw_struct(struct intel_rng_hw *intel_rng_hw,
+   struct pci_dev *dev)
 {
-   while (waitflag)
-   cpu_relax();
+   intel_rng_hw->bios_cntl_val = 0xff;
+   intel_rng_hw->fwh_dec_en1_val = 0xff;
+   intel_rng_hw->dev = dev;
+
+   /* Check for Intel 82802 */
+   if (dev->device < 0x2640) {
+   intel_rng_hw->fwh_dec_en1_off = FWH_DEC_EN1_REG_OLD;
+   intel_rng_hw->bios_cntl_off = BIOS_CNTL_REG_OLD;
+   } else {
+   intel_rng_hw->fwh_dec_en1_off = FWH_DEC_EN1_REG_NEW;
+   intel_rng_hw->bios_cntl_off = BIOS_CNTL_REG_NEW;
+   }
+
+   pci_read_config_byte(dev, intel_rng_hw->fwh_dec_en1_off,
+

[PATCH]: init to cpuinit in mtrr code

2007-02-28 Thread Prarit Bhargava

(Resending to wider audience)

__init to __cpuinit in mtrr code.

Resolves warnings similar to:

WARNING: vmlinux - Section mismatch: reference to .init.text:mtrr_bp_init from 
.text between 'identify_cpu' (at offset 0xc040b38e) and 'detect_ht'

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

diff --git a/arch/i386/kernel/cpu/mtrr/amd.c b/arch/i386/kernel/cpu/mtrr/amd.c
index 0949cdb..375752a 100644
--- a/arch/i386/kernel/cpu/mtrr/amd.c
+++ b/arch/i386/kernel/cpu/mtrr/amd.c
@@ -112,7 +112,7 @@ static struct mtrr_ops amd_mtrr_ops = {
.have_wrcomb   = positive_have_wrcomb,
 };
 
-int __init amd_init_mtrr(void)
+int __cpuinit amd_init_mtrr(void)
 {
set_mtrr_ops(&amd_mtrr_ops);
return 0;
diff --git a/arch/i386/kernel/cpu/mtrr/centaur.c 
b/arch/i386/kernel/cpu/mtrr/centaur.c
index cb9aa3a..8b61016 100644
--- a/arch/i386/kernel/cpu/mtrr/centaur.c
+++ b/arch/i386/kernel/cpu/mtrr/centaur.c
@@ -215,7 +215,7 @@ static struct mtrr_ops centaur_mtrr_ops = {
.have_wrcomb   = positive_have_wrcomb,
 };
 
-int __init centaur_init_mtrr(void)
+int __cpuinit centaur_init_mtrr(void)
 {
set_mtrr_ops(&centaur_mtrr_ops);
return 0;
diff --git a/arch/i386/kernel/cpu/mtrr/cyrix.c 
b/arch/i386/kernel/cpu/mtrr/cyrix.c
index 0737a59..df38d8c 100644
--- a/arch/i386/kernel/cpu/mtrr/cyrix.c
+++ b/arch/i386/kernel/cpu/mtrr/cyrix.c
@@ -370,7 +370,7 @@ static struct mtrr_ops cyrix_mtrr_ops = {
.have_wrcomb   = positive_have_wrcomb,
 };
 
-int __init cyrix_init_mtrr(void)
+int __cpuinit cyrix_init_mtrr(void)
 {
set_mtrr_ops(&cyrix_mtrr_ops);
return 0;
diff --git a/arch/i386/kernel/cpu/mtrr/generic.c b/arch/i386/kernel/cpu/mtrr/generic.c
index f77fc53..fd97f84 100644
--- a/arch/i386/kernel/cpu/mtrr/generic.c
+++ b/arch/i386/kernel/cpu/mtrr/generic.c
@@ -30,14 +30,14 @@ static __initdata int mtrr_show;
 module_param_named(show, mtrr_show, bool, 0);
 
 /*  Get the MSR pair relating to a var range  */
-static void __init
+static void __cpuinit
 get_mtrr_var_range(unsigned int index, struct mtrr_var_range *vr)
 {
rdmsr(MTRRphysBase_MSR(index), vr->base_lo, vr->base_hi);
rdmsr(MTRRphysMask_MSR(index), vr->mask_lo, vr->mask_hi);
 }
 
-static void __init
+static void __cpuinit
 get_fixed_ranges(mtrr_type * frs)
 {
unsigned int *p = (unsigned int *) frs;
@@ -60,7 +60,7 @@ static void __init print_fixed(unsigned base, unsigned step, const mtrr_type*typ
 }
 
 /*  Grab all of the MTRR state for this CPU into *state  */
-void __init get_mtrr_state(void)
+void __cpuinit get_mtrr_state(void)
 {
unsigned int i;
struct mtrr_var_range *vrs;
diff --git a/arch/i386/kernel/cpu/mtrr/main.c b/arch/i386/kernel/cpu/mtrr/main.c
index 0acfb6a..cdbca55 100644
--- a/arch/i386/kernel/cpu/mtrr/main.c
+++ b/arch/i386/kernel/cpu/mtrr/main.c
@@ -103,7 +103,7 @@ static int have_wrcomb(void)
 }
 
 /*  This function returns the number of variable MTRRs  */
-static void __init set_num_var_ranges(void)
+static void __cpuinit set_num_var_ranges(void)
 {
unsigned long config = 0, dummy;
 
@@ -116,7 +116,7 @@ static void __init set_num_var_ranges(void)
num_var_ranges = config & 0xff;
 }
 
-static void __init init_table(void)
+static void __cpuinit init_table(void)
 {
int i, max;
 
@@ -571,7 +571,7 @@ extern void amd_init_mtrr(void);
 extern void cyrix_init_mtrr(void);
 extern void centaur_init_mtrr(void);
 
-static void __init init_ifs(void)
+static void __cpuinit init_ifs(void)
 {
 #ifndef CONFIG_X86_64
amd_init_mtrr();
@@ -639,7 +639,7 @@ static struct sysdev_driver mtrr_sysdev_driver = {
  * initialized (i.e. before smp_init()).
  * 
  */
-void __init mtrr_bp_init(void)
+void __cpuinit mtrr_bp_init(void)
 {
init_ifs();
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH]: Fix __init declarations in Compaq SMART2 Controller driver

2007-02-28 Thread Prarit Bhargava

Fix __init declarations in Compaq SMART2 Controller driver.

Resolves MODPOST warnings similar to:

WARNING: drivers/block/cpqarray.o - Section mismatch: reference to
.init.text:cpqarray_init_one from .data.rel.local between 'cpqarray_pci_driver'
(at offset 0x20) and 'smart1_access'

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>

--- linux-2.6.18.ia64.orig/drivers/block/cpqarray.c 2007-02-14 11:36:20.0 -0500
+++ linux-2.6.18.ia64/drivers/block/cpqarray.c  2007-02-14 13:08:57.0 -0500
@@ -212,7 +212,7 @@ static struct proc_dir_entry *proc_array
  * Get us a file in /proc/array that says something about each controller.
  * Create /proc/array if it doesn't exist yet.
  */
-static void __init ida_procinit(int i)
+static void __devinit ida_procinit(int i)
 {
if (proc_array == NULL) {
proc_array = proc_mkdir("cpqarray", proc_root_driver);
@@ -390,7 +390,7 @@ static void __devexit cpqarray_remove_on
 }
 
 /* pdev is NULL for eisa */
-static int __init cpqarray_register_ctlr( int i, struct pci_dev *pdev)
+static int __devinit cpqarray_register_ctlr( int i, struct pci_dev *pdev)
 {
request_queue_t *q;
int j;
@@ -511,7 +511,7 @@ Enomem4:
return -1;
 }
 
-static int __init cpqarray_init_one( struct pci_dev *pdev,
+static int __devinit cpqarray_init_one( struct pci_dev *pdev,
const struct pci_device_id *ent)
 {
int i;

Re: [PATCH]: Use stop_machine_run in the Intel RNG driver

2007-03-01 Thread Prarit Bhargava




> I think what you're describing here is just the standard old
> smp_call_function() deadlock, rather than anything which is specific to
> intel-rng, yes?
>
> It is "well known" that you can't call smp_call_function() with local
> interrupts disabled.  In fact i386 will spit a warning if you try it.
>
> intel-rng doesn't do that, but what it _does_ do is:
>
> smp_call_function(..., wait = 0);
> local_irq_disable();
>
> so some CPUs will still be entering the IPI while this CPU has gone and
> disabled interrupts, thus exposing us to the deadlock, yes?
  


Not quite, Andrew.  This one is different and a little more complicated.

The deadlock occurs because two CPUs are in contention over a rw_lock
and the "call_function" puts the CPUs in such a state that no forward
progress will be made until the calling CPU has completed its code.


Here's a more detailed example (sorry for the cut-and-paste):

1.  CPU A has done read_lock(&lock), and has acquired the lock.
2.  CPU B has done write_lock_irq(&lock) and is waiting for A to release
the lock.  CPU B has disabled interrupts while waiting for the lock:


void __lockfunc _write_lock_irq(rwlock_t *lock)
{
   local_irq_disable();
   preempt_disable();
   rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
   _raw_write_lock(lock);
}

3.  CPU C issues smp_call_function, as in the case of the intel-rng driver:

   set_mb(waitflag, 1);
   smp_call_function(intel_init_wait, NULL, 1, 0);
   ...
   // do some stuff with interrupts disabled
   ...
   set_mb(waitflag, 0);

where

static char __initdata waitflag;

static void __init intel_init_wait(void *unused)
{
   while (waitflag)
      cpu_relax();
}

In this code the calling processor, C, has issued an IPI and disabled
interrupts on every processor except itself.  When each processor takes
the IPI it runs intel_init_wait and spins in a tight loop until
waitflag is zero, i.e. no forward progress on any CPU.


CPU C will not execute the code below the smp_call_function until all
processors have started (not completed!) the IPI function.  From
smp_call_function:


   cpus = num_online_cpus() - 1;
   ...

   /* Send a message to all other CPUs and wait for them to respond */
   send_IPI_allbutself(CALL_FUNCTION_VECTOR);

   /* Wait for response */
   while (atomic_read(&data.started) != cpus)
      cpu_relax();

So CPU C is waiting here.

4.  CPU A, which holds the lock, sees the IPI and is in the
intel_init_wait code, happily waiting.  CPU A has incremented
data.started.  CPU A will stay in this loop until CPU C sets waitflag = 0.


5.  CPU B, if you recall, is _waiting with interrupts disabled_ for CPU A
to release the lock.  It does not see the IPI because it has interrupts
disabled.  It will not see the IPI until CPU A has released the lock.


6.  CPU C is eventually waiting only for CPU B to do the final increment
that brings data.started up to cpus.  CPU B is waiting for CPU A to
release the lock.  CPU A is executing a tight loop which it will not
exit until CPU C can set waitflag to zero.


That's a 3-way deadlock.
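
For illustration only (this is not code from the driver or the kernel):
the wait-for relationships in steps 4-6 can be written down as a tiny
graph in plain userspace C.  The names in_wait_cycle, deadlock_case and
handler_returns_case are made up for this sketch, and it assumes each CPU
waits on at most one other CPU.

```c
#include <assert.h>

/* Hypothetical userspace model of the scenario; not kernel code. */
enum { CPU_A, CPU_B, CPU_C, NR_MODEL_CPUS };
#define WAITS_ON_NOBODY (-1)

/*
 * waits_on[x] names the CPU that must make progress before CPU x can
 * continue, or WAITS_ON_NOBODY if x is free to run.
 * Returns 1 if following the chain from 'start' leads back to 'start'.
 */
static int in_wait_cycle(const int *waits_on, int start)
{
	int cur = start;
	int steps;

	assert(start >= 0 && start < NR_MODEL_CPUS);

	for (steps = 0; steps < NR_MODEL_CPUS; steps++) {
		cur = waits_on[cur];
		if (cur == WAITS_ON_NOBODY)
			return 0;	/* chain ends at a CPU that can run */
		if (cur == start)
			return 1;	/* back where we began: deadlock */
	}
	return 0;	/* start leads into a cycle it is not part of */
}

/* Steps 4-6: A spins on waitflag (C), B on the lock (A), C on data.started (B). */
static const int deadlock_case[NR_MODEL_CPUS] = {
	[CPU_A] = CPU_C,
	[CPU_B] = CPU_A,
	[CPU_C] = CPU_B,
};

/* Same picture, but assuming the IPI function on A returns instead of spinning. */
static const int handler_returns_case[NR_MODEL_CPUS] = {
	[CPU_A] = WAITS_ON_NOBODY,
	[CPU_B] = CPU_A,
	[CPU_C] = CPU_B,
};
```

in_wait_cycle(deadlock_case, CPU_C) reports a cycle for any of the three
CPUs.  With handler_returns_case the chain from C ends at A, which in this
model is free to run and so can get on with releasing the lock.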

So, the issue is placing the other CPUs in a state in which they cannot
make forward progress.  The deadlock occurs before the calling CPU has
disabled interrupts in the code in step 3.


I also tested this code without the __init tags and with waitflag
explicitly set to 0, to rule out the error someone fixed last week in
which gcc zeroes only .bss section variables.  I also removed the code
that disabled interrupts on the calling processor, which had no effect --
at first I thought it was a simple interrupt issue ...


Maybe smp_call_function needs a written warning that the called function 
should not "suspend" CPUs?


P.
