Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-08-26 Thread Takao Indoh
On 2015/07/29 15:08, Alexander Shishkin wrote:
> Takao Indoh  writes:
> 
>> This patch provides Intel PT logging feature. When system boots with a
>> parameter "intel_pt_log", log buffers for Intel PT are allocated and
>> logging starts, then processor flow information is written in the log
>> buffer by hardware like flight recorder. This is very helpful to
>> investigate a cause of kernel panic.
>>
>> The log buffer size is specified by the parameter
>> "intel_pt_log_buf_len=". This buffer is used as circular buffer,
>> therefore old events are overwritten by new events.
> 
> [skip]
> 
>> +static void enable_pt(int enable)
>> +{
>> +u64 ctl;
>> +
>> +rdmsrl(MSR_IA32_RTIT_CTL, ctl);
> 
> Ideally, you shouldn't need this rdmsr(), because in this code you
> should know exactly which ctl bits you need set when you enable.
> 
>> +
>> +if (enable)
>> +ctl |= RTIT_CTL_TRACEEN;
>> +else
>> +ctl &= ~RTIT_CTL_TRACEEN;
>> +
>> +wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +}
> 
> But the bigger problem with this approach is that it duplicates the
> existing driver's functionality and some of the code, which just makes
> it harder to maintain, among other things.
> 
> Instead, we should be able to use the existing perf functionality to
> enable the system-wide tracing, so that it goes through the
> driver. Another thing to remember is that you'd also need some of the
> sideband data (vm mappings, context switches) to be able to properly
> decode the trace, which also can come from perf. And it'd also be much
> less code. The only missing piece is the code that would allocate the
> ring buffer for such events.

Alexander,

I checked the perf code to find out what kind of information is needed as
side-band data. It seems that the following two events are used:
 - sched:sched_switch
 - dummy (PERF_COUNT_SW_DUMMY)

So what I need to do is add kernel counters for three events
(intel_pt, sched:sched_switch, dummy). Is my understanding correct?
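
For illustration only, the attribute for the dummy sideband event might be
filled in roughly as below. This is a sketch written against the userspace
<linux/perf_event.h> header; an in-kernel user would pass a similar
struct perf_event_attr to perf_event_create_kernel_counter(). The function
name init_dummy_sideband_attr() is hypothetical, and which sideband bits a
crash-time decoder really needs is an assumption, not something settled in
this thread.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Fill in an attribute for the software "dummy" event. The event counts
 * nothing itself; it is opened only so that sideband records are emitted.
 */
void init_dummy_sideband_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));

	attr->type = PERF_TYPE_SOFTWARE;
	attr->size = sizeof(*attr);
	attr->config = PERF_COUNT_SW_DUMMY;

	/* Assumed set of sideband records; adjust to what the decoder needs. */
	attr->mmap = 1;          /* executable memory mappings */
	attr->comm = 1;          /* task name changes */
	attr->task = 1;          /* fork and exit notifications */
	attr->sample_id_all = 1; /* add id/time fields to non-sample records */
}
```

The sched:sched_switch event would need its own attribute of type
PERF_TYPE_TRACEPOINT, with config set to the tracepoint id.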

Thanks,
Takao Indoh

> 
> Something like:
> 
> static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);
> 
> static struct perf_event_attr perf_kdump_attr;
> 
> ...
> 
> static int perf_kdump_init(void)
> {
>  struct perf_event *event;
>  int cpu;
> 
>  get_online_cpus();
>  for_each_possible_cpu(cpu) {
>  event = perf_create_kernel_counter(&perf_kdump_attr,
>  cpu, NULL,
>  NULL, NULL);
> 
>   ...
> 
>  ret = rb_alloc_kernel(event, perf_kdump_data_size, 
> perf_kdump_aux_size);
> 
>  ...
>  
>  per_cpu(perf_kdump_event, cpu) = event;
>  }
>  put_online_cpus();
> }
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-29 Thread Alexander Shishkin
Takao Indoh  writes:

> Ok, I'm reading the code around perf_event_create_kernel_counter. It
> seems to work for my purpose, I'll try to update my patch with this.

Thank you.

Regards,
--
Alex


Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-29 Thread Takao Indoh
On 2015/07/29 18:09, Alexander Shishkin wrote:
> Takao Indoh  writes:
> 
>> On 2015/07/29 15:08, Alexander Shishkin wrote:
>>> Instead, we should be able to use the existing perf functionality to
>>> enable the system-wide tracing, so that it goes through the
>>
>> "existing driver" means PMU driver (perf_event_intel_pt.c)?
> 
> Yes.
> 
>> The feature of these patches is a sort of flight recorder. Once it
>> starts, never stop, not export anything to user, it just captures data
>> with minimum overhead in preparation for kernel panic. This usage is
>> different from perf and therefore I'm not sure whether this feature can
>> be implemented using perf infrastructure.
> 
> Why not? There is an established infrastructure for in-kernel perf
> events already, take a look at the nmi watchdog, for example.

Ok, I'm reading the code around perf_event_create_kernel_counter. It
seems to work for my purpose, I'll try to update my patch with this.

Thanks,
Takao Indoh

> 
>>> driver. Another thing to remember is that you'd also need some of the
>>> sideband data (vm mappings, context switches) to be able to properly
>>> decode the trace, which also can come from perf. And it'd also be much
>>> less code. The only missing piece is the code that would allocate the
>>> ring buffer for such events.
>>
>> The sideband data is needed if we want to reconstruct user program flow,
>> but is it needed to reconstruct kernel panic path?
> 
> You are not really interested in the panic path as much as events
> leading up to the panic and those usually have context, which is much
> easier to reconstruct with sideband info. Some of it you can reconstruct
> by walking kernel's data structures, but that is not reliable after the
> panic.
> 
> Regards,
> --
> Alex
> 




Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-29 Thread Alexander Shishkin
Takao Indoh  writes:

> On 2015/07/29 15:08, Alexander Shishkin wrote:
>> Instead, we should be able to use the existing perf functionality to
>> enable the system-wide tracing, so that it goes through the
>
> "existing driver" means PMU driver (perf_event_intel_pt.c)?

Yes.

> The feature of these patches is a sort of flight recorder. Once it
> starts, never stop, not export anything to user, it just captures data
> with minimum overhead in preparation for kernel panic. This usage is
> different from perf and therefore I'm not sure whether this feature can
> be implemented using perf infrastructure.

Why not? There is an established infrastructure for in-kernel perf
events already, take a look at the nmi watchdog, for example.

>> driver. Another thing to remember is that you'd also need some of the
>> sideband data (vm mappings, context switches) to be able to properly
>> decode the trace, which also can come from perf. And it'd also be much
>> less code. The only missing piece is the code that would allocate the
>> ring buffer for such events.
>
> The sideband data is needed if we want to reconstruct user program flow,
> but is it needed to reconstruct kernel panic path?

You are not really interested in the panic path as much as events
leading up to the panic and those usually have context, which is much
easier to reconstruct with sideband info. Some of it you can reconstruct
by walking kernel's data structures, but that is not reliable after the
panic.

Regards,
--
Alex


Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-29 Thread Takao Indoh
On 2015/07/29 15:08, Alexander Shishkin wrote:
> Takao Indoh  writes:
> 
>> This patch provides Intel PT logging feature. When system boots with a
>> parameter "intel_pt_log", log buffers for Intel PT are allocated and
>> logging starts, then processor flow information is written in the log
>> buffer by hardware like flight recorder. This is very helpful to
>> investigate a cause of kernel panic.
>>
>> The log buffer size is specified by the parameter
>> "intel_pt_log_buf_len=". This buffer is used as circular buffer,
>> therefore old events are overwritten by new events.
> 
> [skip]
> 
>> +static void enable_pt(int enable)
>> +{
>> +u64 ctl;
>> +
>> +rdmsrl(MSR_IA32_RTIT_CTL, ctl);
> 
> Ideally, you shouldn't need this rdmsr(), because in this code you
> should know exactly which ctl bits you need set when you enable.

I see, I'll remove this rdmsr in the next version.

> 
>> +
>> +if (enable)
>> +ctl |= RTIT_CTL_TRACEEN;
>> +else
>> +ctl &= ~RTIT_CTL_TRACEEN;
>> +
>> +wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +}
> 
> But the bigger problem with this approach is that it duplicates the
> existing driver's functionality and some of the code, which just makes
> it harder to maintain, among other things.
> 
> Instead, we should be able to use the existing perf functionality to
> enable the system-wide tracing, so that it goes through the

"existing driver" means PMU driver (perf_event_intel_pt.c)?

The feature in these patches is a sort of flight recorder. Once it
starts, it never stops and does not export anything to user space; it just
captures data with minimum overhead in preparation for a kernel panic. This
usage is different from perf, and therefore I'm not sure whether this
feature can be implemented using the perf infrastructure.

> driver. Another thing to remember is that you'd also need some of the
> sideband data (vm mappings, context switches) to be able to properly
> decode the trace, which also can come from perf. And it'd also be much
> less code. The only missing piece is the code that would allocate the
> ring buffer for such events.

The sideband data is needed if we want to reconstruct user program flow,
but is it needed to reconstruct the kernel panic path?

Thanks,
Takao Indoh


> 
> Something like:
> 
> static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);
> 
> static struct perf_event_attr perf_kdump_attr;
> 
> ...
> 
> static int perf_kdump_init(void)
> {
>  struct perf_event *event;
>  int cpu;
> 
>  get_online_cpus();
>  for_each_possible_cpu(cpu) {
>  event = perf_create_kernel_counter(&perf_kdump_attr,
>  cpu, NULL,
>  NULL, NULL);
> 
>   ...
> 
>  ret = rb_alloc_kernel(event, perf_kdump_data_size, 
> perf_kdump_aux_size);
> 
>  ...
>  
>  per_cpu(perf_kdump_event, cpu) = event;
>  }
>  put_online_cpus();
> }
> 




Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-28 Thread Alexander Shishkin
Takao Indoh  writes:

> This patch provides Intel PT logging feature. When system boots with a
> parameter "intel_pt_log", log buffers for Intel PT are allocated and
> logging starts, then processor flow information is written in the log
> buffer by hardware like flight recorder. This is very helpful to
> investigate a cause of kernel panic.
>
> The log buffer size is specified by the parameter
> "intel_pt_log_buf_len=". This buffer is used as circular buffer,
> therefore old events are overwritten by new events.

[skip]

> +static void enable_pt(int enable)
> +{
> + u64 ctl;
> +
> + rdmsrl(MSR_IA32_RTIT_CTL, ctl);

Ideally, you shouldn't need this rdmsr(), because in this code you
should know exactly which ctl bits you need set when you enable.
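
As a rough illustration of this point: if the logger owns the register, the
value to write can be computed from constants rather than read back. The
sketch below is plain userspace C. The bit positions follow the
IA32_RTIT_CTL layout, the set of bits mirrors what the patch's
setup_pt_ctl_register() ORs in, and PT_LOG_CTL_BASE and pt_log_ctl_value()
are hypothetical names that appear in neither the patch nor the existing
driver.

```c
#include <stdint.h>

/* Bit positions within IA32_RTIT_CTL. */
#define RTIT_CTL_TRACEEN   (1ULL << 0)
#define RTIT_CTL_OS        (1ULL << 2)
#define RTIT_CTL_USR       (1ULL << 3)
#define RTIT_CTL_TOPA      (1ULL << 8)
#define RTIT_CTL_TSC_EN    (1ULL << 10)
#define RTIT_CTL_BRANCH_EN (1ULL << 13)

/* Hypothetical: the fixed configuration the logger always wants. */
#define PT_LOG_CTL_BASE (RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TOPA | \
			 RTIT_CTL_TSC_EN | RTIT_CTL_BRANCH_EN)

/* Return the full register value for the requested state, with no read. */
uint64_t pt_log_ctl_value(int enable)
{
	return enable ? (PT_LOG_CTL_BASE | RTIT_CTL_TRACEEN) : PT_LOG_CTL_BASE;
}
```

In the kernel this could reduce enable_pt() to a single
wrmsrl(MSR_IA32_RTIT_CTL, pt_log_ctl_value(enable)), assuming nothing else
on the system modifies the register.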

> +
> + if (enable)
> + ctl |= RTIT_CTL_TRACEEN;
> + else
> + ctl &= ~RTIT_CTL_TRACEEN;
> +
> + wrmsrl(MSR_IA32_RTIT_CTL, ctl);
> +}

But the bigger problem with this approach is that it duplicates the
existing driver's functionality and some of the code, which just makes
it harder to maintain, among other things.

Instead, we should be able to use the existing perf functionality to
enable the system-wide tracing, so that it goes through the
driver. Another thing to remember is that you'd also need some of the
sideband data (vm mappings, context switches) to be able to properly
decode the trace, which also can come from perf. And it'd also be much
less code. The only missing piece is the code that would allocate the
ring buffer for such events.

Something like:

static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);

static struct perf_event_attr perf_kdump_attr;

...

static int perf_kdump_init(void)
{
	struct perf_event *event;
	int cpu, ret;

	get_online_cpus();
	for_each_possible_cpu(cpu) {
		event = perf_event_create_kernel_counter(&perf_kdump_attr,
							 cpu, NULL,
							 NULL, NULL);

		...

		/* rb_alloc_kernel() does not exist yet; it is the missing piece */
		ret = rb_alloc_kernel(event, perf_kdump_data_size,
				      perf_kdump_aux_size);

		...

		per_cpu(perf_kdump_event, cpu) = event;
	}
	put_online_cpus();

	return 0;
}


[PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-28 Thread Takao Indoh
This patch provides an Intel PT logging feature. When the system boots with
the parameter "intel_pt_log", log buffers for Intel PT are allocated and
logging starts. Processor flow information is then written to the log
buffer by hardware, like a flight recorder. This is very helpful for
investigating the cause of a kernel panic.

The log buffer size is specified by the parameter
"intel_pt_log_buf_len=". The buffer is used as a circular buffer, so
old events are overwritten by new events.
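
The overwrite behaviour can be pictured with a small software analogy. The
sketch below is plain userspace C and only illustrates the wrap-around; in
the patch the wrapping is done by the PT hardware writing through ToPA
tables, not by code like this. RING_SIZE, struct ring and ring_write() are
hypothetical names.

```c
#include <stddef.h>

#define RING_SIZE 4 /* tiny on purpose, so wrap-around is easy to see */

struct ring {
	unsigned char data[RING_SIZE];
	size_t head; /* next write position */
	int wrapped; /* set once the write position has wrapped to the start */
};

/* Append len bytes; once the buffer is full, the oldest bytes are lost. */
void ring_write(struct ring *r, const unsigned char *src, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		r->data[r->head] = src[i];
		r->head = (r->head + 1) % RING_SIZE;
		if (r->head == 0)
			r->wrapped = 1;
	}
}
```

After a wrap, the oldest surviving byte sits at head, so a reader starts
there and reads RING_SIZE bytes to recover events in order.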

Signed-off-by: Takao Indoh 
---
 arch/x86/Kconfig                   |  16 ++
 arch/x86/kernel/cpu/Makefile       |   2 +
 arch/x86/kernel/cpu/intel_pt_log.c | 288 
 3 files changed, 306 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced1..c31400f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1658,6 +1658,22 @@ config X86_INTEL_MPX
 
  If unsure, say N.
 
+config X86_INTEL_PT_LOG
+   prompt "Intel PT logger"
+   def_bool n
+   depends on CPU_SUP_INTEL
+   ---help---
+ Intel PT is a hardware feature that can capture information
+ about program execution flow. Once Intel PT is enabled, the
+ events which change program flow, like branch instructions,
+ exceptions, interrupts, traps and so on, are logged in
+ memory.
+
+ This option enables starting Intel PT logging at boot
+ time. When a kernel panic occurs, the Intel PT log buffer can be
+ retrieved from the crash dump file and used to reconstruct the
+ detailed flow that led to the panic.
+
 config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 77d371c..24629ff 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -58,6 +58,8 @@ obj-$(CONFIG_X86_LOCAL_APIC)  += perfctr-watchdog.o perf_event_amd_ibs.o
 
 obj-$(CONFIG_HYPERVISOR_GUEST) += vmware.o hypervisor.o mshyperv.o
 
+obj-$(CONFIG_X86_INTEL_PT_LOG) += intel_pt_log.o
+
 ifdef CONFIG_X86_FEATURE_NAMES
 quiet_cmd_mkcapflags = MKCAP   $@
   cmd_mkcapflags = $(CONFIG_SHELL) $(srctree)/$(src)/mkcapflags.sh $< $@
diff --git a/arch/x86/kernel/cpu/intel_pt_log.c b/arch/x86/kernel/cpu/intel_pt_log.c
new file mode 100644
index 0000000..b1c4d66
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_log.c
@@ -0,0 +1,288 @@
+/*
+ * Intel Processor Trace Logger
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+
+#define PT_LOG_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+struct pt_log_buf {
+   int cpu;
+
+   void **region;  /* array of pointer to output region */
+   int region_size;/* size of region array */
+   int region_order;   /* page order of region */
+
+   void **tbl; /* array of pointer to ToPA table */
+   int tbl_size;   /* size of tbl array */
+
+   /* Saved registers on panic */
+   u64 saved_msr_ctl;
+   u64 saved_msr_status;
+   u64 saved_msr_output_base;
+   u64 saved_msr_output_mask;
+};
+
+static int pt_log_enabled;
+static int pt_log_buf_nr_pages = 1024; /* number of pages for log buffer */
+
+static DEFINE_PER_CPU(struct pt_log_buf, pt_log_buf_ptr);
+static struct cpumask pt_cpu_mask;
+
+static void enable_pt(int enable)
+{
+   u64 ctl;
+
+   rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+   if (enable)
+   ctl |= RTIT_CTL_TRACEEN;
+   else
+   ctl &= ~RTIT_CTL_TRACEEN;
+
+   wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+}
+
+void save_intel_pt_registers(void)
+{
+   struct pt_log_buf *buf = this_cpu_ptr(&pt_log_buf_ptr);
+
+   if (!cpumask_test_cpu(smp_processor_id(), &pt_cpu_mask))
+   return;
+
+   enable_pt(0);
+
+   rdmsrl(MSR_IA32_RTIT_CTL, buf->saved_msr_ctl);
+   rdmsrl(MSR_IA32_RTIT_STATUS, buf->saved_msr_status);
+   rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, buf->saved_msr_output_base);
+   rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, buf->saved_msr_output_mask);
+}
+
+static void setup_pt_ctl_register(void)
+{
+   u64 reg;
+
+   rdmsrl(MSR_IA32_RTIT_CTL, reg);
+
+   reg |= RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TOPA | RTIT_CTL_TSC_EN | RTIT_CTL_BRANCH_EN;
+
+   wrmsrl(MSR_IA32_RTIT_CTL, reg);
+}
+
+static void setup_pt_output_register(void *base, unsigned int topa_idx,
+unsigned int output_off)
+{
+   u64 reg;
+
+   wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(base));
+
+   reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+   wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+static void *pt_alloc_pages(void **buf, int *index, int node, int order)
+{
+   struct page *page;
+   void *ptr = NULL;
+
+   page = alloc_pages_node(node, PT_LOG_GFP, order);
+   if (p