Re: [PATCH] nvme: Enable acceleration feature of A64FX processor

2019-02-20 Thread Takao Indoh
On Thu, Feb 14, 2019 at 08:44:48PM +, Elliott, Robert (Persistent Memory) 
wrote:
> 
> 
> > -----Original Message-----
> > From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On Behalf 
> > Of Keith Busch
> > Sent: Tuesday, February 5, 2019 8:39 AM
> > To: Takao Indoh 
> > Cc: Takao Indoh ; s...@grimberg.me; 
> > linux-kernel@vger.kernel.org; linux-
> > n...@lists.infradead.org; ax...@fb.com; h...@lst.de
> > Subject: Re: [PATCH] nvme: Enable acceleration feature of A64FX processor
> > 
> > On Tue, Feb 05, 2019 at 09:56:05PM +0900, Takao Indoh wrote:
> > > On Fri, Feb 01, 2019 at 07:54:14AM -0700, Keith Busch wrote:
> > > > On Fri, Feb 01, 2019 at 09:46:15PM +0900, Takao Indoh wrote:
> > > > > From: Takao Indoh 
> > > > >
> > > > > Fujitsu A64FX processor has a feature to accelerate data transfer of
> > > > > internal bus by relaxed ordering. It is enabled when the bit 56 of dma
> > > > > address is set to 1.
> > > >
> > > > Wait, what? RO is a standard PCIe TLP attribute. Why would we need this?
> > >
> > > I should have explained this patch more carefully.
> > >
> > > Standard PCIe devices can use Relaxed Ordering (RO) by setting Attr
> > > field in the TLP header, however, this mechanism cannot be utilized if
> > > the device does not support RO feature. Fujitsu A64FX processor has an
> > > alternate feature to enable RO in its Root Port by setting the bit 56 of
> > > DMA address. This mechanism enables to utilize RO feature even if the
> > > device does not support standard PCIe RO.
> > 
> > I think you're better off just purchasing devices that support the
> > capability per spec rather than with a non-standard work around.
> > 
> 
> The PCIe and NVMe specifications don't standardize a way to tell the device
> when to use RO, which leads to system workarounds like this.
> 
> The Enable Relaxed Ordering bit defined by PCIe tells the device when it
> cannot use RO, but doesn't advise when it should or shall use RO.
> 
> For SCSI Express (SOP+PQI), we were going to allow specifying these
> on a per-command basis:
> * TLP attributes (No Snoop, Relaxed Ordering, ID-based Ordering)
> * TLP processing hints (Processing Hints and Steering Tags)
> 
> to be used by the data transfers for the command. In some systems, one
> setting per queue or per device might suffice. Transactions to the
> queues and doorbells require stronger ordering.
> 
> For this workaround:
> * making an extra pass through the SGL to set the address bit is 
> inefficient; it should be done as the SGL is created.

Thanks for your comment, do you mean this should be done in
nvme_pci_setup_sgls()/nvme_pci_setup_prps()?
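
To illustrate the suggestion quoted above (setting the address bit while the SGL is built instead of in a second pass), here is a minimal user-space sketch in C. The struct sgl_desc type, the fill_sgl() helper and the A64FX_RELAX_BIT name are hypothetical and are not taken from the driver. The sketch also ignores the little-endian conversion that the real descriptors need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an SGL data descriptor (not the kernel's struct). */
struct sgl_desc {
	uint64_t addr;
	uint32_t length;
};

/* Bit 56 of the DMA address, as described in the patch. */
#define A64FX_RELAX_BIT (1ULL << 56)

/*
 * Fill the descriptors and apply the relaxed-ordering tag in the same loop,
 * so no second traversal of the list is needed.
 */
void fill_sgl(struct sgl_desc *out, const uint64_t *addrs,
	      const uint32_t *lens, size_t n, int relax)
{
	for (size_t i = 0; i < n; i++) {
		out[i].addr = relax ? (addrs[i] | A64FX_RELAX_BIT) : addrs[i];
		out[i].length = lens[i];
	}
}

/* Small usage example: one tagged entry keeps its length and gains bit 56. */
void fill_sgl_demo(void)
{
	const uint64_t addrs[1] = { 0x1000 };
	const uint32_t lens[1] = { 4096 };
	struct sgl_desc d[1];

	fill_sgl(d, addrs, lens, 1, 1);
	assert(d[0].addr == (0x1000ULL | A64FX_RELAX_BIT));
	assert(d[0].length == 4096);
}
```

Whether the real change belongs in nvme_pci_setup_sgls() or elsewhere is exactly the open question here; the sketch only shows the single-pass shape.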

> * why doesn't it support PRP Lists?

This patch does not support PRP because PRP is used for small transfers,
where this feature does not give much performance improvement. But I can
add PRP support to improve performance for devices that are only compliant
with NVMe Spec 1.0 or do not support SGL.

> * how does this interact with an iommu, if there is one? Must the 
> address with bit 56 also be granted permission, or is that
> stripped off before any iommu comparisons?

The latter. Bit 56 is cleared in the Root Port before the address is passed
to the IOMMU.
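
As a rough model of that behaviour, the following user-space C sketch tags an address on the host side and strips the tag on the root-port side before anything else looks at the address. The function names and RELAX_BIT are hypothetical; the real stripping is done by hardware, not by software like this.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 56 of the DMA address, as described in the patch. */
#define RELAX_BIT (1ULL << 56)

/* Host side: tag a data-buffer address to request relaxed ordering. */
uint64_t tag_data_addr(uint64_t dma_addr)
{
	return dma_addr | RELAX_BIT;
}

/*
 * Root-port side (modelled): record whether relaxed ordering was requested,
 * then return the address with the tag removed, which is what an IOMMU
 * lookup would see.
 */
uint64_t root_port_forward(uint64_t bus_addr, int *relaxed)
{
	*relaxed = (bus_addr & RELAX_BIT) != 0;
	return bus_addr & ~RELAX_BIT;
}

/* Usage example: the IOMMU-visible address equals the original one. */
void root_port_demo(void)
{
	int relaxed = 0;
	uint64_t seen = root_port_forward(tag_data_addr(0x80001000ULL), &relaxed);

	assert(relaxed == 1);
	assert(seen == 0x80001000ULL);
}
```

Under this model no extra IOMMU mapping is needed for the tagged address, because the tag never reaches the IOMMU.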

Thanks,
Takao Indoh


Re: [PATCH] nvme: Enable acceleration feature of A64FX processor

2019-02-13 Thread Takao Indoh
On Tue, Feb 05, 2019 at 05:13:47PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 05, 2019 at 07:39:06AM -0700, Keith Busch wrote:
> > > Standard PCIe devices can use Relaxed Ordering (RO) by setting Attr
> > > field in the TLP header, however, this mechanism cannot be utilized if
> > > the device does not support RO feature. Fujitsu A64FX processor has an
> > > alternate feature to enable RO in its Root Port by setting the bit 56 of
> > > DMA address. This mechanism enables to utilize RO feature even if the
> > > device does not support standard PCIe RO.
> > 
> > I think you're better off just purchasing devices that support the
> > capability per spec rather than with a non-standard work around.
> 
> Agreed, this seems like a pretty gross hack.

Ok, let me think about how I should change this patch.
I think the problem with this patch is that it adds processor-specific
code to the common NVMe driver. Is this correct, or is there another
problem? It would be great if you could give me a hint on how to improve
this patch.

Thanks,
Takao Indoh



Re: [PATCH] nvme: Enable acceleration feature of A64FX processor

2019-02-05 Thread Takao Indoh
On Fri, Feb 01, 2019 at 04:51:20PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 01, 2019 at 09:46:15PM +0900, Takao Indoh wrote:
> > From: Takao Indoh 
> > 
> > Fujitsu A64FX processor has a feature to accelerate data transfer of
> > internal bus by relaxed ordering. It is enabled when the bit 56 of dma
> > address is set to 1.
> > 
> > This patch introduces this acceleration feature to the NVMe driver to
> > enhance NVMe device performance.
> 
> This has absolutely no business in a PCIe driver, sorry.
> 

First, let me explain this feature in detail. I wrote the same
explanation in my reply to Keith, but I repeat it here just in case.

Standard PCIe devices can use Relaxed Ordering (RO) by setting the Attr
field in the TLP header; however, this mechanism cannot be used if
the device does not support the RO feature. The Fujitsu A64FX processor has
an alternate feature that enables RO in its Root Port when bit 56 of the
DMA address is set. This mechanism makes it possible to use RO even if the
device does not support standard PCIe RO.

A data packet whose DMA address has bit 56 set is transferred from
the device to the PCI root port with Strong Ordering (SO), and then it
is transferred with RO to the host memory.

This patch adds new code to the NVMe driver to set bit 56 of the DMA address
to use this feature. The reason I do this in the NVMe driver is that
it is the only place where we can traverse the SGL to update the
DMA addresses. We can transfer data buffers with RO, but we cannot use
RO for writes to the admin completion queue and the I/O completion
queue from the NVMe controller to the host. These writes need to be done
with SO to avoid data corruption. This patch scans the data buffers queued
in the SGL and updates their DMA addresses so that data buffers are sent
with RO.

Thanks,
Takao Indoh




Re: [PATCH] nvme: Enable acceleration feature of A64FX processor

2019-02-05 Thread Takao Indoh
On Fri, Feb 01, 2019 at 07:54:14AM -0700, Keith Busch wrote:
> On Fri, Feb 01, 2019 at 09:46:15PM +0900, Takao Indoh wrote:
> > From: Takao Indoh 
> > 
> > Fujitsu A64FX processor has a feature to accelerate data transfer of
> > internal bus by relaxed ordering. It is enabled when the bit 56 of dma
> > address is set to 1.
> 
> Wait, what? RO is a standard PCIe TLP attribute. Why would we need this?

I should have explained this patch more carefully.

Standard PCIe devices can use Relaxed Ordering (RO) by setting the Attr
field in the TLP header; however, this mechanism cannot be used if
the device does not support the RO feature. The Fujitsu A64FX processor has
an alternate feature that enables RO in its Root Port when bit 56 of the
DMA address is set. This mechanism makes it possible to use RO even if the
device does not support standard PCIe RO.

A data packet whose DMA address has bit 56 set is transferred from
the device to the PCI root port with Strong Ordering (SO), and then it
is transferred with RO to the host memory.

This patch adds new code to the NVMe driver to set bit 56 of the DMA address
to use this feature. The reason I do this in the NVMe driver is that
it is the only place where we can traverse the SGL to update the
DMA addresses. We can transfer data buffers with RO, but we cannot use
RO for writes to the admin completion queue and the I/O completion
queue from the NVMe controller to the host. These writes need to be done
with SO to avoid data corruption. This patch scans the data buffers queued
in the SGL and updates their DMA addresses so that data buffers are sent
with RO.

Thanks,
Takao Indoh



[PATCH] nvme: Enable acceleration feature of A64FX processor

2019-02-01 Thread Takao Indoh
From: Takao Indoh 

Fujitsu A64FX processor has a feature to accelerate data transfer of
internal bus by relaxed ordering. It is enabled when the bit 56 of dma
address is set to 1.

This patch introduces this acceleration feature to the NVMe driver to
enhance NVMe device performance.

Signed-off-by: Takao Indoh 
---
 drivers/nvme/host/core.c |  6 
 drivers/nvme/host/nvme.h |  7 +
 drivers/nvme/host/pci.c  | 65 
 3 files changed, 78 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 150e49723c15..8167c3756b05 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -37,6 +37,9 @@
 
 #define NVME_MINORS		(1U << MINORBITS)
 
+DEFINE_STATIC_KEY_FALSE(nvme_quirk_a64fx_force_relax_key);
+EXPORT_SYMBOL_GPL(nvme_quirk_a64fx_force_relax_key);
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -2493,6 +2496,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->quirks &= ~NVME_QUIRK_NO_DEEPEST_PS;
}
 
+	if (ctrl->quirks & NVME_QUIRK_A64FX_FORCE_RELAX)
+		static_branch_enable(&nvme_quirk_a64fx_force_relax_key);
+
ctrl->crdt[0] = le16_to_cpu(id->crdt1);
ctrl->crdt[1] = le16_to_cpu(id->crdt2);
ctrl->crdt[2] = le16_to_cpu(id->crdt3);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ab961bdeea89..fe02d021ee9c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 extern unsigned int nvme_io_timeout;
 #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
@@ -37,6 +38,8 @@ extern struct workqueue_struct *nvme_wq;
 extern struct workqueue_struct *nvme_reset_wq;
 extern struct workqueue_struct *nvme_delete_wq;
 
+DECLARE_STATIC_KEY_FALSE(nvme_quirk_a64fx_force_relax_key);
+
 enum {
NVME_NS_LBA = 0,
NVME_NS_LIGHTNVM= 1,
@@ -95,6 +98,10 @@ enum nvme_quirks {
 * Ignore device provided subnqn.
 */
NVME_QUIRK_IGNORE_DEV_SUBNQN= (1 << 8),
+	/*
+	 * Force relaxed ordering for A64FX controller
+	 */
+	NVME_QUIRK_A64FX_FORCE_RELAX		= (1 << 9),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9bc585415d9b..cffe390d4c41 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -835,6 +835,45 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
return BLK_STS_OK;
 }
 
+static inline void nvme_pci_enable_a64fx_relax_bit(struct nvme_sgl_desc *sge)
+{
+	sge->addr |= (1ULL << 56);
+}
+
+/*
+ * The A64FX controller allows relaxed ordering when bit 56 of the DMA address
+ * is set, for performance enhancement.
+ *
+ * This traverses the SGL and sets the bit on every DMA address for
+ * data reads.
+ */
+static void nvme_pci_quirk_a64fx_force_relax(struct request *req,
+		struct nvme_rw_command *cmd, int entries)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_sgl_desc *sg_list;
+	int i, j;
+
+	/* do nothing if sgl is not used or command is not read */
+	if (!iod->use_sgl || cmd->opcode != nvme_cmd_read)
+		return;
+
+	if (entries == 1) {
+		nvme_pci_enable_a64fx_relax_bit(&cmd->dptr.sgl);
+		return;
+	}
+
+	i = 0; j = 0;
+	sg_list = nvme_pci_iod_list(req)[j];
+	do {
+		if (i == SGES_PER_PAGE) {
+			i = 0;
+			sg_list = nvme_pci_iod_list(req)[++j];
+		}
+		nvme_pci_enable_a64fx_relax_bit(&sg_list[i++]);
+	} while (--entries > 0);
+}
+
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
 {
@@ -869,6 +908,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
if (ret != BLK_STS_OK)
goto out_unmap;
 
+	if (static_branch_unlikely(&nvme_quirk_a64fx_force_relax_key))
+		nvme_pci_quirk_a64fx_force_relax(req, &cmnd->rw, nr_mapped);
+
ret = BLK_STS_IOERR;
if (blk_integrity_rq(req)) {
if (blk_rq_count_integrity_sg(q, req->bio) != 1)
@@ -2748,6 +2790,27 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
return 0;
 }
 
+/*
+ * PCI vendor id of Fujitsu and device id for root port in the A64FX processor
+ */
+#define PCI_VENDOR_ID_FUJITSU  0x10cf
+#define PCI_DEVICE_ID_FUJITSU_A64FX_ROOTPORT   0x1952
+
+static unsigned long check_system_vendor_acceleration(void)
+{
+   struct pci_dev *pdev_root;
+   /*
+* When Fujitsu A64FX Root Port is found, acceleration feature
+* can be enabled.
+*/
+ 

Re: [PATCH] perf/ring_buffer: Fix invalid page order

2016-11-15 Thread Takao Indoh

On 2016/11/15 17:48, Ingo Molnar wrote:
>
> * Takao Indoh <indou.ta...@jp.fujitsu.com> wrote:
>
> > In rb_alloc_aux_page(), a page order is set to MAX_ORDER when order is
> > greater than MAX_ORDER, but page order should be less than MAX_ORDER,
> > therefore alloc_pages_node fails at least once. This patch fixes page
> > order so that it can be always less than MAX_ORDER.
> >
> > Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
> > ---
> >  kernel/events/ring_buffer.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index 257fa46..3f76fdd 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -502,8 +502,8 @@ static struct page *rb_alloc_aux_page(int node, int order)
> >  {
> >  	struct page *page;
> >
> > -	if (order > MAX_ORDER)
> > -		order = MAX_ORDER;
> > +	if (order >= MAX_ORDER)
> > +		order = MAX_ORDER - 1;
> >
> >  	do {
> >  		page = alloc_pages_node(node, PERF_AUX_GFP, order);
>
> I'm wondering under what circumstances this allocation failure was seen in
> practice - why did others not hit this?


I found this when I ran perf with -m,2048. I think in most cases
users use the default buffer size, hence they do not notice.
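
To make the off-by-one concrete, here is a small user-space C sketch comparing the old and new clamping. It assumes MAX_ORDER is 11 purely for illustration, and that a 2048-page AUX buffer leads to a requested order of 11 (2048 = 2^11). The helper names are hypothetical and order_is_valid() only models the rule that a usable order must be less than MAX_ORDER.

```c
#include <assert.h>

/* Assumption: 11 is only an illustrative value for MAX_ORDER. */
#define MAX_ORDER 11

/* Clamping before the patch: order == MAX_ORDER slips through. */
int clamp_order_old(int order)
{
	if (order > MAX_ORDER)
		order = MAX_ORDER;
	return order;
}

/* Clamping after the patch: the result is always below MAX_ORDER. */
int clamp_order_new(int order)
{
	if (order >= MAX_ORDER)
		order = MAX_ORDER - 1;
	return order;
}

/* Models the allocator rule: only orders below MAX_ORDER can succeed. */
int order_is_valid(int order)
{
	return order >= 0 && order < MAX_ORDER;
}

/* Usage example for a request of order 11. */
void clamp_demo(void)
{
	assert(!order_is_valid(clamp_order_old(11)));
	assert(order_is_valid(clamp_order_new(11)));
}
```

With the old clamp the first allocation attempt uses an invalid order and fails; with the new clamp the first attempt is already a valid request.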

Thanks,
Takao Indoh


[PATCH] perf/ring_buffer: Fix invalid page order

2016-11-14 Thread Takao Indoh
In rb_alloc_aux_page(), the page order is clamped to MAX_ORDER when it is
greater than MAX_ORDER, but a valid page order must be less than MAX_ORDER,
so alloc_pages_node() fails at least once. This patch fixes the page
order so that it is always less than MAX_ORDER.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 kernel/events/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 257fa46..3f76fdd 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -502,8 +502,8 @@ static struct page *rb_alloc_aux_page(int node, int order)
 {
 	struct page *page;
 
-	if (order > MAX_ORDER)
-		order = MAX_ORDER;
+	if (order >= MAX_ORDER)
+		order = MAX_ORDER - 1;
 
 	do {
 		page = alloc_pages_node(node, PERF_AUX_GFP, order);
-- 
1.8.3.1



[tip:perf/core] perf, x86: Stop Intel PT before kdump starts

2015-11-23 Thread tip-bot for Takao Indoh
Commit-ID:  da06a43d3f3f3df87416f654fe15d29fecb5e321
Gitweb: http://git.kernel.org/tip/da06a43d3f3f3df87416f654fe15d29fecb5e321
Author: Takao Indoh 
AuthorDate: Wed, 4 Nov 2015 14:22:33 +0900
Committer:  Ingo Molnar 
CommitDate: Mon, 23 Nov 2015 09:58:26 +0100

perf, x86: Stop Intel PT before kdump starts

This patch stops Intel PT logging and saves its registers in memory
before kdump is started. This feature is needed to prevent Intel PT from
overwriting its log buffer after panic, and saved registers are needed to
find the last position where Intel PT wrote data.

After the crash dump is captured by kdump, users can retrieve the log buffer
from the vmcore and use it to investigate bad kernel behavior.

Signed-off-by: Takao Indoh 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: H.Peter Anvin 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Vivek Goyal 
Link: http://lkml.kernel.org/r/1446614553-6072-3-git-send-email-indou.ta...@jp.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/crash.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 2c1910f..58f3431 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -125,6 +126,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();
 
+	/*
+	 * Disable Intel PT to stop its logging
+	 */
+	cpu_emergency_stop_pt();
+
 	disable_local_APIC();
 }
 
@@ -169,6 +175,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();
 
+	/*
+	 * Disable Intel PT to stop its logging
+	 */
+	cpu_emergency_stop_pt();
+
 #ifdef CONFIG_X86_IO_APIC
 	/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
 	ioapic_zap_locks();
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:perf/core] perf/x86/intel/pt: Add interface to stop Intel PT logging

2015-11-23 Thread tip-bot for Takao Indoh
Commit-ID:  24cc12b17679f8e9046746f92fd377f589efc163
Gitweb: http://git.kernel.org/tip/24cc12b17679f8e9046746f92fd377f589efc163
Author: Takao Indoh 
AuthorDate: Wed, 4 Nov 2015 14:22:32 +0900
Committer:  Ingo Molnar 
CommitDate: Mon, 23 Nov 2015 09:58:26 +0100

perf/x86/intel/pt: Add interface to stop Intel PT logging

This patch adds a function for external components to stop Intel PT.
Basically this function is used when a kernel panic occurs. When it is
called, the intel_pt driver disables Intel PT and saves its registers
using pt_event_stop(), which is also used by the pmu.stop handler.

This function stops Intel PT only on the CPU it is called on; therefore
callers need to call it on each CPU to stop all logging.

Signed-off-by: Takao Indoh 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: H.Peter Anvin 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Vivek Goyal 
Link: http://lkml.kernel.org/r/1446614553-6072-2-git-send-email-indou.ta...@jp.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/intel_pt.h   | 10 ++
 arch/x86/kernel/cpu/perf_event_intel_pt.c |  9 +
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
new file mode 100644
index 0000000..e1a4117
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt.h
@@ -0,0 +1,10 @@
+#ifndef _ASM_X86_INTEL_PT_H
+#define _ASM_X86_INTEL_PT_H
+
+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
+void cpu_emergency_stop_pt(void);
+#else
+static inline void cpu_emergency_stop_pt(void) {}
+#endif
+
+#endif /* _ASM_X86_INTEL_PT_H */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 868e119..c0bbd10 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "perf_event.h"
 #include "intel_pt.h"
@@ -1122,6 +1123,14 @@ static int pt_event_init(struct perf_event *event)
 	return 0;
 }
 
+void cpu_emergency_stop_pt(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+	if (pt->handle.event)
+		pt_event_stop(pt->handle.event, PERF_EF_UPDATE);
+}
+
 static __init int pt_init(void)
 {
 	int ret, cpu, prior_warn = 0;
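
The commit message above notes that the function only acts on the CPU it runs on. The following user-space C sketch models that per-CPU behaviour with an array of contexts. All names here (pt_model, pt_ctx_model, emergency_stop_pt_on and so on) and the CPU count are hypothetical. In the kernel each CPU calls the function itself on the crash path, so the loop in stop_pt_everywhere() is only a simplification.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for this model */

/* Hypothetical per-CPU trace state: is tracing active, were registers saved. */
struct pt_model {
	int tracing;
	int saved_regs;
};

struct pt_model pt_ctx_model[MODEL_NR_CPUS];

/* Stop tracing on one CPU only, and only if it was active there. */
void emergency_stop_pt_on(int cpu)
{
	struct pt_model *pt = &pt_ctx_model[cpu];

	if (pt->tracing) {
		pt->saved_regs = 1;
		pt->tracing = 0;
	}
}

/* Stopping all logging requires visiting every CPU. */
void stop_pt_everywhere(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		emergency_stop_pt_on(cpu);
}

/* Returns 1 if any CPU is still tracing. */
int any_tracing(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (pt_ctx_model[cpu].tracing)
			return 1;
	return 0;
}

/* Usage example: stopping one CPU leaves tracing active on another. */
void stop_pt_demo(void)
{
	pt_ctx_model[0].tracing = 1;
	pt_ctx_model[1].tracing = 1;
	emergency_stop_pt_on(0);
	assert(any_tracing());
	stop_pt_everywhere();
	assert(!any_tracing());
}
```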


Re: [PATCH v2 0/2] Stop Intel Processor Trace logging on panic

2015-11-12 Thread Takao Indoh
On 2015/11/12 21:38, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 09:05:11PM +0900, Takao Indoh wrote:
>> Ping, any comments on these patches?
>>
> 
> I've taken them, they should appear in tip sometime after the merge
> window closes.
> 

Ok, thanks.

Thanks,
Takao Indoh



Re: [PATCH v2 0/2] Stop Intel Processor Trace logging on panic

2015-11-12 Thread Takao Indoh
Ping, any comments on these patches?

Thanks,
Takao Indoh

On 2015/11/04 14:22, Takao Indoh wrote:
> Hi all,
> 
> This patch series provides a feature to stop Intel Processor Trace
> (Intel PT) logging and save its registers in memory on panic.
> 
> Intel PT is a new feature of the Intel "Broadwell" CPU; it captures
> information about program execution flow. Here is an article about Intel
> PT.
> https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing
> 
> Once Intel PT is enabled, events that change program flow, such as
> branch instructions, exceptions, interrupts and traps, are
> logged in memory. This is very useful for debugging because we can
> see the detailed behavior of software.
> 
> When a kernel panic occurs while you are running a perf command with Intel
> PT (with the -e intel_pt// option), these patches disable Intel PT and save
> its registers in memory. After the crash dump is captured by kdump, you
> can retrieve the Intel PT log buffer from the vmcore and investigate kernel
> behavior. I have not yet made a tool to salvage the Intel PT log buffer from
> the vmcore, but I'll do so once these patches are accepted.
> 
> changelog:
> v2:
> - Define function in intel_pt.h with static inline
> 
> v1:
> https://lkml.org/lkml/2015/10/28/136
> 
> 
> Background:
> These patches are part of a patch series I posted before; the original
> discussion is below.
> 
> x86: Intel Processor Trace Logger
> v1: https://lkml.org/lkml/2015/7/29/6
> v2: https://lkml.org/lkml/2015/9/8/24
> 
> The purpose of the original patches is to introduce an in-kernel logger
> using Intel PT. To implement it I need to add some APIs to control the perf
> counter and ring buffer in the kernel. Alexander Shishkin is working on such
> APIs for his work on using Intel PT for process core dumps. Apart
> from such APIs, the feature to save Intel PT registers on panic is
> helpful for normal perf command users as I described above, so I
> separated the feature from the original patches.
> 
> Takao Indoh (2):
>perf/x86/intel/pt: Add interface to stop Intel PT logging
>x86: Stop Intel PT before kdump starts
> 
>   arch/x86/include/asm/intel_pt.h   |   10 ++
>   arch/x86/kernel/cpu/perf_event_intel_pt.c |9 +
>   arch/x86/kernel/crash.c   |   11 +++
>   3 files changed, 30 insertions(+), 0 deletions(-)
>   create mode 100644 arch/x86/include/asm/intel_pt.h
> 
> 




Re: [PATCH v2 0/2] Stop Intel Processor Trace logging on panic

2015-11-12 Thread Takao Indoh
On 2015/11/12 21:38, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 09:05:11PM +0900, Takao Indoh wrote:
>> Ping, any comments on these patches?
>>
> 
> I've taken them, they should appear in tip sometime after the merge
> window closes.
> 

Ok, thanks.

Thanks,
Takao Indoh



Re: [PATCH v2 0/2] Stop Intel Processor Trace logging on panic

2015-11-12 Thread Takao Indoh
Ping, any comments on these patches?

Thanks,
Takao Indoh

On 2015/11/04 14:22, Takao Indoh wrote:
> Hi all,
> 
> This patch series provides a feature to stop Intel Processor Trace
> (Intel PT) logging and save its registers in memory on panic.
> 
> Intel PT is a new feature of the Intel "Broadwell" CPU; it captures
> information about program execution flow. Here is an article about Intel
> PT.
> https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing
> 
> Once Intel PT is enabled, events that change program flow, such as
> branch instructions, exceptions, interrupts, and traps, are logged in
> memory. This is very useful for debugging because it shows the detailed
> behavior of the software.
> 
> When a kernel panic occurs while you are running a perf command with
> Intel PT (with the -e intel_pt// option), these patches disable Intel PT
> and save its registers in memory. After the crash dump is captured by
> kdump, you can retrieve the Intel PT log buffer from the vmcore and
> investigate kernel behavior. I have not yet made a tool to salvage the
> Intel PT log buffer from a vmcore, but I will do so once these patches
> are accepted.
> 
> changelog:
> v2:
> - Define function in intel_pt.h with static inline
> 
> v1:
> https://lkml.org/lkml/2015/10/28/136
> 
> 
> Background:
> These patches are part of a patch series I posted before; the original
> discussion is below.
> 
> x86: Intel Processor Trace Logger
> v1: https://lkml.org/lkml/2015/7/29/6
> v2: https://lkml.org/lkml/2015/9/8/24
> 
> The purpose of the original patches is to introduce an in-kernel logger
> using Intel PT. To implement it I need to add some APIs to control perf
> counters and the ring buffer in the kernel. Alexander Shishkin is working
> on such APIs for his work on using Intel PT for process core dumps. Apart
> from such APIs, the feature to save Intel PT registers on panic is
> helpful for normal perf command users as I described above, so I have
> separated this feature from the original patches.
> 
> Takao Indoh (2):
>   perf/x86/intel/pt: Add interface to stop Intel PT logging
>   x86: Stop Intel PT before kdump starts
> 
>   arch/x86/include/asm/intel_pt.h   |   10 ++
>   arch/x86/kernel/cpu/perf_event_intel_pt.c |9 +
>   arch/x86/kernel/crash.c   |   11 +++
>   3 files changed, 30 insertions(+), 0 deletions(-)
>   create mode 100644 arch/x86/include/asm/intel_pt.h
> 
> 
> 




[PATCH v2 0/2] Stop Intel Processor Trace logging on panic

2015-11-03 Thread Takao Indoh
Hi all,

This patch series provides a feature to stop Intel Processor Trace
(Intel PT) logging and save its registers in memory on panic.

Intel PT is a new feature of the Intel "Broadwell" CPU; it captures
information about program execution flow. Here is an article about Intel
PT.
https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing

Once Intel PT is enabled, events that change program flow, such as
branch instructions, exceptions, interrupts, and traps, are logged in
memory. This is very useful for debugging because it shows the detailed
behavior of the software.

When a kernel panic occurs while you are running a perf command with
Intel PT (with the -e intel_pt// option), these patches disable Intel PT
and save its registers in memory. After the crash dump is captured by
kdump, you can retrieve the Intel PT log buffer from the vmcore and
investigate kernel behavior. I have not yet made a tool to salvage the
Intel PT log buffer from a vmcore, but I will do so once these patches
are accepted.

changelog:
v2:
- Define function in intel_pt.h with static inline

v1:
https://lkml.org/lkml/2015/10/28/136


Background:
These patches are part of a patch series I posted before; the original
discussion is below.

x86: Intel Processor Trace Logger
v1: https://lkml.org/lkml/2015/7/29/6
v2: https://lkml.org/lkml/2015/9/8/24

The purpose of the original patches is to introduce an in-kernel logger
using Intel PT. To implement it I need to add some APIs to control perf
counters and the ring buffer in the kernel. Alexander Shishkin is working
on such APIs for his work on using Intel PT for process core dumps. Apart
from such APIs, the feature to save Intel PT registers on panic is
helpful for normal perf command users as I described above, so I have
separated this feature from the original patches.

Takao Indoh (2):
  perf/x86/intel/pt: Add interface to stop Intel PT logging
  x86: Stop Intel PT before kdump starts

 arch/x86/include/asm/intel_pt.h   |   10 ++
 arch/x86/kernel/cpu/perf_event_intel_pt.c |9 +
 arch/x86/kernel/crash.c   |   11 +++
 3 files changed, 30 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt.h
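
The v2 changelog item ("Define function in intel_pt.h with static inline")
can be illustrated with a small userspace sketch. This is not kernel code:
DEMO_HAVE_PT is a hypothetical macro standing in for the
CONFIG_PERF_EVENTS && CONFIG_CPU_SUP_INTEL test, and all demo_* names are
made up for the example. The idea is that when the feature is compiled
out, an empty static inline stub keeps callers free of #ifdefs and avoids
emitting a global definition from every file that includes the header.

```c
/* Userspace illustration only; all demo_* names are hypothetical. */

static int demo_stop_calls;	/* counts real stop requests */

#ifdef DEMO_HAVE_PT
/* Feature built in: the real function does the work. */
static void demo_emergency_stop_pt(void)
{
	demo_stop_calls++;
}
#else
/*
 * Feature compiled out: an empty static inline stub. Each file that
 * includes the header gets a private copy, so no duplicate global
 * symbol is created, and the call compiles away to nothing.
 */
static inline void demo_emergency_stop_pt(void) {}
#endif

/* A caller that does not need to know whether the feature exists. */
static int demo_crash_path(void)
{
	demo_emergency_stop_pt();
	return demo_stop_calls;
}
```

Built without DEMO_HAVE_PT, demo_crash_path() calls the stub and the
counter stays at zero; built with -DDEMO_HAVE_PT, the same caller reaches
the real function.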




[PATCH v2 1/2] perf/x86/intel/pt: Add interface to stop Intel PT logging

2015-11-03 Thread Takao Indoh
This patch adds a function that lets an external component stop Intel PT.
It is mainly intended for use when a kernel panic occurs. When it is
called, the intel_pt driver disables Intel PT and saves its registers using
pt_event_stop, which is also used by the pmu.stop handler. The function
stops Intel PT only on the CPU it runs on, so the caller needs to call it
on each CPU to stop all logging.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/include/asm/intel_pt.h   |   10 ++
 arch/x86/kernel/cpu/perf_event_intel_pt.c |9 +
 2 files changed, 19 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt.h

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
new file mode 100644
index 000..e1a4117
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt.h
@@ -0,0 +1,10 @@
+#ifndef _ASM_X86_INTEL_PT_H
+#define _ASM_X86_INTEL_PT_H
+
+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
+void cpu_emergency_stop_pt(void);
+#else
+static inline void cpu_emergency_stop_pt(void) {}
+#endif
+
+#endif /* _ASM_X86_INTEL_PT_H */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 4216928..a638b5b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <asm/intel_pt.h>
 
 #include "perf_event.h"
 #include "intel_pt.h"
@@ -1125,6 +1126,14 @@ static int pt_event_init(struct perf_event *event)
 	return 0;
 }
 
+void cpu_emergency_stop_pt(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+	if (pt->handle.event)
+		pt_event_stop(pt->handle.event, PERF_EF_UPDATE);
+}
+
 static __init int pt_init(void)
 {
 	int ret, cpu, prior_warn = 0;
-- 
1.7.1
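
As a rough illustration of why the function has to be called on every
CPU, here is a userspace sketch that models the per-CPU context as a
plain array. The names (demo_pt, demo_pt_ctx, DEMO_NR_CPUS) are
hypothetical, and the simulated function takes a CPU index only because
userspace has no notion of "this CPU"; the real function takes no
argument and acts on the CPU it runs on.

```c
#include <stdbool.h>

#define DEMO_NR_CPUS 4	/* hypothetical CPU count for the simulation */

/* Simplified stand-in for the driver's per-CPU context. */
struct demo_pt {
	bool event_active;	/* plays the role of pt->handle.event != NULL */
	bool tracing;		/* plays the role of tracing being enabled */
};

static struct demo_pt demo_pt_ctx[DEMO_NR_CPUS];

/* Stops tracing on one CPU only, and only if an event is attached there. */
static void demo_emergency_stop_pt(int cpu)
{
	struct demo_pt *pt = &demo_pt_ctx[cpu];

	if (pt->event_active)
		pt->tracing = false;
}

/* Returns how many CPUs are still tracing. */
static int demo_count_tracing(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (demo_pt_ctx[cpu].tracing)
			n++;
	return n;
}
```

Stopping one CPU leaves the others tracing; demo_count_tracing() only
reaches zero after the stop function has been invoked for each CPU that
had an active event.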




[PATCH v2 2/2] x86: Stop Intel PT before kdump starts

2015-11-03 Thread Takao Indoh
This patch stops Intel PT logging and saves its registers in memory
before kdump is started. This is needed to prevent Intel PT from
overwriting its log buffer after a panic, and the saved registers are
needed to find the last position where Intel PT wrote data. After the
crash dump is captured by kdump, the user can retrieve the log buffer
from the vmcore and use it to investigate kernel behavior.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/kernel/crash.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 74ca2fe..5f383d2 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include <asm/intel_pt.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -127,6 +128,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();
 
+	/*
+	 * Disable Intel PT to stop its logging
+	 */
+	cpu_emergency_stop_pt();
+
 	disable_local_APIC();
 }
 
@@ -172,6 +178,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();
 
+	/*
+	 * Disable Intel PT to stop its logging
+	 */
+	cpu_emergency_stop_pt();
+
 #ifdef CONFIG_X86_IO_APIC
 	/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
 	ioapic_zap_locks();
-- 
1.7.1
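
To show why the saved write position matters when reading the log buffer
back, here is a minimal userspace sketch. It assumes the buffer is one
contiguous circular region and that the saved position is a byte offset
into it; the real output area may be laid out differently, and
demo_linearize and its parameters are hypothetical names for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a wrapped trace buffer into dst in oldest-to-newest order.
 * buf/size describe the circular buffer, head is the saved offset at
 * which the next byte would have been written, and wrapped says whether
 * the writer had already gone around at least once. dst must hold size
 * bytes. Returns the number of valid bytes copied.
 */
static size_t demo_linearize(const unsigned char *buf, size_t size,
			     size_t head, int wrapped, unsigned char *dst)
{
	if (head > size)
		return 0;	/* saved position is not inside the buffer */

	if (!wrapped) {
		/* Only the bytes before head were ever written. */
		memcpy(dst, buf, head);
		return head;
	}

	/* Oldest data starts at head and runs to the end of the buffer... */
	memcpy(dst, buf + head, size - head);
	/* ...followed by the newest data from the start up to head. */
	memcpy(dst + (size - head), buf, head);
	return size;
}
```

Without the saved position there is no way to tell where the newest data
ends, and if tracing kept running after the panic the oldest region
would continue to be overwritten.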




[PATCH 1/2] perf/x86/intel/pt: Add interface to stop Intel PT logging

2015-10-28 Thread Takao Indoh
This patch adds a function that lets an external component stop Intel PT.
It is mainly intended for use when a kernel panic occurs. When it is
called, the intel_pt driver disables Intel PT and saves its registers using
pt_event_stop, which is also used by the pmu.stop handler. The function
stops Intel PT only on the CPU it runs on, so the caller needs to call it
on each CPU to stop all logging.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/include/asm/intel_pt.h   |   10 ++
 arch/x86/kernel/cpu/perf_event_intel_pt.c |9 +
 2 files changed, 19 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt.h

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
new file mode 100644
index 000..3bfe971
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt.h
@@ -0,0 +1,10 @@
+#ifndef _ASM_X86_INTEL_PT_H
+#define _ASM_X86_INTEL_PT_H
+
+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
+void cpu_emergency_stop_pt(void);
+#else
+void cpu_emergency_stop_pt(void) {}
+#endif
+
+#endif /* _ASM_X86_INTEL_PT_H */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 4216928..a638b5b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <asm/intel_pt.h>
 
 #include "perf_event.h"
 #include "intel_pt.h"
@@ -1125,6 +1126,14 @@ static int pt_event_init(struct perf_event *event)
 	return 0;
 }
 
+void cpu_emergency_stop_pt(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+	if (pt->handle.event)
+		pt_event_stop(pt->handle.event, PERF_EF_UPDATE);
+}
+
 static __init int pt_init(void)
 {
 	int ret, cpu, prior_warn = 0;
-- 
1.7.1




[PATCH 2/2] x86: Stop Intel PT before kdump starts

2015-10-28 Thread Takao Indoh
This patch stops Intel PT logging and saves its registers in memory
before kdump is started. This is needed to prevent Intel PT from
overwriting its log buffer after a panic, and the saved registers are
needed to find the last position where Intel PT wrote data. After the
crash dump is captured by kdump, the user can retrieve the log buffer
from the vmcore and use it to investigate kernel behavior.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/kernel/crash.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 74ca2fe..5f383d2 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include <asm/intel_pt.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -127,6 +128,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();
 
+	/*
+	 * Disable Intel PT to stop its logging
+	 */
+	cpu_emergency_stop_pt();
+
 	disable_local_APIC();
 }
 
@@ -172,6 +178,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();
 
+	/*
+	 * Disable Intel PT to stop its logging
+	 */
+	cpu_emergency_stop_pt();
+
 #ifdef CONFIG_X86_IO_APIC
 	/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
 	ioapic_zap_locks();
-- 
1.7.1




[PATCH 0/2] Stop Intel Processor Trace logging on panic

2015-10-28 Thread Takao Indoh
Hi all,

This patch series provides a feature to stop Intel Processor Trace
(Intel PT) logging and save its registers in memory on panic.

Intel PT is a new feature of the Intel "Broadwell" CPU; it captures
information about program execution flow. Here is an article about Intel
PT.
https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing

Once Intel PT is enabled, events that change program flow, such as
branch instructions, exceptions, interrupts, and traps, are logged in
memory. This is very useful for debugging because it shows the detailed
behavior of the software.

When a kernel panic occurs while you are running a perf command with
Intel PT (with the -e intel_pt// option), these patches disable Intel PT
and save its registers in memory. After the crash dump is captured by
kdump, you can retrieve the Intel PT log buffer from the vmcore and
investigate kernel behavior. I have not yet made a tool to salvage the
Intel PT log buffer from a vmcore, but I will do so once these patches
are accepted.

These patches are part of a patch series I posted before; the original
discussion is below.

x86: Intel Processor Trace Logger
v1: https://lkml.org/lkml/2015/7/29/6
v2: https://lkml.org/lkml/2015/9/8/24

The purpose of the original patches is to introduce an in-kernel logger
using Intel PT. To implement it I need to add some APIs to control perf
counters and the ring buffer in the kernel. Alexander Shishkin is working
on such APIs for his work on using Intel PT for process core dumps. Apart
from such APIs, the feature to save Intel PT registers on panic is
helpful for normal perf command users as I described above, so I have
separated this feature from the original patches.


Takao Indoh (2):
  perf/x86/intel/pt: Add interface to stop Intel PT logging
  x86: Stop Intel PT before kdump starts

 arch/x86/include/asm/intel_pt.h   |   10 ++
 arch/x86/kernel/cpu/perf_event_intel_pt.c |9 +
 arch/x86/kernel/crash.c   |   11 +++
 3 files changed, 30 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt.h




Re: [PATCH v2 3/4] perf/x86/intel/pt: Add Intel PT logger

2015-09-08 Thread Takao Indoh
On 2015/09/08 18:48, Alexander Shishkin wrote:
> Takao Indoh <indou.ta...@jp.fujitsu.com> writes:
> 
>> +/* intel_pt */
>> +static struct perf_event_attr pt_attr_pt = {
>> +	.config = 0x400, /* bit10: TSCEn */
> 
> Doesn't it make sense to make these things configurable via sysfs or
> whatnot?

That makes sense, will do.

> 
>> +static int pt_log_buf_nr_pages = 128; /* number of pages for log buffer */
> 
> Same here.
> 
>> +static struct cpumask pt_log_cpu_mask;
>> +
>> +static DEFINE_PER_CPU(struct perf_event *, pt_perf_event_pt);
>> +static DEFINE_PER_CPU(struct perf_event *, pt_perf_event_sched);
>> +static DEFINE_PER_CPU(struct perf_event *, pt_perf_event_dummy);
>> +
>> +/* Saved registers on panic */
>> +static DEFINE_PER_CPU(u64, saved_msr_ctl);
>> +static DEFINE_PER_CPU(u64, saved_msr_status);
>> +static DEFINE_PER_CPU(u64, saved_msr_output_base);
>> +static DEFINE_PER_CPU(u64, saved_msr_output_mask);
>> +
>> +void save_intel_pt_registers(void)
>> +{
>> +	int cpu = smp_processor_id();
>> +	u64 ctl;
>> +
>> +	if (!cpumask_test_cpu(cpu, &pt_log_cpu_mask))
>> +		return;
>> +
>> +	/* Save RTIT_CTL register */
>> +	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +	per_cpu(saved_msr_ctl, cpu) = ctl;
>> +
>> +	/* Stop tracing */
>> +	ctl &= ~RTIT_CTL_TRACEEN;
>> +	wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +
>> +	/* Save other registers */
>> +	rdmsrl(MSR_IA32_RTIT_STATUS, per_cpu(saved_msr_status, cpu));
>> +	rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, per_cpu(saved_msr_output_base, cpu));
>> +	rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, per_cpu(saved_msr_output_mask, cpu));
> 
> I'd really like to keep the PT msr accesses confined to the intel_pt
> driver. Maybe have a similar function there? That way you could also use
> pt_config_start() instead of clearing TraceEn by hand.
> 
> Do you need these saved msr values for the crash tool? I'm guessing
> you'd need the write pointer to figure out where the most recent data
> is. But then again, if you go the perf_event_disable() path, it'll all
> happen automatically in the driver. Or rather __perf_event_disable()
> type of thing since this is strictly cpu-local. Or even
> event::pmu::stop() would do the trick. The buffer's write head would
> then be in this_cpu_ptr(&pt_ctx)->handle.head.

Yes, what I need is the last position where the Intel PT hardware wrote
data. Once a kernel panic occurs, we should basically minimize access to
kernel data and functions because they may be broken. That is why I
touch the MSRs directly in this patch. But I agree with confining MSR
access to the intel_pt driver. Using pmu.stop() or pt_event_stop() looks
good to me.
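
For reference, the "save the control value, then clear only TraceEn"
step from the quoted code can be sketched in userspace as below. This is
an illustration under assumptions: the MSR is simulated with a plain
variable instead of rdmsrl()/wrmsrl(), the demo_* names are hypothetical,
and the bit positions (TraceEn as bit 0, TSCEn as bit 10) follow the
IA32_RTIT_CTL layout.

```c
#include <stdint.h>

/* Bit positions in IA32_RTIT_CTL; 0x400 matches pt_attr_pt.config above. */
#define DEMO_RTIT_CTL_TRACEEN	(1ULL << 0)
#define DEMO_RTIT_CTL_TSC_EN	(1ULL << 10)

struct demo_saved_pt {
	uint64_t ctl;	/* control value as it was at panic time */
};

/* Simulated MSR; a real driver would use rdmsrl()/wrmsrl(). */
static uint64_t demo_msr_rtit_ctl;

static void demo_save_and_stop(struct demo_saved_pt *saved)
{
	uint64_t ctl = demo_msr_rtit_ctl;

	saved->ctl = ctl;			/* keep the pre-stop value */
	ctl &= ~DEMO_RTIT_CTL_TRACEEN;		/* clear only TraceEn */
	demo_msr_rtit_ctl = ctl;		/* other bits stay as they were */
}
```

The saved copy still has TraceEn set, which records that tracing was
active at panic time, while the live value keeps every other
configuration bit.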

Thanks,
Takao Indoh


> 
> Thanks,
> --
> Alex
> 




Re: [PATCH v2 2/4] perf: Add function to enable perf events in kernel with ring buffer

2015-09-08 Thread Takao Indoh
On 2015/09/08 18:32, Alexander Shishkin wrote:
> Takao Indoh <indou.ta...@jp.fujitsu.com> writes:
> 
>> perf_event_create_kernel_counter is used to enable perf events in kernel
>> without buffer for logging its events. This patch adds a new function which
>> enables perf events with a ring buffer. Intel PT logger uses this to enable
>> Intel PT and some associated events with its log buffer.
> 
> Have you seen [1] and related patches? I haven't gotten around to
> updating them yet, but hopefully it's going to happen soon.
> 
> The problem is that for such api to work, this memory needs to be
> accounted, especially when you start handling event inheritance. For
> system crash dump it doesn't really matter, but I also need a similar
> api for per-task core dumps, for example.

I have not seen this; I'll check it. Are you or someone else working on an
API for process core dumps?

Thanks,
Takao Indoh

> 
> [1] https://lkml.org/lkml/2014/10/13/290
> 
> Thanks,
> --
> Alex
> 


Re: [PATCH v2 3/4] perf/x86/intel/pt: Add Intel PT logger

2015-09-08 Thread Takao Indoh
On 2015/09/08 18:48, Alexander Shishkin wrote:
> Takao Indoh <indou.ta...@jp.fujitsu.com> writes:
> 
>> +/* intel_pt */
>> +static struct perf_event_attr pt_attr_pt = {
>> +.config = 0x400, /* bit10: TSCEn */
> 
> Doesn't it make sense to make these things configurable via sysfs or
> whatnot?

That makes sense, will do.

> 
>> +static int pt_log_buf_nr_pages = 128; /* number of pages for log buffer */
> 
> Same here.
> 
>> +static struct cpumask pt_log_cpu_mask;
>> +
>> +static DEFINE_PER_CPU(struct perf_event *, pt_perf_event_pt);
>> +static DEFINE_PER_CPU(struct perf_event *, pt_perf_event_sched);
>> +static DEFINE_PER_CPU(struct perf_event *, pt_perf_event_dummy);
>> +
>> +/* Saved registers on panic */
>> +static DEFINE_PER_CPU(u64, saved_msr_ctl);
>> +static DEFINE_PER_CPU(u64, saved_msr_status);
>> +static DEFINE_PER_CPU(u64, saved_msr_output_base);
>> +static DEFINE_PER_CPU(u64, saved_msr_output_mask);
>> +
>> +void save_intel_pt_registers(void)
>> +{
>> +	int cpu = smp_processor_id();
>> +	u64 ctl;
>> +
>> +	if (!cpumask_test_cpu(cpu, &pt_log_cpu_mask))
>> +		return;
>> +
>> +	/* Save RTIT_CTL register */
>> +	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +	per_cpu(saved_msr_ctl, cpu) = ctl;
>> +
>> +	/* Stop tracing */
>> +	ctl &= ~RTIT_CTL_TRACEEN;
>> +	wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +
>> +	/* Save other registers */
>> +	rdmsrl(MSR_IA32_RTIT_STATUS, per_cpu(saved_msr_status, cpu));
>> +	rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, per_cpu(saved_msr_output_base, cpu));
>> +	rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, per_cpu(saved_msr_output_mask, cpu));
> 
> I'd really like to keep the PT msr accesses confined to the intel_pt
> driver. Maybe have a similar function there? That way you could also use
> pt_config_start() instead of clearing TraceEn by hand.
> 
> Do you need these saved msr values for the crash tool? I'm guessing
> you'd need the write pointer to figure out where the most recent data
> is. But then again, if you go the perf_event_disable() path, it'll all
> happen automatically in the driver. Or rather __perf_event_disable()
> type of thing since this is strictly cpu-local. Or even
> event::pmu::stop() would do the trick. The buffer's write head would
> then be in this_cpu_ptr(&pt_ctx)->handle.head.

Yes, what I need is the last position where the Intel PT hardware wrote
data. Once a kernel panic occurs, we should basically minimize access
to kernel data and functions because they may be broken. That is why I
touch the MSRs directly in this patch. But I agree that MSR access should
be confined to the intel_pt driver. Using pmu.stop() or pt_event_stop()
looks good to me.

Thanks,
Takao Indoh


> 
> Thanks,
> --
> Alex
> 
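
To make the "last position" point in this exchange concrete, here is a
minimal user-space sketch of how a saved RTIT_OUTPUT_MASK value could be
turned into a write position. It is not code from the patch. It assumes
ToPA output and the register layout that the in-tree intel_pt driver
decodes (bits 31:7 hold the index of the current ToPA entry, bits 63:32
hold the offset inside that entry's output region). The struct, the
function names, and the "all regions have the same size" simplification
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the decoded write position. */
struct pt_write_pos {
	uint32_t topa_idx;   /* index of the current ToPA table entry */
	uint32_t output_off; /* byte offset inside that entry's region */
};

/*
 * Decode a saved RTIT_OUTPUT_MASK value.
 * Assumed layout (ToPA mode): bits 6:0 read as 1, bits 31:7 hold the
 * table offset, bits 63:32 hold the output offset.
 */
struct pt_write_pos pt_decode_output_mask(uint64_t mask)
{
	struct pt_write_pos pos;

	pos.topa_idx = (uint32_t)((mask & 0xffffff80ULL) >> 7);
	pos.output_off = (uint32_t)(mask >> 32);
	return pos;
}

/*
 * Byte offset of the write position from the start of the log buffer,
 * under the simplifying assumption that every ToPA entry points at a
 * region of region_size bytes.
 */
uint64_t pt_linear_offset(struct pt_write_pos pos, uint32_t region_size)
{
	assert(region_size > 0);
	assert(pos.output_off < region_size);
	return (uint64_t)pos.topa_idx * region_size + pos.output_off;
}
```

A crash-analysis tool would start reading the trace just after this
offset to get the oldest surviving data and stop at the offset itself.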



[PATCH v2 2/4] perf: Add function to enable perf events in kernel with ring buffer

2015-09-07 Thread Takao Indoh
perf_event_create_kernel_counter is used to enable perf events in the kernel
without a buffer for logging their events. This patch adds a new function that
enables perf events with a ring buffer. The Intel PT logger uses it to enable
Intel PT and some associated events with a log buffer.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 include/linux/perf_event.h |   10 ++
 kernel/events/core.c   |   70 ---
 2 files changed, 75 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..34ada8c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -657,6 +657,16 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
struct task_struct *task,
perf_overflow_handler_t callback,
void *context);
+extern struct perf_event *
+perf_event_create_kernel_counter_with_buffer(struct perf_event_attr *attr,
+   int cpu,
+   struct task_struct *task,
+   perf_overflow_handler_t callback,
+   void *context,
+   int flags,
+   int nr_pages,
+   int nr_pages_aux,
+   struct perf_event *output_event);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
 extern u64 perf_event_read_value(struct perf_event *event,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ae16867..c9d8a59 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8356,21 +8356,33 @@ err_fd:
 }
 
 /**
- * perf_event_create_kernel_counter
+ * perf_event_create_kernel_counter_with_buffer
  *
  * @attr: attributes of the counter to create
  * @cpu: cpu in which the counter is bound
  * @task: task to profile (NULL for percpu)
+ * @overflow_handler: handler for overflow event
+ * @context: target context
+ * @flags: flags of ring buffer
+ * @nr_pages: size (number of pages) of buffer
+ * @nr_pages_aux: size (number of pages) of aux buffer
+ * @output_event: event to be attached
  */
 struct perf_event *
-perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
-struct task_struct *task,
-perf_overflow_handler_t overflow_handler,
-void *context)
+perf_event_create_kernel_counter_with_buffer(struct perf_event_attr *attr,
+   int cpu,
+   struct task_struct *task,
+   perf_overflow_handler_t overflow_handler,
+   void *context,
+   int flags,
+   int nr_pages,
+   int nr_pages_aux,
+   struct perf_event *output_event)
 {
struct perf_event_context *ctx;
struct perf_event *event;
int err;
+   struct ring_buffer *rb = NULL;
 
/*
 * Get the target context (task or percpu):
@@ -8383,6 +8395,31 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
goto err;
}
 
+   if (output_event) {
+   err = perf_event_set_output(event, output_event);
+   if (err)
+   goto err_free;
+   } else if (nr_pages) {
+   rb = rb_alloc(nr_pages,
+ event->attr.watermark ? event->attr.wakeup_watermark : 0,
+ event->cpu, flags);
+
+   if (!rb) {
+   err = -ENOMEM;
+   goto err_free;
+   }
+
+   ring_buffer_attach(event, rb);
+
+   if (nr_pages_aux) {
+   err = rb_alloc_aux(rb, event, 0, nr_pages_aux,
+  event->attr.aux_watermark, flags);
+
+   if (err)
+   goto err_free;
+   }
+   }
+
/* Mark owner so we could distinguish it from user events. */
event->owner = EVENT_OWNER_KERNEL;
 
@@ -8411,10 +8448,33 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
return event;
 
 err_free:
+   if (rb && rb->aux_pages)
+   rb_free_aux(rb);
+   if (rb)
+   rb_free(rb);
free_event(event);
 err:
return ERR_PTR(err);
 }
+EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter_with_buffer);
+
+/**
+ * perf_event_create_kernel_counter
+ *
+ * @attr: attributes of the counter to create
+ * @cpu: cpu in which the counter is bound
+ * @task: task to profile (NULL for percpu)
+ */
+struct perf_event *
+perf_e
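
The error handling in the hunk above (allocate the ring buffer, then the
optional aux buffer, and undo both at err_free) follows the usual
goto-unwind pattern. The following is a user-space sketch of that
pattern only, not the kernel code: the demo_* names, the page cap, and
the use of calloc() in place of rb_alloc()/rb_alloc_aux() are all
assumptions made so the example can be compiled and run on its own.
Unlike the patch, the sketch treats nr_pages == 0 as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096
#define DEMO_MAX_PAGES 1024 /* hypothetical cap, lets a failure be exercised */

struct demo_rb {
	void *data; /* main buffer */
	void *aux;  /* optional aux buffer, NULL when absent */
};

static void *demo_alloc_pages(int nr_pages)
{
	if (nr_pages <= 0 || nr_pages > DEMO_MAX_PAGES)
		return NULL;
	return calloc((size_t)nr_pages, DEMO_PAGE_SIZE);
}

/* Returns 0 and fills *out on success, a negative errno on failure. */
int demo_rb_create(int nr_pages, int nr_pages_aux, struct demo_rb **out)
{
	struct demo_rb *rb;
	int err;

	assert(out != NULL);
	*out = NULL;

	rb = calloc(1, sizeof(*rb));
	if (!rb)
		return -ENOMEM;

	rb->data = demo_alloc_pages(nr_pages);
	if (!rb->data) {
		err = -ENOMEM;
		goto err_free_rb;
	}

	if (nr_pages_aux) {
		rb->aux = demo_alloc_pages(nr_pages_aux);
		if (!rb->aux) {
			err = -ENOMEM;
			goto err_free_data;
		}
	}

	*out = rb;
	return 0;

	/* Undo the steps in reverse order of allocation. */
err_free_data:
	free(rb->data);
err_free_rb:
	free(rb);
	return err;
}

void demo_rb_destroy(struct demo_rb *rb)
{
	if (!rb)
		return;
	free(rb->aux);
	free(rb->data);
	free(rb);
}
```

One label per acquired resource keeps each failure path from freeing
something that was never allocated.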

[PATCH v2 4/4] x86: Stop Intel PT and save its registers when panic occurs

2015-09-07 Thread Takao Indoh
When a panic occurs, Intel PT logging is stopped to prevent it from
overwriting its log buffer. The Intel PT registers are saved in memory on
panic; a debugger needs them to find the last position where Intel PT
wrote data.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/kernel/crash.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..78deceb 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -127,6 +128,10 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
disable_local_APIC();
 }
 
@@ -172,6 +177,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
 #ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
ioapic_zap_locks();
-- 
1.7.1



[PATCH v2 0/4] x86: Intel Processor Trace Logger

2015-09-07 Thread Takao Indoh
Hi all,

This patch series provides a logging feature for Intel Processor Trace
(Intel PT).

Intel PT is a new feature of the Intel "Broadwell" CPU; it captures
information about program execution flow. Here is an article about Intel
PT:
https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing

Once Intel PT is enabled, events that change program flow, such as
branch instructions, exceptions, interrupts, traps and so on, are
logged in memory. This is very useful for debugging because it shows
the detailed behavior of software.

This series creates a log buffer for Intel PT and enables logging at boot
time. When a kernel panic occurs, we can get this log buffer from the
crash dump file produced by kdump and reconstruct the flow that led to
the panic.

changelog:
v2:
- Reimplement using perf_event_create_kernel_counter

v1:
https://lkml.org/lkml/2015/7/29/6

Takao Indoh (4):
  perf/trace: Add function to find event type by name
  perf: Add function to enable perf events in kernel with ring buffer
  perf/x86/intel/pt: Add Intel PT logger
  x86: Stop Intel PT and save its registers when panic occurs

 arch/x86/Kconfig  |   16 +++
 arch/x86/include/asm/intel_pt_log.h   |   13 ++
 arch/x86/kernel/cpu/Makefile  |2 +
 arch/x86/kernel/cpu/intel_pt_log.c|  178 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c |6 +
 arch/x86/kernel/crash.c   |9 ++
 include/linux/perf_event.h|   10 ++
 include/linux/trace_events.h  |2 +
 kernel/events/core.c  |   70 +++-
 kernel/trace/trace_event_perf.c   |   22 
 10 files changed, 323 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt_log.h
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c



[PATCH v2 3/4] perf/x86/intel/pt: Add Intel PT logger

2015-09-07 Thread Takao Indoh
This patch provides an Intel PT logging feature. When the system boots with
the parameter "intel_pt_log", log buffers for Intel PT are allocated and
logging starts; processor flow information is then written to the log
buffer by hardware, like a flight recorder. This is very helpful for
investigating the cause of a kernel panic.

The log buffer size is specified by the parameter
"intel_pt_log_buf_len=". The buffer is used as a circular buffer,
so old events are overwritten by new events.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/Kconfig  |   16 +++
 arch/x86/include/asm/intel_pt_log.h   |   13 ++
 arch/x86/kernel/cpu/Makefile  |2 +
 arch/x86/kernel/cpu/intel_pt_log.c|  178 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c |6 +
 5 files changed, 215 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt_log.h
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f37010f..2b99ba2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1722,6 +1722,22 @@ config X86_INTEL_MPX
 
  If unsure, say N.
 
+config X86_INTEL_PT_LOG
+   prompt "Intel PT logger"
+   def_bool n
+   depends on PERF_EVENTS && CPU_SUP_INTEL
+   ---help---
+ Intel PT is a hardware feature that can capture information
+ about program execution flow. Once Intel PT is enabled, the
+ events which change program flow, like branch instructions,
+ exceptions, interrupts, traps and so on are logged in
+ memory.
+
+ This option enables starting the Intel PT logging feature at boot
+ time. When a kernel panic occurs, the Intel PT log buffer can be
+ retrieved from the crash dump file and used to reconstruct the
+ detailed flow that led to the panic.
+
 config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/include/asm/intel_pt_log.h b/arch/x86/include/asm/intel_pt_log.h
new file mode 100644
index 000..cef63f7
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt_log.h
@@ -0,0 +1,13 @@
+#ifndef __INTEL_PT_LOG_H__
+#define __INTEL_PT_LOG_H__
+
+#if defined(CONFIG_X86_INTEL_PT_LOG)
+
+#include 
+
+void pt_log_start(struct pmu *pmu);
+void save_intel_pt_registers(void);
+
+#endif
+
+#endif /* __INTEL_PT_LOG_H__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4eb065c..67c17f0 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -48,6 +48,8 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)+= perf_event_intel_uncore.o \
   perf_event_intel_uncore_nhmex.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_msr.o
 obj-$(CONFIG_CPU_SUP_AMD)  += perf_event_msr.o
+
+obj-$(CONFIG_X86_INTEL_PT_LOG) += intel_pt_log.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/intel_pt_log.c b/arch/x86/kernel/cpu/intel_pt_log.c
new file mode 100644
index 000..eb345fd
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_log.c
@@ -0,0 +1,178 @@
+/*
+ * Intel Processor Trace Logger
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+
+#define SAMPLE_TYPE_BASE \
+   (PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_IDENTIFIER)
+#define SAMPLE_TYPE_PT \
+   (SAMPLE_TYPE_BASE|PERF_SAMPLE_CPU|PERF_SAMPLE_RAW)
+#define SAMPLE_TYPE_SCHED \
+   (SAMPLE_TYPE_BASE|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW)
+#define SAMPLE_TYPE_DUMMY \
+   (SAMPLE_TYPE_BASE)
+
+/* intel_pt */
+static struct perf_event_attr pt_attr_pt = {
+   .config = 0x400, /* bit10: TSCEn */
+   .size   = sizeof(struct perf_event_attr),
+   .sample_type= SAMPLE_TYPE_PT,
+   .read_format= PERF_FORMAT_ID,
+   .inherit= 1,
+   .pinned = 1,
+   .sample_id_all  = 1,
+   .exclude_guest  = 1
+};
+
+/* sched:sched_switch */
+static struct perf_event_attr pt_attr_sched = {
+   .type   = PERF_TYPE_TRACEPOINT,
+   .size   = sizeof(struct perf_event_attr),
+   .sample_type= SAMPLE_TYPE_SCHED,
+   .read_format= PERF_FORMAT_ID,
+   .inherit= 1,
+   .sample_id_all  = 1,
+   .exclude_guest  = 1
+};
+
+/* dummy:u */
+static struct perf_event_attr pt_attr_dummy = {
+   .type   = PERF_TYPE_SOFTWARE,
+   .config = PERF_COUNT_SW_DUMMY,
+   .size   = sizeof(struct perf_event_attr),
+   .sample_type= SAMPLE_TYPE_DUMMY,
+   .read_format= PERF_FORMAT_ID,
+   .inherit= 1,
+   .exclude_kernel = 1,
+   .exclude_hv = 1,
+   .comm   = 1,
+   .task   = 1,
+   .sample_id_all  = 1,
+   .comm_exec  = 1
+};
+
+static int pt_log_enabled;
+static int pt_log_buf_nr_pages = 128; /* number of page
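
The "flight recorder" behavior described in the commit message, where
new events overwrite the oldest ones, can be illustrated with a small
user-space ring buffer. This is only a sketch of the idea: in the patch
the hardware writes the trace and the buffer is made of pages, while
here a tiny byte array and the hypothetical demo_log_* helpers are used
so the wrap-around is easy to follow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_LOG_SIZE 8 /* tiny on purpose, so wrapping is easy to see */

struct demo_log {
	unsigned char buf[DEMO_LOG_SIZE];
	size_t head; /* next write position */
	int wrapped; /* set once old data has been overwritten */
};

/* Append len bytes, overwriting the oldest data when the buffer is full. */
void demo_log_write(struct demo_log *log, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t i;

	assert(log != NULL && data != NULL);
	for (i = 0; i < len; i++) {
		log->buf[log->head] = p[i];
		log->head = (log->head + 1) % DEMO_LOG_SIZE;
		if (log->head == 0)
			log->wrapped = 1;
	}
}

/*
 * Copy the surviving bytes, oldest first, into out (which must hold at
 * least DEMO_LOG_SIZE bytes). Returns the number of bytes copied.
 */
size_t demo_log_snapshot(const struct demo_log *log, unsigned char *out)
{
	size_t n = 0;

	assert(log != NULL && out != NULL);
	if (log->wrapped) {
		/* The oldest surviving byte sits at head. */
		n = DEMO_LOG_SIZE - log->head;
		memcpy(out, log->buf + log->head, n);
	}
	memcpy(out + n, log->buf, log->head);
	return n + log->head;
}
```

This is also why the write position has to be known after a panic:
without it, a reader cannot tell where the oldest data ends and the
newest begins.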

[PATCH v2 1/4] perf/trace: Add function to find event type by name

2015-09-07 Thread Takao Indoh
This patch adds a function that finds a struct trace_event by event name, such
as "sched_switch", and returns its type so that the Intel PT logger can enable
the trace event in the kernel. The Intel PT logger needs this because it uses
sched_switch tracing to collect side-band data.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 include/linux/trace_events.h|2 ++
 kernel/trace/trace_event_perf.c |   22 ++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index ed27917..d3cae4b 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -616,6 +616,8 @@ perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
 {
perf_tp_event(addr, count, raw_data, size, regs, head, rctx, task);
 }
+
+int perf_trace_event_get_type_by_name(char *system, char *name);
 #endif
 
 #endif /* _LINUX_TRACE_EVENT_H */
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index abfc903..1a851d5 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -21,6 +21,28 @@ typedef typeof(unsigned long [PERF_MAX_TRACE_SIZE / sizeof(unsigned long)])
 /* Count the events in use (per event id, not per instance) */
 static int total_ref_count;
 
+int perf_trace_event_get_type_by_name(char *system, char *name)
+{
+   struct trace_event_call *tp_event;
+   int ret = 0;
+   /*
+* All type is larger than __TRACE_LAST_TYPE + 1. Therefore return zero
+* as a invalid type if not found.
+*/
+
+   mutex_lock(&event_mutex);
+   list_for_each_entry(tp_event, &ftrace_events, list) {
+   if (!strcmp(tp_event->class->system, system) &&
+   !strcmp(trace_event_name(tp_event), name)) {
+   ret = tp_event->event.type;
+   break;
+   }
+   }
+   mutex_unlock(&event_mutex);
+
+   return ret;
+}
+
 static int perf_trace_event_perm(struct trace_event_call *tp_event,
 struct perf_event *p_event)
 {
-- 
1.7.1
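
The lookup added by this patch returns the event type for a
(system, name) pair and uses zero as the "not found" value. A
user-space sketch of that contract is shown below. The table contents
and type numbers are made up for illustration, and the demo_* names are
hypothetical; the kernel version walks the registered event list under
a mutex instead of a static array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_event {
	const char *system;
	const char *name;
	int type; /* always non-zero for a registered event */
};

/* Hypothetical event table; the type numbers are invented. */
static const struct demo_event demo_events[] = {
	{ "sched", "sched_switch",      301 },
	{ "sched", "sched_wakeup",      302 },
	{ "irq",   "irq_handler_entry", 310 },
};

/* Return the type of the named event, or 0 if no such event exists. */
int demo_event_type_by_name(const char *system, const char *name)
{
	size_t i;

	assert(system != NULL && name != NULL);
	for (i = 0; i < sizeof(demo_events) / sizeof(demo_events[0]); i++) {
		if (!strcmp(demo_events[i].system, system) &&
		    !strcmp(demo_events[i].name, name))
			return demo_events[i].type;
	}
	return 0; /* zero is never a valid type, so it signals "not found" */
}
```

A caller would place the returned value in the config field of a
PERF_TYPE_TRACEPOINT perf_event_attr, after checking that it is not
zero.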


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 3/4] perf/x86/intel/pt: Add Intel PT logger

2015-09-07 Thread Takao Indoh
This patch provides Intel PT logging feature. When system boots with a
parameter "intel_pt_log", log buffers for Intel PT are allocated and
logging starts, then processor flow information is written in the log
buffer by hardware like flight recorder. This is very helpful to
investigate a cause of kernel panic.

The log buffer size is specified by the parameter
"intel_pt_log_buf_len=". This buffer is used as circular buffer,
therefore old events are overwritten by new events.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/Kconfig  |   16 +++
 arch/x86/include/asm/intel_pt_log.h   |   13 ++
 arch/x86/kernel/cpu/Makefile  |2 +
 arch/x86/kernel/cpu/intel_pt_log.c|  178 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c |6 +
 5 files changed, 215 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt_log.h
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f37010f..2b99ba2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1722,6 +1722,22 @@ config X86_INTEL_MPX
 
  If unsure, say N.
 
+config X86_INTEL_PT_LOG
+   prompt "Intel PT logger"
+   def_bool n
+   depends on PERF_EVENTS && CPU_SUP_INTEL
+   ---help---
+ Intel PT is a hardware features that can capture information
+ about program execution flow. Once Intel PT is enabled, the
+ events which change program flow, like branch instructions,
+ exceptions, interruptions, traps and so on are logged in
+ the memory.
+
+ This option enables starting Intel PT logging feature at boot
+ time. When kernel panic occurs, Intel PT log buffer can be
+ retrieved from crash dump file and enables to reconstruct the
+ detailed flow that led to the panic.
+
 config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/include/asm/intel_pt_log.h 
b/arch/x86/include/asm/intel_pt_log.h
new file mode 100644
index 000..cef63f7
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt_log.h
@@ -0,0 +1,13 @@
+#ifndef __INTEL_PT_LOG_H__
+#define __INTEL_PT_LOG_H__
+
+#if defined(CONFIG_X86_INTEL_PT_LOG)
+
+#include 
+
+void pt_log_start(struct pmu *pmu);
+void save_intel_pt_registers(void);
+
+#endif
+
+#endif /* __INTEL_PT_LOG_H__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4eb065c..67c17f0 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -48,6 +48,8 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)+= 
perf_event_intel_uncore.o \
   perf_event_intel_uncore_nhmex.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_msr.o
 obj-$(CONFIG_CPU_SUP_AMD)  += perf_event_msr.o
+
+obj-$(CONFIG_X86_INTEL_PT_LOG) += intel_pt_log.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/intel_pt_log.c 
b/arch/x86/kernel/cpu/intel_pt_log.c
new file mode 100644
index 000..eb345fd
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_log.c
@@ -0,0 +1,178 @@
+/*
+ * Intel Processor Trace Logger
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+
+#define SAMPLE_TYPE_BASE \
+   (PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_IDENTIFIER)
+#define SAMPLE_TYPE_PT \
+   (SAMPLE_TYPE_BASE|PERF_SAMPLE_CPU|PERF_SAMPLE_RAW)
+#define SAMPLE_TYPE_SCHED \
+   (SAMPLE_TYPE_BASE|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW)
+#define SAMPLE_TYPE_DUMMY \
+   (SAMPLE_TYPE_BASE)
+
+/* intel_pt */
+static struct perf_event_attr pt_attr_pt = {
+   .config = 0x400, /* bit10: TSCEn */
+   .size   = sizeof(struct perf_event_attr),
+   .sample_type= SAMPLE_TYPE_PT,
+   .read_format= PERF_FORMAT_ID,
+   .inherit= 1,
+   .pinned = 1,
+   .sample_id_all  = 1,
+   .exclude_guest  = 1
+};
+
+/* sched:sched_switch */
+static struct perf_event_attr pt_attr_sched = {
+   .type   = PERF_TYPE_TRACEPOINT,
+   .size   = sizeof(struct perf_event_attr),
+   .sample_type= SAMPLE_TYPE_SCHED,
+   .read_format= PERF_FORMAT_ID,
+   .inherit= 1,
+   .sample_id_all  = 1,
+   .exclude_guest  = 1
+};
+
+/* dummy:u */
+static struct perf_event_attr pt_attr_dummy = {
+   .type   = PERF_TYPE_SOFTWARE,
+   .config = PERF_COUNT_SW_DUMMY,
+   .size   = sizeof(struct perf_event_attr),
+   .sample_type= SAMPLE_TYPE_DUMMY,
+   .read_format= PERF_FORMAT_ID,
+   .inherit= 1,
+   .exclude_kernel = 1,
+   .exclude_hv = 1,
+   .comm   = 1,
+   .task   = 1,
+   .sample_id_all  = 1,
+   .comm_exec  = 1
+};
+
+static int pt_log_enabled;
+static int pt_log_

[PATCH v2 4/4] x86: Stop Intel PT and save its registers when panic occurs

2015-09-07 Thread Takao Indoh
When panic occurs, Intel PT logging is stopped to prevent it from
overwrite its log buffer. The registers of Intel PT are saved in the
memory on panic, they are needed for debugger to find the last position
where Intel PT wrote data.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/kernel/crash.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..78deceb 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -127,6 +128,10 @@ static void kdump_nmi_callback(int cpu, struct pt_regs 
*regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
disable_local_APIC();
 }
 
@@ -172,6 +177,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
 #ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
ioapic_zap_locks();
-- 
1.7.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 1/4] perf/trace: Add function to find event type by name

2015-09-07 Thread Takao Indoh
This patch adds function to find struct trace_event by event name like
"sched_switch" , and return its type so that Intel PT logger can enable
the trace event in kernel. Intel PT logger needs this because it needs
sched_switch tracing to collect side-band data.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 include/linux/trace_events.h|2 ++
 kernel/trace/trace_event_perf.c |   22 ++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index ed27917..d3cae4b 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -616,6 +616,8 @@ perf_trace_buf_submit(void *raw_data, int size, int rctx, 
u64 addr,
 {
perf_tp_event(addr, count, raw_data, size, regs, head, rctx, task);
 }
+
+int perf_trace_event_get_type_by_name(char *system, char *name);
 #endif
 
 #endif /* _LINUX_TRACE_EVENT_H */
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index abfc903..1a851d5 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -21,6 +21,28 @@ typedef typeof(unsigned long [PERF_MAX_TRACE_SIZE / 
sizeof(unsigned long)])
 /* Count the events in use (per event id, not per instance) */
 static int total_ref_count;
 
+int perf_trace_event_get_type_by_name(char *system, char *name)
+{
+   struct trace_event_call *tp_event;
+   int ret = 0;
+   /*
+* All type is larger than __TRACE_LAST_TYPE + 1. Therefore return zero
+* as a invalid type if not found.
+*/
+
+   mutex_lock(_mutex);
+   list_for_each_entry(tp_event, _events, list) {
+   if (!strcmp(tp_event->class->system, system) &&
+   !strcmp(trace_event_name(tp_event), name)) {
+   ret = tp_event->event.type;
+   break;
+   }
+   }
+   mutex_unlock(_mutex);
+
+   return ret;
+}
+
 static int perf_trace_event_perm(struct trace_event_call *tp_event,
 struct perf_event *p_event)
 {
-- 
1.7.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 0/4] x86: Intel Processor Trace Logger

2015-09-07 Thread Takao Indoh
Hi all,

These patch series provide logging feature for Intel Processor Trace
(Intel PT).

Intel PT is a new feature of Intel CPU "Broadwell", it captures
information about program execution flow. Here is a article about Intel
PT.
https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing

Once Intel PT is enabled, the events which change program flow, like
branch instructions, exceptions, interruptions, traps and so on are
logged in the memory. This is very useful for debugging because we can
know the detailed behavior of software.

This patch creates log buffer for Intel PT and enable logging at boot
time. When kernel panic occurs, we can get this log buffer from
crashdump file by kdump, and reconstruct the flow that led to the panic.

changelog:
v2:
- Reimplement using perf_event_create_kernel_counter

v1:
https://lkml.org/lkml/2015/7/29/6

Takao Indoh (4):
  perf/trace: Add function to find event type by name
  perf: Add function to enable perf events in kernel with ring buffer
  perf/x86/intel/pt: Add Intel PT logger
  x86: Stop Intel PT and save its registers when panic occurs

 arch/x86/Kconfig  |   16 +++
 arch/x86/include/asm/intel_pt_log.h   |   13 ++
 arch/x86/kernel/cpu/Makefile  |2 +
 arch/x86/kernel/cpu/intel_pt_log.c|  178 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c |6 +
 arch/x86/kernel/crash.c   |9 ++
 include/linux/perf_event.h|   10 ++
 include/linux/trace_events.h  |2 +
 kernel/events/core.c  |   70 +++-
 kernel/trace/trace_event_perf.c   |   22 
 10 files changed, 323 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt_log.h
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 2/4] perf: Add function to enable perf events in kernel with ring buffer

2015-09-07 Thread Takao Indoh
perf_event_create_kernel_counter is used to enable perf events in kernel
without buffer for logging its events. This patch add new fucntion which
enable perf events with ring buffer. Intel PT logger uses this to enable
Intel PT and some associated events with its log buffer.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 include/linux/perf_event.h |   10 ++
 kernel/events/core.c   |   70 ---
 2 files changed, 75 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..34ada8c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -657,6 +657,16 @@ perf_event_create_kernel_counter(struct perf_event_attr 
*attr,
struct task_struct *task,
perf_overflow_handler_t callback,
void *context);
+extern struct perf_event *
+perf_event_create_kernel_counter_with_buffer(struct perf_event_attr *attr,
+   int cpu,
+   struct task_struct *task,
+   perf_overflow_handler_t callback,
+   void *context,
+   int flags,
+   int nr_pages,
+   int nr_pages_aux,
+   struct perf_event *output_event);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
 extern u64 perf_event_read_value(struct perf_event *event,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ae16867..c9d8a59 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8356,21 +8356,33 @@ err_fd:
 }
 
 /**
- * perf_event_create_kernel_counter
+ * perf_event_create_kernel_counter_with_buffer
  *
  * @attr: attributes of the counter to create
  * @cpu: cpu in which the counter is bound
  * @task: task to profile (NULL for percpu)
+ * @overflow_handler: handler for overflow event
+ * @context: target context
+ * @flags: flags of ring buffer
+ * @nr_pages: size (number of pages) of buffer
+ * @nr_pages_aux: size (number of pages) of aux buffer
+ * @output_event: event to be attached
  */
 struct perf_event *
-perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
-struct task_struct *task,
-perf_overflow_handler_t overflow_handler,
-void *context)
+perf_event_create_kernel_counter_with_buffer(struct perf_event_attr *attr,
+   int cpu,
+   struct task_struct *task,
+   perf_overflow_handler_t overflow_handler,
+   void *context,
+   int flags,
+   int nr_pages,
+   int nr_pages_aux,
+   struct perf_event *output_event)
 {
struct perf_event_context *ctx;
struct perf_event *event;
int err;
+   struct ring_buffer *rb = NULL;
 
/*
 * Get the target context (task or percpu):
@@ -8383,6 +8395,31 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
goto err;
}
 
+   if (output_event) {
+   err = perf_event_set_output(event, output_event);
+   if (err)
+   goto err_free;
+   } else if (nr_pages) {
+   rb = rb_alloc(nr_pages,
+ event->attr.watermark ? event->attr.wakeup_watermark : 0,
+ event->cpu, flags);
+
+   if (!rb) {
+   err = -ENOMEM;
+   goto err_free;
+   }
+
+   ring_buffer_attach(event, rb);
+
+   if (nr_pages_aux) {
+   err = rb_alloc_aux(rb, event, 0, nr_pages_aux,
+  event->attr.aux_watermark, flags);
+
+   if (err)
+   goto err_free;
+   }
+   }
+
/* Mark owner so we could distinguish it from user events. */
event->owner = EVENT_OWNER_KERNEL;
 
@@ -8411,10 +8448,33 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
return event;
 
 err_free:
+   if (rb && rb->aux_pages)
+   rb_free_aux(rb);
+   if (rb)
+   rb_free(rb);
free_event(event);
 err:
return ERR_PTR(err);
 }
+EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter_with_buffer);
+
+/**
+ * perf_event_create_kernel_counter
+ *
+ * @attr: attributes of the counter to create
+ * @cpu: cpu in which the counter is bound
+ * @task: task to profile (NULL for

Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-08-26 Thread Takao Indoh
On 2015/07/29 15:08, Alexander Shishkin wrote:
> Takao Indoh  writes:
> 
>> This patch provides Intel PT logging feature. When system boots with a
>> parameter "intel_pt_log", log buffers for Intel PT are allocated and
>> logging starts, then processor flow information is written in the log
>> buffer by hardware like flight recorder. This is very helpful to
>> investigate a cause of kernel panic.
>>
>> The log buffer size is specified by the parameter
>> "intel_pt_log_buf_len=". This buffer is used as circular buffer,
>> therefore old events are overwritten by new events.
> 
> [skip]
> 
>> +static void enable_pt(int enable)
>> +{
>> +u64 ctl;
>> +
>> +rdmsrl(MSR_IA32_RTIT_CTL, ctl);
> 
> Ideally, you shouldn't need this rdmsr(), because in this code you
> should know exactly which ctl bits you need set when you enable.
> 
>> +
>> +if (enable)
>> +ctl |= RTIT_CTL_TRACEEN;
>> +else
>> +ctl &= ~RTIT_CTL_TRACEEN;
>> +
>> +wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +}
> 
> But the bigger problem with this approach is that it duplicates the
> existing driver's functionality and some of the code, which just makes
> it harder to maintain, among other things.
> 
> Instead, we should be able to use the existing perf functionality to
> enable the system-wide tracing, so that it goes through the
> driver. Another thing to remember is that you'd also need some of the
> sideband data (vm mappings, context switches) to be able to properly
> decode the trace, which also can come from perf. And it'd also be much
> less code. The only missing piece is the code that would allocate the
> ring buffer for such events.

Alexander,

I checked the perf code to find out what kind of information is needed as
side-band data. It seems that the following two events are used:
 - sched:sched_switch
 - dummy (PERF_COUNT_SW_DUMMY)

So what I need to do is add kernel counters for three events
(intel_pt, sched:sched_switch, dummy). Is my understanding correct?

Thanks,
Takao Indoh

> 
> Something like:
> 
> static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);
> 
> static struct perf_event_attr perf_kdump_attr;
> 
> ...
> 
> static int perf_kdump_init(void)
> {
>  struct perf_event *event;
>  int cpu;
> 
>  get_online_cpus();
>  for_each_possible_cpu(cpu) {
>  event = perf_create_kernel_counter(&perf_kdump_attr,
>  cpu, NULL,
>  NULL, NULL);
> 
>   ...
> 
>  ret = rb_alloc_kernel(event, perf_kdump_data_size, 
> perf_kdump_aux_size);
> 
>  ...
>  
>  per_cpu(perf_kdump_event, cpu) = event;
>  }
>  put_online_cpus();
> }
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
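
Note on the rdmsr() remark quoted in the message above: the point is that
a logger which owns its configuration can build the control value from
known bits instead of reading the MSR back. The following is only a
user-space sketch of that idea. The fake_wrmsrl() stand-in, the
pt_log_ctl_base name, and the particular set of control bits chosen are
assumptions for illustration and are not taken from the patch; the bit
positions follow the IA32_RTIT_CTL layout in the Intel SDM.

```c
#include <assert.h>
#include <stdint.h>

/* IA32_RTIT_CTL bit positions (Intel SDM). Which bits a logger would
 * actually want is an assumption made for this sketch. */
#define RTIT_CTL_TRACEEN   (1ULL << 0)
#define RTIT_CTL_OS        (1ULL << 2)
#define RTIT_CTL_TOPA      (1ULL << 8)
#define RTIT_CTL_BRANCH_EN (1ULL << 13)

/* Hypothetical stand-in for the MSR so the sketch runs in user space. */
static uint64_t fake_rtit_ctl;

static void fake_wrmsrl(uint64_t val)
{
	fake_rtit_ctl = val;
}

/* Everything except TraceEn is fixed by the logger's own configuration. */
static const uint64_t pt_log_ctl_base =
	RTIT_CTL_OS | RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;

/* No read-modify-write: the full value is known at the call site. */
static void enable_pt(int enable)
{
	uint64_t ctl = pt_log_ctl_base;

	if (enable)
		ctl |= RTIT_CTL_TRACEEN;

	fake_wrmsrl(ctl);
}
```

Calling enable_pt(1) followed by enable_pt(0) writes the base value with
and then without TraceEn, regardless of what the register held before.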


[tip:perf/core] perf/x86/intel/pt: Clean up files of Intel Processor Trace

2015-08-12 Thread tip-bot for Takao Indoh
Commit-ID:  709bc871923c12b284424f9d47b99dc975ba8b29
Gitweb: http://git.kernel.org/tip/709bc871923c12b284424f9d47b99dc975ba8b29
Author: Takao Indoh 
AuthorDate: Tue, 4 Aug 2015 18:36:55 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 12 Aug 2015 11:43:22 +0200

perf/x86/intel/pt: Clean up files of Intel Processor Trace

This patch just cleans up some files of Intel Processor Trace, does not
change its behavior. This patch removes unused definitions and replaces a
constant value with a macro.

Signed-off-by: Takao Indoh 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo 
Cc: H.Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1438681015-5124-1-git-send-email-indou.ta...@jp.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/intel_pt.h| 33 ++-
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 10 +-
 2 files changed, 11 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
index feb293e..336878a 100644
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -25,32 +25,11 @@
  */
 #define TOPA_PMI_MARGIN 512
 
-/*
- * Table of Physical Addresses bits
- */
-enum topa_sz {
-   TOPA_4K = 0,
-   TOPA_8K,
-   TOPA_16K,
-   TOPA_32K,
-   TOPA_64K,
-   TOPA_128K,
-   TOPA_256K,
-   TOPA_512K,
-   TOPA_1MB,
-   TOPA_2MB,
-   TOPA_4MB,
-   TOPA_8MB,
-   TOPA_16MB,
-   TOPA_32MB,
-   TOPA_64MB,
-   TOPA_128MB,
-   TOPA_SZ_END,
-};
+#define TOPA_SHIFT 12
 
-static inline unsigned int sizes(enum topa_sz tsz)
+static inline unsigned int sizes(unsigned int tsz)
 {
-   return 1 << (tsz + 12);
+   return 1 << (tsz + TOPA_SHIFT);
 };
 
 struct topa_entry {
@@ -66,8 +45,8 @@ struct topa_entry {
u64 rsvd4   : 16;
 };
 
-#define TOPA_SHIFT 12
-#define PT_CPUID_LEAVES 2
+#define PT_CPUID_LEAVES    2
+#define PT_CPUID_REGS_NUM  4 /* number of registers (eax, ebx, ecx, edx) */
 
 enum pt_capabilities {
PT_CAP_max_subleaf = 0,
@@ -85,7 +64,7 @@ enum pt_capabilities {
 
 struct pt_pmu {
struct pmu  pmu;
-   u32 caps[4 * PT_CPUID_LEAVES];
+   u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index e20cfac..4216928 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -79,7 +79,7 @@ static struct pt_cap_desc {
 static u32 pt_cap_get(enum pt_capabilities cap)
 {
struct pt_cap_desc *cd = &pt_caps[cap];
-   u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+   u32 c = pt_pmu.caps[cd->leaf * PT_CPUID_REGS_NUM + cd->reg];
unsigned int shift = __ffs(cd->mask);
 
return (c & cd->mask) >> shift;
@@ -145,10 +145,10 @@ static int __init pt_pmu_hw_init(void)
 
for (i = 0; i < PT_CPUID_LEAVES; i++) {
cpuid_count(20, i,
-   &pt_pmu.caps[CR_EAX + i*4],
-   &pt_pmu.caps[CR_EBX + i*4],
-   &pt_pmu.caps[CR_ECX + i*4],
-   &pt_pmu.caps[CR_EDX + i*4]);
+   &pt_pmu.caps[CR_EAX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_EBX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_ECX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_EDX + i*PT_CPUID_REGS_NUM]);
}
 
ret = -ENOMEM;
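
Note on the caps[] indexing touched by this commit: each CPUID leaf
contributes PT_CPUID_REGS_NUM consecutive words, and a capability is read
by masking one word and shifting down to the mask's lowest set bit. A
small user-space sketch of that lookup follows. The cap_desc and
cap_extract names are hypothetical, and __builtin_ctz (a GCC/Clang
builtin) is used here only as a stand-in for the kernel's __ffs().

```c
#include <assert.h>
#include <stdint.h>

#define PT_CPUID_LEAVES   2
#define PT_CPUID_REGS_NUM 4 /* eax, ebx, ecx, edx */

/* Same ordering as the enum in perf_event_intel_pt.c. */
enum cpuid_regs { CR_EAX = 0, CR_ECX, CR_EDX, CR_EBX };

/* Hypothetical, trimmed-down capability descriptor. */
struct cap_desc {
	uint32_t leaf;
	uint8_t  reg;
	uint32_t mask; /* must be nonzero */
};

static uint32_t cap_extract(const uint32_t *caps, const struct cap_desc *cd)
{
	/* One row of PT_CPUID_REGS_NUM words per leaf. */
	uint32_t c = caps[cd->leaf * PT_CPUID_REGS_NUM + cd->reg];
	/* Index of the lowest set bit of the mask. */
	unsigned int shift = (unsigned int)__builtin_ctz(cd->mask);

	return (c & cd->mask) >> shift;
}
```

With this layout a single-bit capability yields 0 or 1, and a multi-bit
field yields its value right-aligned.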


[PATCH v2] x86: Clean up files of Intel Processor Trace

2015-08-04 Thread Takao Indoh
This patch just cleans up some files of Intel Processor Trace and does not
change its behavior. It removes an unused definition and replaces a
constant value with a macro.

changelog:
v2:
- Remove unnecessary fix

v1:
https://lkml.org/lkml/2015/8/3/96

Signed-off-by: Takao Indoh 
---
 arch/x86/kernel/cpu/intel_pt.h|   33 +---
 arch/x86/kernel/cpu/perf_event_intel_pt.c |   10 
 2 files changed, 11 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
index 1c338b0..4d0de6f 100644
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -25,32 +25,11 @@
  */
 #define TOPA_PMI_MARGIN 512
 
-/*
- * Table of Physical Addresses bits
- */
-enum topa_sz {
-   TOPA_4K = 0,
-   TOPA_8K,
-   TOPA_16K,
-   TOPA_32K,
-   TOPA_64K,
-   TOPA_128K,
-   TOPA_256K,
-   TOPA_512K,
-   TOPA_1MB,
-   TOPA_2MB,
-   TOPA_4MB,
-   TOPA_8MB,
-   TOPA_16MB,
-   TOPA_32MB,
-   TOPA_64MB,
-   TOPA_128MB,
-   TOPA_SZ_END,
-};
+#define TOPA_SHIFT 12
 
-static inline unsigned int sizes(enum topa_sz tsz)
+static inline unsigned int sizes(unsigned int tsz)
 {
-   return 1 << (tsz + 12);
+   return 1 << (tsz + TOPA_SHIFT);
 };
 
 struct topa_entry {
@@ -66,8 +45,8 @@ struct topa_entry {
u64 rsvd4   : 16;
 };
 
-#define TOPA_SHIFT 12
-#define PT_CPUID_LEAVES 2
+#define PT_CPUID_LEAVES    2
+#define PT_CPUID_REGS_NUM  4 /* number of registers (eax, ebx, ecx, edx) */
 
 enum pt_capabilities {
PT_CAP_max_subleaf = 0,
@@ -79,7 +58,7 @@ enum pt_capabilities {
 
 struct pt_pmu {
struct pmu  pmu;
-   u32 caps[4 * PT_CPUID_LEAVES];
+   u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 183de71..cc381f5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -73,7 +73,7 @@ static struct pt_cap_desc {
 static u32 pt_cap_get(enum pt_capabilities cap)
 {
struct pt_cap_desc *cd = &pt_caps[cap];
-   u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+   u32 c = pt_pmu.caps[cd->leaf * PT_CPUID_REGS_NUM + cd->reg];
unsigned int shift = __ffs(cd->mask);
 
return (c & cd->mask) >> shift;
@@ -129,10 +129,10 @@ static int __init pt_pmu_hw_init(void)
 
for (i = 0; i < PT_CPUID_LEAVES; i++) {
cpuid_count(20, i,
-   &pt_pmu.caps[CR_EAX + i*4],
-   &pt_pmu.caps[CR_EBX + i*4],
-   &pt_pmu.caps[CR_ECX + i*4],
-   &pt_pmu.caps[CR_EDX + i*4]);
+   &pt_pmu.caps[CR_EAX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_EBX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_ECX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_EDX + i*PT_CPUID_REGS_NUM]);
}
 
ret = -ENOMEM;
-- 
1.7.1




Re: [PATCH] x86: Clean up files of Intel Processor Trace

2015-08-03 Thread Takao Indoh
On 2015/08/03 20:03, Borislav Petkov wrote:
> On Mon, Aug 03, 2015 at 11:08:07AM +0200, Peter Zijlstra wrote:
>> For those of us suffering OCDs and all, its a good change though. The
>> alfabet song does go: A, B, C, D etc.. after all. Not: A, C, D, B ...
> 
> ... except that x86 encoding orders regs like it was originally: AX,
> CX, DX, BX, ... Don't ask me why - looks like someone thought that the
> C (count) and D (double precision - AX extension) registers were more
> important than B (base).
> 
> Or someone was simply illiterate.
> 

I thought this was a typo. If it is intentional, I'll keep it intact.

Thanks,
Takao Indoh
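
Note on why the enum order is harmless either way: the CPUID results are
stored and looked up through the same enum names, so the slot each
register lands in only has to be consistent, not alphabetical. The sketch
below uses the encoding order (AX, CX, DX, BX) described above. The
store_leaf and load_reg helpers are hypothetical and exist only to
illustrate the point.

```c
#include <assert.h>
#include <stdint.h>

#define PT_CPUID_REGS_NUM 4

/* x86 encoding order: AX = 0, CX = 1, DX = 2, BX = 3. */
enum cpuid_regs { CR_EAX = 0, CR_ECX, CR_EDX, CR_EBX };

/* Store one leaf's four CPUID output registers by enum name. */
static void store_leaf(uint32_t *caps, unsigned int leaf,
		       uint32_t eax, uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	caps[CR_EAX + leaf * PT_CPUID_REGS_NUM] = eax;
	caps[CR_EBX + leaf * PT_CPUID_REGS_NUM] = ebx;
	caps[CR_ECX + leaf * PT_CPUID_REGS_NUM] = ecx;
	caps[CR_EDX + leaf * PT_CPUID_REGS_NUM] = edx;
}

/* Look a register up again through the same enum. */
static uint32_t load_reg(const uint32_t *caps, unsigned int leaf,
			 enum cpuid_regs reg)
{
	return caps[leaf * PT_CPUID_REGS_NUM + reg];
}
```

EBX ends up in slot 3 rather than slot 1 of each row, but every reader
that goes through the enum still gets the right value.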



Re: [PATCH] x86: Clean up files of Intel Processor Trace

2015-08-03 Thread Takao Indoh
On 2015/08/03 18:44, Alexander Shishkin wrote:
> On 3 August 2015 at 12:08, Peter Zijlstra  wrote:
>> On Mon, Aug 03, 2015 at 12:03:13PM +0300, Alexander Shishkin wrote:
>>> Takao Indoh  writes:
>>
>>> Even though TOPA_SHIFT happens to be the same as PAGE_SHIFT, it is a
>>> property of a separate hardware block, not mmu. PAGE_SHIFT is 12, but
>>> 12 is not always PAGE_SHIFT.
>>
>> PAGE_SHIFT is _always_ 12 on x86. Changing that will require changing
>> the page table format, a rather unlikely thing to go happen.
> 
> Of course. Yet that doesn't justify turning every 12 into PAGE_SHIFT
> is what I'm saying.
> 
> Oh, look, it's PAGE_SHIFT o'clock on x86, time for lunch. :)

I thought the base address of the output region was page aligned. I took
another look at the Intel SDM; it just says the base address is a
4K-aligned physical address and does not mention page size. So, logically,
TOPA_SHIFT and PAGE_SHIFT are different things, and I'll remove this
change in the next version.

Thanks,
Takao Indoh

> 
> Regards,
> --
> Alex
> 
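
Note on the size computation under discussion: the removed enum only
named the values 0..15, and the region size follows from the index by a
shift. A minimal sketch, keeping TOPA_SHIFT as its own constant as
concluded above; the topa_region_bytes name is hypothetical (the driver's
helper is called sizes()).

```c
#include <assert.h>

/* Shift of the smallest ToPA region (4K); deliberately not PAGE_SHIFT. */
#define TOPA_SHIFT 12

/* Size in bytes of a ToPA region for size index tsz (0 = 4K, 15 = 128MB). */
static inline unsigned long topa_region_bytes(unsigned int tsz)
{
	return 1UL << (tsz + TOPA_SHIFT);
}
```

Index 0 gives 4096 bytes and index 15 gives 128MB, matching the first and
last named entries of the old enum.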




[PATCH] x86: Clean up files of Intel Processor Trace

2015-08-03 Thread Takao Indoh
This patch just cleans up some files of Intel Processor Trace and does not
change its behavior. It removes an unused definition, replaces a constant
value with a macro, etc.

Signed-off-by: Takao Indoh 
---
 arch/x86/kernel/cpu/intel_pt.h|   33 +---
 arch/x86/kernel/cpu/perf_event_intel_pt.c |   14 ++--
 2 files changed, 13 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
index 1c338b0..6b48ba8 100644
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -25,32 +25,11 @@
  */
 #define TOPA_PMI_MARGIN 512
 
-/*
- * Table of Physical Addresses bits
- */
-enum topa_sz {
-   TOPA_4K = 0,
-   TOPA_8K,
-   TOPA_16K,
-   TOPA_32K,
-   TOPA_64K,
-   TOPA_128K,
-   TOPA_256K,
-   TOPA_512K,
-   TOPA_1MB,
-   TOPA_2MB,
-   TOPA_4MB,
-   TOPA_8MB,
-   TOPA_16MB,
-   TOPA_32MB,
-   TOPA_64MB,
-   TOPA_128MB,
-   TOPA_SZ_END,
-};
+#define TOPA_SHIFT PAGE_SHIFT
 
-static inline unsigned int sizes(enum topa_sz tsz)
+static inline unsigned int sizes(unsigned int tsz)
 {
-   return 1 << (tsz + 12);
+   return 1 << (tsz + TOPA_SHIFT);
 };
 
 struct topa_entry {
@@ -66,8 +45,8 @@ struct topa_entry {
u64 rsvd4   : 16;
 };
 
-#define TOPA_SHIFT 12
-#define PT_CPUID_LEAVES 2
+#define PT_CPUID_LEAVES    2
+#define PT_CPUID_REGS_NUM  4 /* number of registers (eax, ebx, ecx, edx) */
 
 enum pt_capabilities {
PT_CAP_max_subleaf = 0,
@@ -79,7 +58,7 @@ enum pt_capabilities {
 
 struct pt_pmu {
struct pmu  pmu;
-   u32 caps[4 * PT_CPUID_LEAVES];
+   u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 183de71..1e7d89e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -37,9 +37,9 @@ static struct pt_pmu pt_pmu;
 
 enum cpuid_regs {
CR_EAX = 0,
+   CR_EBX,
CR_ECX,
-   CR_EDX,
-   CR_EBX
+   CR_EDX
 };
 
 /*
@@ -73,7 +73,7 @@ static struct pt_cap_desc {
 static u32 pt_cap_get(enum pt_capabilities cap)
 {
struct pt_cap_desc *cd = &pt_caps[cap];
-   u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+   u32 c = pt_pmu.caps[cd->leaf * PT_CPUID_REGS_NUM + cd->reg];
unsigned int shift = __ffs(cd->mask);
 
return (c & cd->mask) >> shift;
@@ -129,10 +129,10 @@ static int __init pt_pmu_hw_init(void)
 
for (i = 0; i < PT_CPUID_LEAVES; i++) {
cpuid_count(20, i,
-   &pt_pmu.caps[CR_EAX + i*4],
-   &pt_pmu.caps[CR_EBX + i*4],
-   &pt_pmu.caps[CR_ECX + i*4],
-   &pt_pmu.caps[CR_EDX + i*4]);
+   &pt_pmu.caps[CR_EAX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_EBX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_ECX + i*PT_CPUID_REGS_NUM],
+   &pt_pmu.caps[CR_EDX + i*PT_CPUID_REGS_NUM]);
}
 
ret = -ENOMEM;
-- 
1.7.1




Re: [PATCH RFC 1/3] x86: Add Intel PT common files

2015-08-02 Thread Takao Indoh
On 2015/08/02 19:02, Thomas Gleixner wrote:
> On Wed, 29 Jul 2015, Takao Indoh wrote:
>> +/*
>> + * Table of Physical Addresses bits
>> + */
>> +enum topa_sz {
>> +TOPA_4K = 0,
>> +TOPA_8K,
>> +TOPA_16K,
>> +TOPA_32K,
>> +TOPA_64K,
>> +TOPA_128K,
>> +TOPA_256K,
>> +TOPA_512K,
>> +TOPA_1MB,
>> +TOPA_2MB,
>> +TOPA_4MB,
>> +TOPA_8MB,
>> +TOPA_16MB,
>> +TOPA_32MB,
>> +TOPA_64MB,
>> +TOPA_128MB,
>> +TOPA_SZ_END,
>> +};
> 
> While moving this around, can we pretty please clean that up? That
> enum is just pointless. None of the values is ever used and they hardly
> have any value as they are just computable.

Ok, I'll update my patches based on Alex's comments, but before that
I'll clean up intel_pt.h and perf_event_intel_pt.c.

Thanks,
Takao Indoh

> 
>> +static inline unsigned int sizes(enum topa_sz tsz)
>> +{
>> +return 1 << (tsz + 12);
> 
> 12?? PAGE_SHIFT perhaps?
> 
>> +#define TOPA_SHIFT 12
> 
> Sigh.
> 
>> diff --git a/arch/x86/kernel/cpu/intel_pt_cap.c b/arch/x86/kernel/cpu/intel_pt_cap.c
>> new file mode 100644
>> index 000..a2cfbfc
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/intel_pt_cap.c
>> @@ -0,0 +1,69 @@
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/mm.h>
>> +#include <asm/intel_pt.h>
>> +
>> +enum cpuid_regs {
>> +CR_EAX = 0,
>> +CR_ECX,
>> +CR_EDX,
>> +CR_EBX
>> +};
>> +
>> +static u32 cpuid_cache[4 * PT_CPUID_LEAVES];
> 
> 4 ? Magic constant pulled from thin air?
> 
>> +static int pt_cap_initialized;
>> +
>> +#define PT_CAP(_n, _l, _r, _m)                               \
>> +[PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l,\
>> +.reg = _r, .mask = _m }
>> +
>> +static struct pt_cap_desc {
>> +const char  *name;
>> +u32 leaf;
>> +u8  reg;
>> +u32 mask;
>> +} pt_caps[] = {
>> +PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
>> +PT_CAP(cr3_filtering,   0, CR_EBX, BIT(0)),
>> +PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
>> +PT_CAP(topa_multiple_entries,   0, CR_ECX, BIT(1)),
>> +PT_CAP(payloads_lip,0, CR_ECX, BIT(31)),
>> +};
>> +
>> +u32 pt_cap_get(enum pt_capabilities cap)
>> +{
>> +struct pt_cap_desc *cd = &pt_caps[cap];
>> +u32 c = cpuid_cache[cd->leaf * 4 + cd->reg];
> 
> Ditto
> 
>> +unsigned int shift = __ffs(cd->mask);
>> +
>> +return (c & cd->mask) >> shift;
>> +}
>> +
>> +const char *pt_cap_name(enum pt_capabilities cap)
>> +{
>> +return pt_caps[cap].name;
>> +}
>> +
>> +int pt_cap_num(void)
>> +{
>> +return ARRAY_SIZE(pt_caps);
>> +}
>> +
>> +void __init pt_cap_init(void)
>> +{
>> +int i;
>> +
>> +if (pt_cap_initialized)
>> +return;
>> +
>> +for (i = 0; i < PT_CPUID_LEAVES; i++) {
>> +cpuid_count(20, i,
>> +&cpuid_cache[CR_EAX + i*4],
> 
> Once more.
> 
> Thanks,
> 
>   tglx
> 
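
To illustrate the review point that the enum values are "just
computable", here is a minimal userspace sketch that derives a ToPA
region size from its size order with one named shift instead of a literal
12. The names topa_region_size and TOPA_MAX_ORDER are hypothetical;
TOPA_SHIFT is taken from the quoted patch, and whether PAGE_SHIFT is the
better constant is left to the thread.

```c
#include <stddef.h>

#define TOPA_SHIFT      12   /* 4 KiB base granule, as in the quoted patch */
#define TOPA_MAX_ORDER  15   /* 4 KiB << 15 = 128 MiB, the largest enum entry */

/* Region size in bytes for a ToPA size order, or 0 if out of range. */
static size_t topa_region_size(unsigned int order)
{
    if (order > TOPA_MAX_ORDER)
        return 0;
    return (size_t)1 << (order + TOPA_SHIFT);
}
```

Order 0 gives 4 KiB and order 9 gives 2 MiB, which are the values the
enum names TOPA_4K and TOPA_2MB spell out by hand.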




Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-29 Thread Takao Indoh
On 2015/07/29 18:09, Alexander Shishkin wrote:
> Takao Indoh <indou.ta...@jp.fujitsu.com> writes:
> 
>> On 2015/07/29 15:08, Alexander Shishkin wrote:
>>> Instead, we should be able to use the existing perf functionality to
>>> enable the system-wide tracing, so that it goes through the
>>
>> Does "existing driver" mean the PMU driver (perf_event_intel_pt.c)?
> 
> Yes.
> 
>> The feature in these patches is a sort of flight recorder. Once it
>> starts, it never stops and does not export anything to the user; it just
>> captures data with minimum overhead in preparation for a kernel panic.
>> This usage is different from perf, so I'm not sure whether this feature
>> can be implemented using the perf infrastructure.
> 
> Why not? There is an established infrastructure for in-kernel perf
> events already, take a look at the nmi watchdog, for example.

Ok, I'm reading the code around perf_event_create_kernel_counter. It
seems to work for my purpose, I'll try to update my patch with this.

Thanks,
Takao Indoh

> 
>>> driver. Another thing to remember is that you'd also need some of the
>>> sideband data (vm mappings, context switches) to be able to properly
>>> decode the trace, which also can come from perf. And it'd also be much
>>> less code. The only missing piece is the code that would allocate the
>>> ring buffer for such events.
>>
>> The sideband data is needed if we want to reconstruct user program flow,
>> but is it needed to reconstruct kernel panic path?
> 
> You are not really interested in the panic path as much as events
> leading up to the panic and those usually have context, which is much
> easier to reconstruct with sideband info. Some of it you can reconstruct
> by walking kernel's data structures, but that is not reliable after the
> panic.
> 
> Regards,
> --
> Alex
> 




Re: [PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-29 Thread Takao Indoh
On 2015/07/29 15:08, Alexander Shishkin wrote:
> Takao Indoh <indou.ta...@jp.fujitsu.com> writes:
> 
>> This patch provides Intel PT logging feature. When system boots with a
>> parameter "intel_pt_log", log buffers for Intel PT are allocated and
>> logging starts, then processor flow information is written in the log
>> buffer by hardware like flight recorder. This is very helpful to
>> investigate a cause of kernel panic.
>>
>> The log buffer size is specified by the parameter
>> "intel_pt_log_buf_len=<size>". This buffer is used as circular buffer,
>> therefore old events are overwritten by new events.
> 
> [skip]
> 
>> +static void enable_pt(int enable)
>> +{
>> +u64 ctl;
>> +
>> +rdmsrl(MSR_IA32_RTIT_CTL, ctl);
> 
> Ideally, you shouldn't need this rdmsr(), because in this code you
> should know exactly which ctl bits you need set when you enable.

I see, I'll remove this rdmsr in next version.

> 
>> +
>> +if (enable)
>> +ctl |= RTIT_CTL_TRACEEN;
>> +else
>> +ctl &= ~RTIT_CTL_TRACEEN;
>> +
>> +wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +}
> 
> But the bigger problem with this approach is that it duplicates the
> existing driver's functionality and some of the code, which just makes
> it harder to maintain among other things.
> 
> Instead, we should be able to use the existing perf functionality to
> enable the system-wide tracing, so that it goes through the

Does "existing driver" mean the PMU driver (perf_event_intel_pt.c)?

The feature in these patches is a sort of flight recorder. Once it
starts, it never stops and does not export anything to the user; it just
captures data with minimum overhead in preparation for a kernel panic.
This usage is different from perf, so I'm not sure whether this feature
can be implemented using the perf infrastructure.

> driver. Another thing to remember is that you'd also need some of the
> sideband data (vm mappings, context switches) to be able to properly
> decode the trace, which also can come from perf. And it'd also be much
> less code. The only missing piece is the code that would allocate the
> ring buffer for such events.

The sideband data is needed if we want to reconstruct user program flow,
but is it needed to reconstruct kernel panic path?

Thanks,
Takao Indoh


> 
> Something like:
> 
> static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);
> 
> static struct perf_event_attr perf_kdump_attr;
> 
> ...
> 
> static int perf_kdump_init(void)
> {
>  struct perf_event *event;
>  int cpu;
> 
>  get_online_cpus();
>  for_each_possible_cpu(cpu) {
>  event = perf_create_kernel_counter(&perf_kdump_attr,
>  cpu, NULL,
>  NULL, NULL);
> 
>   ...
> 
>  ret = rb_alloc_kernel(event, perf_kdump_data_size, perf_kdump_aux_size);
> 
>  ...
>  
>  per_cpu(perf_kdump_event, cpu) = event;
>  }
>  put_online_cpus();
> }
> 
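
On the rdmsr() point above, a minimal userspace sketch of building the
control value from known bits instead of a read-modify-write. The bit set
mirrors what setup_pt_ctl_register() in the posted patch ORs in. The bit
positions are assumptions of this sketch and should be checked against
msr-index.h; pt_log_ctl_value and PT_LOG_CTL_BASE are hypothetical names.

```c
#include <stdint.h>

/* Assumed bit positions; verify against msr-index.h before relying on them. */
#define RTIT_CTL_TRACEEN   (1ULL << 0)
#define RTIT_CTL_OS        (1ULL << 2)
#define RTIT_CTL_USR       (1ULL << 3)
#define RTIT_CTL_TOPA      (1ULL << 8)
#define RTIT_CTL_TSC_EN    (1ULL << 10)
#define RTIT_CTL_BRANCH_EN (1ULL << 13)

/* Fixed configuration the logger asks for in setup_pt_ctl_register(). */
#define PT_LOG_CTL_BASE (RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TOPA | \
                         RTIT_CTL_TSC_EN | RTIT_CTL_BRANCH_EN)

/* Value to write to the control MSR; no prior read is needed. */
static uint64_t pt_log_ctl_value(int enable)
{
    return enable ? (PT_LOG_CTL_BASE | RTIT_CTL_TRACEEN) : PT_LOG_CTL_BASE;
}
```

In the kernel this would reduce enable_pt() to a single wrmsrl() of the
computed value, since the logger is the only writer of the register in
this design.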




Re: [PATCH RFC 0/3] x86: Intel Processor Trace Logger

2015-07-28 Thread Takao Indoh
On 2015/07/29 14:44, Alexander Shishkin wrote:
> Takao Indoh <indou.ta...@jp.fujitsu.com> writes:
> 
>> Hi all,
>>
>> This patch creates log buffer for Intel PT and enable logging at boot
>> time. When kernel panic occurs, we can get this log buffer from
>> crashdump file by kdump, and reconstruct the flow that led to the panic.
> 
> Good to see this work going forward!
> 
>> Takao Indoh (3):
>>x86: Add Intel PT common files
>>x86: Add Intel PT logger
>>x86: Stop Intel PT and save its registers when panic occurs
>>
>>   arch/x86/Kconfig  |   16 ++
>>   arch/x86/include/asm/intel_pt.h   |   84 +
>>   arch/x86/kernel/cpu/Makefile  |3 +
>>   arch/x86/kernel/cpu/intel_pt.h|  131 -
>>   arch/x86/kernel/cpu/intel_pt_cap.c|   69 +++
>>   arch/x86/kernel/cpu/intel_pt_log.c|  288 +
>>   arch/x86/kernel/cpu/intel_pt_perf.h   |   78 
>>   arch/x86/kernel/cpu/perf_event_intel_pt.c |   54 +-
>>   arch/x86/kernel/crash.c   |9 +
>>   9 files changed, 556 insertions(+), 176 deletions(-)
>>   create mode 100644 arch/x86/include/asm/intel_pt.h
>>   delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
>>   create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
>>   create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c
>>   create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h
> 
> One note here: you want to use -M with git-format-patch so that renames
> are handled better.

Thank you, I didn't know this option. I'll use it next time.

Thanks,
Takao Indoh


> 
> Regards,
> --
> Alex
> 




[PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-28 Thread Takao Indoh
This patch provides an Intel PT logging feature. When the system boots
with the parameter "intel_pt_log", log buffers for Intel PT are allocated
and logging starts; the hardware then writes processor flow information
into the log buffer like a flight recorder. This is very helpful for
investigating the cause of a kernel panic.

The log buffer size is specified by the parameter
"intel_pt_log_buf_len=<size>". The buffer is used as a circular buffer,
so old events are overwritten by new events.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/Kconfig   |   16 ++
 arch/x86/kernel/cpu/Makefile   |2 +
 arch/x86/kernel/cpu/intel_pt_log.c |  288 
 3 files changed, 306 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced1..c31400f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1658,6 +1658,22 @@ config X86_INTEL_MPX
 
  If unsure, say N.
 
+config X86_INTEL_PT_LOG
+   prompt "Intel PT logger"
+   def_bool n
+   depends on CPU_SUP_INTEL
+   ---help---
+ Intel PT is a hardware features that can capture information
+ about program execution flow. Once Intel PT is enabled, the
+ events which change program flow, like branch instructions,
+ exceptions, interruptions, traps and so on are logged in
+ the memory.
+
+ This option enables starting Intel PT logging feature at boot
+ time. When kernel panic occurs, Intel PT log buffer can be
+ retrieved from crash dump file and enables to reconstruct the
+ detailed flow that led to the panic.
+
 config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 77d371c..24629ff 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -58,6 +58,8 @@ obj-$(CONFIG_X86_LOCAL_APIC)  += perfctr-watchdog.o perf_event_amd_ibs.o
 
 obj-$(CONFIG_HYPERVISOR_GUEST) += vmware.o hypervisor.o mshyperv.o
 
+obj-$(CONFIG_X86_INTEL_PT_LOG) += intel_pt_log.o
+
 ifdef CONFIG_X86_FEATURE_NAMES
 quiet_cmd_mkcapflags = MKCAP   $@
   cmd_mkcapflags = $(CONFIG_SHELL) $(srctree)/$(src)/mkcapflags.sh $< $@
diff --git a/arch/x86/kernel/cpu/intel_pt_log.c b/arch/x86/kernel/cpu/intel_pt_log.c
new file mode 100644
index 000..b1c4d66
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_log.c
@@ -0,0 +1,288 @@
+/*
+ * Intel Processor Trace Logger
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+
+#define PT_LOG_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+struct pt_log_buf {
+   int cpu;
+
+   void **region;  /* array of pointer to output region */
+   int region_size;/* size of region array */
+   int region_order;   /* page order of region */
+
+   void **tbl; /* array of pointer to ToPA table */
+   int tbl_size;   /* size of tbl array */
+
+   /* Saved registers on panic */
+   u64 saved_msr_ctl;
+   u64 saved_msr_status;
+   u64 saved_msr_output_base;
+   u64 saved_msr_output_mask;
+};
+
+static int pt_log_enabled;
+static int pt_log_buf_nr_pages = 1024; /* number of pages for log buffer */
+
+static DEFINE_PER_CPU(struct pt_log_buf, pt_log_buf_ptr);
+static struct cpumask pt_cpu_mask;
+
+static void enable_pt(int enable)
+{
+   u64 ctl;
+
+   rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+   if (enable)
+   ctl |= RTIT_CTL_TRACEEN;
+   else
+   ctl &= ~RTIT_CTL_TRACEEN;
+
+   wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+}
+
+void save_intel_pt_registers(void)
+{
+   struct pt_log_buf *buf = this_cpu_ptr(&pt_log_buf_ptr);
+
+   if (!cpumask_test_cpu(smp_processor_id(), &pt_cpu_mask))
+   return;
+
+   enable_pt(0);
+
+   rdmsrl(MSR_IA32_RTIT_CTL, buf->saved_msr_ctl);
+   rdmsrl(MSR_IA32_RTIT_STATUS, buf->saved_msr_status);
+   rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, buf->saved_msr_output_base);
+   rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, buf->saved_msr_output_mask);
+}
+
+static void setup_pt_ctl_register(void)
+{
+   u64 reg;
+
+   rdmsrl(MSR_IA32_RTIT_CTL, reg);
+
+   reg |= RTIT_CTL_OS|RTIT_CTL_USR|RTIT_CTL_TOPA|RTIT_CTL_TSC_EN|RTIT_CTL_BRANCH_EN;
+
+   wrmsrl(MSR_IA32_RTIT_CTL, reg);
+}
+
+static void setup_pt_output_register(void *base, unsigned int topa_idx,
+unsigned int output_off)
+{
+   u64 reg;
+
+   wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(base));
+
+   reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+   wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+static void *pt_alloc_pages(void **buf, int *index, int node, int order)
+{
+   struct page *page;
+   void *ptr = N
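
The value that setup_pt_output_register() writes to
MSR_IA32_RTIT_OUTPUT_MASK packs two positions into one 64-bit word. A
minimal userspace sketch of that packing and its inverse follows, using
the same expression as the patch. The struct and function names are
hypothetical, and the field widths (bits 31:7 for the table index, bits
63:32 for the output offset) are read off the shifts in the patch rather
than taken from the specification.

```c
#include <stdint.h>

#define PT_TOPA_IDX_MASK 0x1ffffffULL  /* 25 bits: register bits 31:7 */

struct pt_output_pos {
    uint32_t topa_idx;    /* index of the current entry in the ToPA table */
    uint32_t output_off;  /* byte offset inside the current output region */
};

/* Same layout as the patch: 0x7f | (topa_idx << 7) | (output_off << 32). */
static uint64_t pt_output_mask_encode(struct pt_output_pos pos)
{
    return 0x7fULL |
           (((uint64_t)pos.topa_idx & PT_TOPA_IDX_MASK) << 7) |
           ((uint64_t)pos.output_off << 32);
}

/* Recover both positions from a register value. */
static struct pt_output_pos pt_output_mask_decode(uint64_t reg)
{
    struct pt_output_pos pos;

    pos.topa_idx   = (uint32_t)((reg >> 7) & PT_TOPA_IDX_MASK);
    pos.output_off = (uint32_t)(reg >> 32);
    return pos;
}
```

The decode direction is what a post-mortem tool would apply to the
saved_msr_output_mask field of struct pt_log_buf.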

[PATCH RFC 3/3] x86: Stop Intel PT and save its registers when panic occurs

2015-07-28 Thread Takao Indoh
When a panic occurs, Intel PT logging is stopped to prevent it from
overwriting its log buffer. The Intel PT registers are saved in memory
on panic; a debugger needs them to find the last position where Intel PT
wrote data.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/include/asm/intel_pt.h |2 ++
 arch/x86/kernel/crash.c |9 +
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
index 7cb16e1..71bcd8d 100644
--- a/arch/x86/include/asm/intel_pt.h
+++ b/arch/x86/include/asm/intel_pt.h
@@ -79,4 +79,6 @@ u32 pt_cap_get(enum pt_capabilities cap);
 const char *pt_cap_name(enum pt_capabilities cap);
 int pt_cap_num(void);
 
+void save_intel_pt_registers(void);
+
 #endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..953c086 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -127,6 +128,10 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
disable_local_APIC();
 }
 
@@ -172,6 +177,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
 #ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
ioapic_zap_locks();
-- 
1.7.1
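
Why the last write position matters: in a circular log the oldest data
begins right where the writer stopped. A minimal userspace sketch of
unrolling such a buffer into oldest-to-newest order follows. It assumes a
single contiguous buffer that has wrapped at least once, which is simpler
than the multi-region ToPA layout in the patches; unroll_circular_log is
a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a wrapped circular log into dst in oldest-to-newest order.
 * write_off is the offset of the next byte the writer would have used,
 * so the oldest data starts there. dst must hold at least len bytes.
 * Returns 0 on success, -1 if write_off is outside the buffer.
 */
static int unroll_circular_log(const unsigned char *buf, size_t len,
                               size_t write_off, unsigned char *dst)
{
    if (write_off >= len)
        return -1;

    /* Oldest part: from the write position to the end of the buffer. */
    memcpy(dst, buf + write_off, len - write_off);
    /* Newest part: from the start of the buffer up to the write position. */
    memcpy(dst + (len - write_off), buf, write_off);
    return 0;
}
```

Without the saved registers the write offset is unknown, and a decoder
cannot tell which part of the dump is the most recent trace.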




[PATCH RFC 1/3] x86: Add Intel PT common files

2015-07-28 Thread Takao Indoh
Rename existing intel_pt.h to intel_pt_perf.h as a perf-specific header,
and make a new intel_pt.h as a common header of Intel PT feature. Also
add intel_pt_cap.c for Intel PT capability stuff.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/include/asm/intel_pt.h   |   82 ++
 arch/x86/kernel/cpu/Makefile  |1 +
 arch/x86/kernel/cpu/intel_pt.h|  131 -
 arch/x86/kernel/cpu/intel_pt_cap.c|   69 +++
 arch/x86/kernel/cpu/intel_pt_perf.h   |   78 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c |   54 ++--
 6 files changed, 239 insertions(+), 176 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt.h
 delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
 create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
new file mode 100644
index 000..7cb16e1
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt.h
@@ -0,0 +1,82 @@
+/*
+ * Intel(R) Processor Trace common header
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+   TOPA_4K = 0,
+   TOPA_8K,
+   TOPA_16K,
+   TOPA_32K,
+   TOPA_64K,
+   TOPA_128K,
+   TOPA_256K,
+   TOPA_512K,
+   TOPA_1MB,
+   TOPA_2MB,
+   TOPA_4MB,
+   TOPA_8MB,
+   TOPA_16MB,
+   TOPA_32MB,
+   TOPA_64MB,
+   TOPA_128MB,
+   TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+   return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+   u64 end : 1;
+   u64 rsvd0   : 1;
+   u64 intr: 1;
+   u64 rsvd1   : 1;
+   u64 stop: 1;
+   u64 rsvd2   : 1;
+   u64 size: 4;
+   u64 rsvd3   : 2;
+   u64 base: 36;
+   u64 rsvd4   : 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+/*
+ * Capability stuff
+ */
+enum pt_capabilities {
+   PT_CAP_max_subleaf = 0,
+   PT_CAP_cr3_filtering,
+   PT_CAP_topa_output,
+   PT_CAP_topa_multiple_entries,
+   PT_CAP_payloads_lip,
+};
+
+void pt_cap_init(void);
+u32 pt_cap_get(enum pt_capabilities cap);
+const char *pt_cap_name(enum pt_capabilities cap);
+int pt_cap_num(void);
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..77d371c 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_CPU_SUP_CYRIX_32)+= cyrix.o
 obj-$(CONFIG_CPU_SUP_CENTAUR)  += centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)   += umc.o
+obj-$(CONFIG_CPU_SUP_INTEL)+= intel_pt_cap.o
 
 obj-$(CONFIG_PERF_EVENTS)  += perf_event.o
 
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
deleted file mode 100644
index 1c338b0..000
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ /dev/null
@@ -1,131 +0,0 @@
-/*
- * Intel(R) Processor Trace PMU driver for perf
- * Copyright (c) 2013-2014, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
- *
- * Intel PT is specified in the Intel Architecture Instruction Set Extensions
- * Programming Reference:
- * http://software.intel.com/en-us/intel-isa-extensions
- */
-
-#ifndef __INTEL_PT_H__
-#define __INTEL_PT_H__
-
-/*
- * Single-entry ToPA: when this close to region boundary, switch
- * buffers to avoid losing data.
- */
-#define TOPA_PMI_MARGIN 512
-
-/*
- * Table of Physical Addresses bits
- */
-enum topa_sz {
-   TOPA_4K = 0,
-   TOPA_8K,
-   TOPA_16K,
-   TOPA_32K,
-   TOPA_64K,
-   TOPA_128K,
-   TOPA_256K,
-   TOPA_512K,
-   TOPA_1MB,
-   TO
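
The struct topa_entry bitfields in the new intel_pt.h describe one 64-bit
table entry. A minimal userspace sketch of the same layout built with
explicit shifts follows. It assumes the compiler allocates bitfields from
the least significant bit upward (as GCC does on x86), so that end is bit
0, intr bit 2, stop bit 4, size bits 9:6 and base bits 47:12. The names
topa_entry_pack and TOPA_ENTRY_* are hypothetical.

```c
#include <stdint.h>

#define TOPA_SHIFT       12
#define TOPA_ENTRY_END   (1ULL << 0)
#define TOPA_ENTRY_INT   (1ULL << 2)
#define TOPA_ENTRY_STOP  (1ULL << 4)
#define TOPA_ENTRY_FLAGS (TOPA_ENTRY_END | TOPA_ENTRY_INT | TOPA_ENTRY_STOP)

/* Build an entry from a physical address, a size order and flag bits. */
static uint64_t topa_entry_pack(uint64_t phys, unsigned int order,
                                uint64_t flags)
{
    /* 36-bit frame number, matching the width of the base bitfield. */
    uint64_t base = (phys >> TOPA_SHIFT) & 0xfffffffffULL;

    return (flags & TOPA_ENTRY_FLAGS) |
           ((uint64_t)(order & 0xf) << 6) |
           (base << TOPA_SHIFT);
}
```

Explicit shifts avoid depending on the compiler's bitfield allocation
order, at the cost of being less readable than the struct.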

[PATCH RFC 0/3] x86: Intel Processor Trace Logger

2015-07-28 Thread Takao Indoh
Hi all,

This patch series provides a logging feature for Intel Processor Trace
(Intel PT).

Intel PT is a new feature of the Intel "Broadwell" CPU that captures
information about program execution flow. Here is an article about Intel
PT:
https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing

Once Intel PT is enabled, events that change program flow, such as
branch instructions, exceptions, interrupts, and traps, are logged in
memory. This is very useful for debugging because it shows the detailed
behavior of the software.

This series creates a log buffer for Intel PT and enables logging at boot
time. When a kernel panic occurs, we can get this log buffer from the
crash dump file produced by kdump and reconstruct the flow that led to
the panic.

Takao Indoh (3):
  x86: Add Intel PT common files
  x86: Add Intel PT logger
  x86: Stop Intel PT and save its registers when panic occurs

 arch/x86/Kconfig  |   16 ++
 arch/x86/include/asm/intel_pt.h   |   84 +
 arch/x86/kernel/cpu/Makefile  |3 +
 arch/x86/kernel/cpu/intel_pt.h|  131 -
 arch/x86/kernel/cpu/intel_pt_cap.c|   69 +++
 arch/x86/kernel/cpu/intel_pt_log.c|  288 +
 arch/x86/kernel/cpu/intel_pt_perf.h   |   78 
 arch/x86/kernel/cpu/perf_event_intel_pt.c |   54 +-
 arch/x86/kernel/crash.c   |9 +
 9 files changed, 556 insertions(+), 176 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt.h
 delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c
 create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h




[PATCH RFC 2/3] x86: Add Intel PT logger

2015-07-28 Thread Takao Indoh
This patch provides Intel PT logging feature. When system boots with a
parameter intel_pt_log, log buffers for Intel PT are allocated and
logging starts, then processor flow information is written in the log
buffer by hardware like flight recorder. This is very helpful to
investigate a cause of kernel panic.

The log buffer size is specified by the parameter
intel_pt_log_buf_len=size. The buffer is used as a circular buffer, so
old events are overwritten by new ones.
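
As a rough illustration of the overwrite behavior described above, here
is a minimal user-space sketch of a circular buffer. The names (struct
ring, ring_put, ring_get, RING_SLOTS) are hypothetical and do not appear
in the patch; in the real logger the hardware writes the trace data, not
software.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size ring; the patch sizes its buffer in pages. */
#define RING_SLOTS 4

struct ring {
	uint64_t slot[RING_SLOTS];
	size_t head;   /* next slot to write */
	size_t count;  /* number of valid entries, capped at RING_SLOTS */
};

/* Store one event; once the ring is full the oldest entry is overwritten. */
static void ring_put(struct ring *r, uint64_t ev)
{
	r->slot[r->head] = ev;
	r->head = (r->head + 1) % RING_SLOTS;
	if (r->count < RING_SLOTS)
		r->count++;
}

/* Return the i-th oldest surviving event (i = 0 is the oldest). */
static uint64_t ring_get(const struct ring *r, size_t i)
{
	size_t start = (r->head + RING_SLOTS - r->count) % RING_SLOTS;

	return r->slot[(start + i) % RING_SLOTS];
}
```

After six events are pushed into this four-slot ring, only the last four
remain, so ring_get(&r, 0) returns the third event.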

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/Kconfig                   |   16 ++
 arch/x86/kernel/cpu/Makefile       |    2 +
 arch/x86/kernel/cpu/intel_pt_log.c |  288
 3 files changed, 306 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced1..c31400f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1658,6 +1658,22 @@ config X86_INTEL_MPX
 
  If unsure, say N.
 
+config X86_INTEL_PT_LOG
+   prompt "Intel PT logger"
+   def_bool n
+   depends on CPU_SUP_INTEL
+   ---help---
+ Intel PT is a hardware feature that can capture information
+ about program execution flow. Once Intel PT is enabled, the
+ events which change program flow, like branch instructions,
+ exceptions, interrupts, traps and so on are logged to
+ memory.
+
+ This option enables starting the Intel PT logging feature at
+ boot time. When a kernel panic occurs, the Intel PT log buffer
+ can be retrieved from the crash dump file to reconstruct the
+ detailed flow that led to the panic.
+
 config EFI
    bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 77d371c..24629ff 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -58,6 +58,8 @@ obj-$(CONFIG_X86_LOCAL_APIC)  += perfctr-watchdog.o perf_event_amd_ibs.o
 
 obj-$(CONFIG_HYPERVISOR_GUEST) += vmware.o hypervisor.o mshyperv.o
 
+obj-$(CONFIG_X86_INTEL_PT_LOG) += intel_pt_log.o
+
 ifdef CONFIG_X86_FEATURE_NAMES
 quiet_cmd_mkcapflags = MKCAP   $@
       cmd_mkcapflags = $(CONFIG_SHELL) $(srctree)/$(src)/mkcapflags.sh $< $@
diff --git a/arch/x86/kernel/cpu/intel_pt_log.c b/arch/x86/kernel/cpu/intel_pt_log.c
new file mode 100644
index 000..b1c4d66
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_log.c
@@ -0,0 +1,288 @@
+/*
+ * Intel Processor Trace Logger
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <asm/intel_pt.h>
+
+#define PT_LOG_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+struct pt_log_buf {
+   int cpu;
+
+   void **region;  /* array of pointer to output region */
+   int region_size;/* size of region array */
+   int region_order;   /* page order of region */
+
+   void **tbl; /* array of pointer to ToPA table */
+   int tbl_size;   /* size of tbl array */
+
+   /* Saved registers on panic */
+   u64 saved_msr_ctl;
+   u64 saved_msr_status;
+   u64 saved_msr_output_base;
+   u64 saved_msr_output_mask;
+};
+
+static int pt_log_enabled;
+static int pt_log_buf_nr_pages = 1024; /* number of pages for log buffer */
+
+static DEFINE_PER_CPU(struct pt_log_buf, pt_log_buf_ptr);
+static struct cpumask pt_cpu_mask;
+
+static void enable_pt(int enable)
+{
+   u64 ctl;
+
+   rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+   if (enable)
+   ctl |= RTIT_CTL_TRACEEN;
+   else
+   ctl &= ~RTIT_CTL_TRACEEN;
+
+   wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+}
+
+void save_intel_pt_registers(void)
+{
+   struct pt_log_buf *buf = this_cpu_ptr(&pt_log_buf_ptr);
+
+   if (!cpumask_test_cpu(smp_processor_id(), &pt_cpu_mask))
+   return;
+
+   enable_pt(0);
+
+   rdmsrl(MSR_IA32_RTIT_CTL, buf->saved_msr_ctl);
+   rdmsrl(MSR_IA32_RTIT_STATUS, buf->saved_msr_status);
+   rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, buf->saved_msr_output_base);
+   rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, buf->saved_msr_output_mask);
+}
+
+static void setup_pt_ctl_register(void)
+{
+   u64 reg;
+
+   rdmsrl(MSR_IA32_RTIT_CTL, reg);
+
+   reg |= RTIT_CTL_OS|RTIT_CTL_USR|RTIT_CTL_TOPA|RTIT_CTL_TSC_EN|RTIT_CTL_BRANCH_EN;
+
+   wrmsrl(MSR_IA32_RTIT_CTL, reg);
+}
+
+static void setup_pt_output_register(void *base, unsigned int topa_idx,
+unsigned int output_off)
+{
+   u64 reg;
+
+   wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(base));
+
+   reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+   wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+static void *pt_alloc_pages(void **buf, int *index, int node, int order)
+{
+   struct page *page;
+   void *ptr = NULL;
+
+   page
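
The value written to MSR_IA32_RTIT_OUTPUT_MASK in
setup_pt_output_register() packs two fields into one 64-bit word. A
small user-space sketch of that packing, and of the reverse step a
debugger would need when reading the saved register, might look like the
following. The helper and macro names are hypothetical, and the field
positions are taken from the expression in the patch, assuming the
shifts there are left shifts by 7 and 32.

```c
#include <stdint.h>

/* Bits 6:0 are written as ones, as in the patch (0x7f). */
#define PT_MASK_LOW_BITS   0x7fULL
#define PT_MASK_IDX_SHIFT  7    /* ToPA table index starts at bit 7 */
#define PT_MASK_OFF_SHIFT  32   /* output offset occupies bits 63:32 */

/* Build the register value from a table index and an output offset. */
static uint64_t pt_mask_pack(uint32_t topa_idx, uint32_t output_off)
{
	return PT_MASK_LOW_BITS |
	       ((uint64_t)topa_idx << PT_MASK_IDX_SHIFT) |
	       ((uint64_t)output_off << PT_MASK_OFF_SHIFT);
}

/* Recover the table index from bits 31:7 of a saved register value. */
static uint32_t pt_mask_topa_idx(uint64_t reg)
{
	return (uint32_t)((reg & 0xffffffffULL) >> PT_MASK_IDX_SHIFT);
}

/* Recover the offset within the current output region (bits 63:32). */
static uint32_t pt_mask_output_off(uint64_t reg)
{
	return (uint32_t)(reg >> PT_MASK_OFF_SHIFT);
}
```

The index must fit in the 25 bits between bit 7 and bit 31; a larger
value would spill into the offset field.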

Re: [PATCH RFC 0/3] x86: Intel Processor Trace Logger

2015-07-28 Thread Takao Indoh
On 2015/07/29 14:44, Alexander Shishkin wrote:
> Takao Indoh <indou.ta...@jp.fujitsu.com> writes:
>
>> Hi all,
>>
>> This patch creates log buffer for Intel PT and enable logging at boot
>> time. When kernel panic occurs, we can get this log buffer from
>> crashdump file by kdump, and reconstruct the flow that led to the panic.
>
> Good to see this work going forward!
>
>> Takao Indoh (3):
>>   x86: Add Intel PT common files
>>   x86: Add Intel PT logger
>>   x86: Stop Intel PT and save its registers when panic occurs
>>
>>  arch/x86/Kconfig                          |   16 ++
>>  arch/x86/include/asm/intel_pt.h           |   84 +
>>  arch/x86/kernel/cpu/Makefile              |    3 +
>>  arch/x86/kernel/cpu/intel_pt.h            |  131 -
>>  arch/x86/kernel/cpu/intel_pt_cap.c        |   69 +++
>>  arch/x86/kernel/cpu/intel_pt_log.c        |  288 +
>>  arch/x86/kernel/cpu/intel_pt_perf.h       |   78
>>  arch/x86/kernel/cpu/perf_event_intel_pt.c |   54 +-
>>  arch/x86/kernel/crash.c                   |    9 +
>>  9 files changed, 556 insertions(+), 176 deletions(-)
>>  create mode 100644 arch/x86/include/asm/intel_pt.h
>>  delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
>>  create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
>>  create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c
>>  create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h
>
> One note here: you want to use -M with git-format-patch so that renames
> are handled better.

Thank you, I didn't know about this option. I'll use it next time.

Thanks,
Takao Indoh


 
> Regards,
> --
> Alex
 




[PATCH RFC 3/3] x86: Stop Intel PT and save its registers when panic occurs

2015-07-28 Thread Takao Indoh
When a panic occurs, Intel PT logging is stopped to prevent it from
overwriting its log buffer. The Intel PT registers are saved to memory
on panic; a debugger needs them to find the last position where Intel PT
wrote data.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/include/asm/intel_pt.h |    2 ++
 arch/x86/kernel/crash.c         |    9 +
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
index 7cb16e1..71bcd8d 100644
--- a/arch/x86/include/asm/intel_pt.h
+++ b/arch/x86/include/asm/intel_pt.h
@@ -79,4 +79,6 @@ u32 pt_cap_get(enum pt_capabilities cap);
 const char *pt_cap_name(enum pt_capabilities cap);
 int pt_cap_num(void);
 
+void save_intel_pt_registers(void);
+
 #endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..953c086 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/intel_pt.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -127,6 +128,10 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
disable_local_APIC();
 }
 
@@ -172,6 +177,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
 
+#ifdef CONFIG_X86_INTEL_PT_LOG
+   save_intel_pt_registers();
+#endif
+
 #ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
ioapic_zap_locks();
-- 
1.7.1




[PATCH RFC 1/3] x86: Add Intel PT common files

2015-07-28 Thread Takao Indoh
Rename the existing intel_pt.h to intel_pt_perf.h as a perf-specific
header, and create a new intel_pt.h as a common header for the Intel PT
feature. Also add intel_pt_cap.c for Intel PT capability handling.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 arch/x86/include/asm/intel_pt.h           |   82 ++
 arch/x86/kernel/cpu/Makefile              |    1 +
 arch/x86/kernel/cpu/intel_pt.h            |  131 -
 arch/x86/kernel/cpu/intel_pt_cap.c        |   69 +++
 arch/x86/kernel/cpu/intel_pt_perf.h       |   78 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c |   54 ++--
 6 files changed, 239 insertions(+), 176 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pt.h
 delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
 create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
new file mode 100644
index 000..7cb16e1
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt.h
@@ -0,0 +1,82 @@
+/*
+ * Intel(R) Processor Trace common header
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+   TOPA_4K = 0,
+   TOPA_8K,
+   TOPA_16K,
+   TOPA_32K,
+   TOPA_64K,
+   TOPA_128K,
+   TOPA_256K,
+   TOPA_512K,
+   TOPA_1MB,
+   TOPA_2MB,
+   TOPA_4MB,
+   TOPA_8MB,
+   TOPA_16MB,
+   TOPA_32MB,
+   TOPA_64MB,
+   TOPA_128MB,
+   TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+   return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+   u64 end     : 1;
+   u64 rsvd0   : 1;
+   u64 intr    : 1;
+   u64 rsvd1   : 1;
+   u64 stop    : 1;
+   u64 rsvd2   : 1;
+   u64 size    : 4;
+   u64 rsvd3   : 2;
+   u64 base    : 36;
+   u64 rsvd4   : 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+/*
+ * Capability stuff
+ */
+enum pt_capabilities {
+   PT_CAP_max_subleaf = 0,
+   PT_CAP_cr3_filtering,
+   PT_CAP_topa_output,
+   PT_CAP_topa_multiple_entries,
+   PT_CAP_payloads_lip,
+};
+
+void pt_cap_init(void);
+u32 pt_cap_get(enum pt_capabilities cap);
+const char *pt_cap_name(enum pt_capabilities cap);
+int pt_cap_num(void);
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..77d371c 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_CPU_SUP_CYRIX_32)+= cyrix.o
 obj-$(CONFIG_CPU_SUP_CENTAUR)  += centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)   += umc.o
+obj-$(CONFIG_CPU_SUP_INTEL)+= intel_pt_cap.o
 
 obj-$(CONFIG_PERF_EVENTS)  += perf_event.o
 
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
deleted file mode 100644
index 1c338b0..000
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ /dev/null
@@ -1,131 +0,0 @@
-/*
- * Intel(R) Processor Trace PMU driver for perf
- * Copyright (c) 2013-2014, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
- *
- * Intel PT is specified in the Intel Architecture Instruction Set Extensions
- * Programming Reference:
- * http://software.intel.com/en-us/intel-isa-extensions
- */
-
-#ifndef __INTEL_PT_H__
-#define __INTEL_PT_H__
-
-/*
- * Single-entry ToPA: when this close to region boundary, switch
- * buffers to avoid losing data.
- */
-#define TOPA_PMI_MARGIN 512
-
-/*
- * Table of Physical Addresses bits
- */
-enum topa_sz {
-   TOPA_4K = 0,
-   TOPA_8K,
-   TOPA_16K,
-   TOPA_32K,
-   TOPA_64K,
-   TOPA_128K,
-   TOPA_256K,
-   TOPA_512K
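
The topa_sz enum and the sizes() helper in the header above encode a
region size as a small integer, where encoding n stands for a region of
2^(n + 12) bytes. A minimal user-space sketch of that mapping and its
inverse could look like this. The function names and the
TOPA_ENCODINGS constant are hypothetical; only TOPA_SHIFT and the 4K to
128MB range come from the patch.

```c
#include <stdint.h>

#define TOPA_SHIFT     12
#define TOPA_ENCODINGS 16   /* encodings 0..15 cover 4 KiB to 128 MiB */

/* Bytes covered by a ToPA entry with the given size encoding. */
static uint64_t topa_bytes(unsigned int enc)
{
	return 1ULL << (enc + TOPA_SHIFT);
}

/* Smallest encoding whose region holds at least `bytes`, or -1 if none. */
static int topa_encoding_for(uint64_t bytes)
{
	unsigned int enc;

	for (enc = 0; enc < TOPA_ENCODINGS; enc++)
		if (topa_bytes(enc) >= bytes)
			return (int)enc;
	return -1;
}
```

For example, a 2 MiB region maps to encoding 9, which matches the
position of TOPA_2MB in the enum.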

Re: [PATCH v7 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-01-07 Thread Takao Indoh
On 2015/01/07 17:52, Li, ZhenHua wrote:
> Well, that's quite good news.
> Looking forward Takao's testing on his system.

Unfortunately, a DMAR fault still occurs with this patch...
I have attached the console log.

Thanks,
Takao Indoh
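
For reference, the contiguous-memory figures quoted below (one page for
the root entry table, 65536 * 16 bytes for the IRTE table) can be
checked with a small user-space sketch. The macro and function names are
hypothetical; the 256-entry, 16-byte root table layout is an assumption
based on the VT-d specification rather than something stated in this
thread.

```c
#include <stddef.h>

#define ROOT_ENTRIES     256u     /* assumed: one root entry per PCI bus */
#define ROOT_ENTRY_BYTES 16u      /* assumed: each root entry is 128 bits */
#define IRTE_ENTRIES     65536u   /* table size quoted in the thread */
#define IRTE_BYTES       16u      /* entry size quoted in the thread */

/* Size of the root entry table in bytes. */
static size_t root_table_bytes(void)
{
	return (size_t)ROOT_ENTRIES * ROOT_ENTRY_BYTES;
}

/* Size of the full interrupt remapping table in bytes. */
static size_t irte_table_bytes(void)
{
	return (size_t)IRTE_ENTRIES * IRTE_BYTES;
}

/* Byte offset of one IRTE inside the table, i.e. index * entry size. */
static size_t irte_offset(unsigned int index)
{
	return (size_t)index * IRTE_BYTES;
}
```

With these numbers the root table is 4096 bytes and the IRTE table is
1 MiB; the entry at fault index 42 would sit at byte offset 672.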

> 
> Regards
> Zhenhua
> On 01/07/2015 04:28 PM, Baoquan He wrote:
>> On 01/07/15 at 01:25pm, Li, ZhenHua wrote:
>>> It is same as the last one I send to you yesterday.
>>>
>>> The continuous memory that needed for data in this patchset:
>>> RE: PAGE_SIZE, 4096 Bytes;
>>> IRTE: 65536 * 16 ; 1M Bytes;
>>>
>>> It should use same memory as the old versions of this patchset. The
>>> changes for the last version do not need more memory.
>>
>> Hi Zhenhua,
>>
>> It was my mistake because I didn't strip the debug info of modules, then
>> initramfs is bloated very big. Just now I tested the latest version, it
>> works well and dump is successful. No dmar fault and intr-remap fault
>> seen any more, good job!
>>
>> Thanks
>> Baoquan
>>
>>
>>>
>>> Regards
>>> Zhenhua
>>>
>>> On 01/07/2015 01:02 PM, Baoquan He wrote:
>>>> On 01/07/15 at 12:11pm, Li, ZhenHua wrote:
>>>>> Many thanks to Takao Indoh and Baoquan He, for your testing on more
>>>>> different systems.
>>>>>
>>>>> The calling of flush functions are added to this version.
>>>>>
>>>>> The usage of __iommu_flush_cache function :
>>>>> 1. Fixes a dump on Takao's system.
>>>>> 2. Reduces the count of faults on Baoquan's system.
>>>>
>>>> I am testing the version you sent to me yesterday afternoon. Is that
>>>> different with this patchset? I found your patchset man reserve a big
>>>> contiguous memory region under 896M, this will cause the crashkernel
>>>> reservation failed when I set crashkernel=320M. The reason I increase
>>>> the crashkerenl reservation to 320M is 256M is not enough and cause OOM
>>>> when that patchset is tested.
>>>>
>>>> I am checking what happened.
>>>>
>>>>
>>>> Thanks
>>>> Baoquan
>>>>
>>>>>
>>>>> Regards
>>>>> Zhenhua
>>>>>
>>>>> On 01/07/2015 12:04 PM, Li, Zhen-Hua wrote:
>>>>>> This patchset is an update of Bill Sumner's patchset, implements a fix 
>>>>>> for:
>>>>>> If a kernel boots with intel_iommu=on on a system that supports intel 
>>>>>> vt-d,
>>>>>> when a panic happens, the kdump kernel will boot with these faults:
>>>>>>
>>>>>>  dmar: DRHD: handling fault status reg 102
>>>>>>  dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
>>>>>>  DMAR:[fault reason 01] Present bit in root entry is clear
>>>>>>
>>>>>>  dmar: DRHD: handling fault status reg 2
>>>>>>  dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
>>>>>>  INTR-REMAP:[fault reason 34] Present field in the IRTE entry is 
>>>>>> clear
>>>>>>
>>>>>> On some system, the interrupt remapping fault will also happen even if 
>>>>>> the
>>>>>> intel_iommu is not set to on, because the interrupt remapping will be 
>>>>>> enabled
>>>>>> when x2apic is needed by the system.
>>>>>>
>>>>>> The cause of the DMA fault is described in Bill's original version, and 
>>>>>> the
>>>>>> INTR-Remap fault is caused by a similar reason. In short, the 
>>>>>> initialization
>>>>>> of vt-d drivers causes the in-flight DMA and interrupt requests get wrong
>>>>>> response.
>>>>>>
>>>>>> To fix this problem, we modifies the behaviors of the intel vt-d in the
>>>>>> crashdump kernel:
>>>>>>
>>>>>> For DMA Remapping:
>>>>>> 1. To accept the vt-d hardware in an active state,
>>>>>> 2. Do not disable and re-enable the translation, keep it enabled.
>>>>>> 3. Use the old root entry table, do not rewrite the RTA register.
>>>>>> 4. Malloc and use new context entry table and page table, copy data from 
>>>>>> the
>>>>>> old ones that used by the old kernel.
>>>>>> 5. to use different portions of the iova address ranges for the device drivers
>>>>>> in the crashdump kernel than the iova ranges that were in-use at the time
>>>>>> of the panic.
>>>>>> 6. After device driver is loaded, when it issues the first dma_map command,
>>>>>> free the dmar_domain structure for this device, and generate a new one, so
>>>>>> that the device can be assigned a new and empty page table.
>>>>>> 7. When a new context entry table is generated, we also save its address to
>>>>>> the old root entry table.
>>>>>>
>>>>>> For Interrupt Remapping:
>>>>>> 1. To accept the vt-d hardware in an active state,
>>>>>> 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
>>>>>> 3. Use the old interrupt remapping table, do not rewrite the IRTA register.
>>>>>> 4. When ioapic entry is setup, the interrupt remapping table is changed, and
>>>>>> the updated data will be stored to the old interrupt remapping table.
>>>>>>
>>>>>> Advantages of this approach:
>>>>>> 1. All manipulation of the IO-device is done by the Linux device-driver
>>>>>> for that device.
>>>>>> 2. This approach behaves in a manner very similar to operation without an
>>>>>> active iommu.
>>>>>> 3. Any activity between the IO-device and its RMRR areas is handled by the
>>>>>> device-driver in the same manner as during a non-kdump boot.
>>>>>> 4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
>>>>>> This supports the practice of creating a special kdump kernel without
>>>>>> drivers for any devices that are not required for taking a crashdump.
>>>>>> 5. Minimal code-changes among the existing mainline intel vt-d code.
>>>>>>
>>>>>> Summary of changes in this patch set:
>>>>>> 1. Added some useful function for root entry table in code intel-iommu.c
>>>>>> 2. Added new members to struct root_entry

Re: [PATCH 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-01-05 Thread Takao Indoh
On 2014/12/29 12:15, Li, ZhenHua wrote:
> Hi Takao Indoh,
> 
> Happy New Year, and thank you very much for you help.  The flush is quite

Happy new year!

>   a problem,  as there are several places the flush function should be called,
> I think the flush should be placed in functions like __iommu_update_old_*.
> Created a small patch for this, it is attached.
> 
> 
> 
> As I cannot reproduce your problems on my system, so could you please try
> these steps?
> 1. Apply the latest patchset, including 9/10 and 10/10, and then apply the
> attached patch_for_flush.patch.  And then test the kernel.

No intr-remap fault, but there is still a DMAR fault message.

> 
> 2.  If 1 does not fix the DMAR fault  problems, then it might be caused by
> 7/10, so please *unpatch* it from the kernel (others and the  attached one
> should be patched), and then test the kernel.

The DMAR fault still occurs. I'll dig into the iommu driver code to find
out the reason.

Thanks,
Takao Indoh

> 
> Regards
> Zhenhua
> 
> On 12/26/2014 03:27 PM, Takao Indoh wrote:
>> On 2014/12/26 15:46, Li, ZhenHua wrote:
>>> Hi Takao Indoh,
>>>
>>> Thank you very much for your testing. I will add your update in next
>>> version.
>>> Also I think a flush for __iommu_update_old_root_entry is also necessary.
>>>
>>> Currently I have no idea about your fault, does it happen before or
>>> during its loading? Could you send me your full kernel log as an
>>> attachment?
>> Sure, see attached file.
>>
>> I removed 9/10 and 10/10 patches from my kernel to avoid panic problem I
>> reported in previous mail, and then tested kdump. So please ignore
>> intr-remap fault message in log file. Also please ignore stack trace
>> starting with the following message, it's a problem of my box.
>>
>>Flags mismatch irq 0. 0080 (i801_smbus) vs. 00015a00 (timer)
>>
>> Thanks,
>> Takao Indoh
>>
>>> Regards and Merry Christmas.
>>> Zhenhua
>>>
>>> On 12/26/2014 01:13 PM, Takao Indoh wrote:
>>>> Hi Zhen-Hua,
>>>>
>>>> I tested your patch and found two problems.
>>>>
>>>> [1]
>>>> Kenel panic occurs during 2nd kernel boot.
>>>>
>>>> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>>>> Kernel panic - not syncing: timer doesn't work through Interrupt-remapped 
>>>> IO-APIC
>>>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.0 #25
>>>> Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015 
>>>> Rev.3D81.3030 02/10/2012
>>>>0002 880036167d08 815b1c6a 
>>>>817f7670 880036167d88 815b19f1 0008
>>>>880036167d98 880036167d38 810a5d2f 880036167d98
>>>> Call Trace:
>>>>[<815b1c6a>] dump_stack+0x48/0x5e
>>>>[<815b19f1>] panic+0xbb/0x1fa
>>>>[<810a5d2f>] ? vprintk_default+0x1f/0x30
>>>>[<814c6a6c>] panic_if_irq_remap+0x1c/0x20
>>>>[<81b53985>] check_timer+0x1e7/0x5ed
>>>>[<8129bd9d>] ? radix_tree_lookup+0xd/0x10
>>>>[<81b5413b>] setup_IO_APIC+0x261/0x292
>>>>[<81b50302>] native_smp_prepare_cpus+0x214/0x25d
>>>>[<81b41c65>] kernel_init_freeable+0x1dc/0x28c
>>>>[<815aaf00>] ? rest_init+0x80/0x80
>>>>[<815aaf0e>] kernel_init+0xe/0xf0
>>>>[<815b5d2c>] ret_from_fork+0x7c/0xb0
>>>>[<815aaf00>] ? rest_init+0x80/0x80
>>>> ---[ end Kernel panic - not syncing: timer doesn't work through 
>>>> Interrupt-remapped IO-APIC
>>>>
>>>>
>>>> This panic seems to be related to unflushed cache. I confirmed this
>>>> problem was fixed by the following patch.
>>>>
>>>> --- a/drivers/iommu/intel_irq_remapping.c
>>>> +++ b/drivers/iommu/intel_irq_remapping.c
>>>> @@ -200,8 +200,13 @@ static int modify_irte(int irq, struct irte 
>>>> *irte_modified)
>>>>set_64bit(&irte->high, irte_modified->high);
>>>>
>>>>#ifdef CONFIG_CRASH_DUMP
>>>> -  if (is_kdump_kernel())
>>>> +  if (is_kdump_kernel()) {
>>>>__iommu_update_old_irte(iommu, index);
>>>> +  __iommu_flush_cache(iommu,
>>>> +  iommu->ir_table->base_old_virt +
>>>> +  index * sizeof(struct irte),
>>>> +  sizeof(struct irte));
>>>> +  }
>>>>#endif
>>>>__iommu_flush_cache(iommu, irte, sizeof(*irte));
>>>>
>>>>
>>>> [2]
>>>> Some DMAR error messages are still found in 2nd kernel boot.
>>>>
>>>> dmar: DRHD: handling fault status reg 2
>>>> dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr ffded000
>>>> DMAR:[fault reason 01] Present bit in root entry is clear
>>>>
>>>> I confiremd your commit 1a2262 was already applied. Any idea?
>>>>
>>>> Thanks,
>>>> Takao Indoh
>>>>
>>>>
>>>> On 2014/12/22 18:15, Li, Zhen-Hua wrote:
>>>>> This patchset is an update of Bill Sumner's patchset, implements a fix for:
>>>>> If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
>>>>> when a panic happens, the kdump kernel will boot with these faults:
>>>>>
>>>>>   dmar: DRHD: handling fault status reg 102
>>>>>   dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
>>>>>   DMAR:[fault reason 01] Present bit in root entry is clear
>>>>>
>>>>>   dmar: DRHD: handling fault status reg 2
>>>>>   dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
>>>>>   INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
>>>>>
>>>>> On some system, the interrupt remapping fault

Re: [PATCH 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2014-12-25 Thread Takao Indoh
On 2014/12/26 15:46, Li, ZhenHua wrote:
> Hi Takao Indoh,
> 
> Thank you very much for your testing. I will add your update in next
> version.
> Also I think a flush for __iommu_update_old_root_entry is also necessary.
> 
> Currently I have no idea about your fault, does it happen before or
> during its loading? Could you send me your full kernel log as an
> attachment?

Sure, see the attached file.

I removed the 9/10 and 10/10 patches from my kernel to avoid the panic
problem I reported in my previous mail, and then tested kdump. So please
ignore the intr-remap fault message in the log file. Also, please ignore
the stack trace starting with the following message; it is a problem
with my box.

  Flags mismatch irq 0. 0080 (i801_smbus) vs. 00015a00 (timer)

Thanks,
Takao Indoh

> 
> Regards and Merry Christmas.
> Zhenhua
> 
> On 12/26/2014 01:13 PM, Takao Indoh wrote:
>> Hi Zhen-Hua,
>>
>> I tested your patch and found two problems.
>>
>> [1]
>> Kenel panic occurs during 2nd kernel boot.
>>
>> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>> Kernel panic - not syncing: timer doesn't work through Interrupt-remapped 
>> IO-APIC
>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.0 #25
>> Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015 Rev.3D81.3030 
>> 02/10/2012
>>   0002 880036167d08 815b1c6a 
>>   817f7670 880036167d88 815b19f1 0008
>>   880036167d98 880036167d38 810a5d2f 880036167d98
>> Call Trace:
>>   [<815b1c6a>] dump_stack+0x48/0x5e
>>   [<815b19f1>] panic+0xbb/0x1fa
>>   [<810a5d2f>] ? vprintk_default+0x1f/0x30
>>   [<814c6a6c>] panic_if_irq_remap+0x1c/0x20
>>   [<81b53985>] check_timer+0x1e7/0x5ed
>>   [<8129bd9d>] ? radix_tree_lookup+0xd/0x10
>>   [<81b5413b>] setup_IO_APIC+0x261/0x292
>>   [<81b50302>] native_smp_prepare_cpus+0x214/0x25d
>>   [<81b41c65>] kernel_init_freeable+0x1dc/0x28c
>>   [<815aaf00>] ? rest_init+0x80/0x80
>>   [<815aaf0e>] kernel_init+0xe/0xf0
>>   [<815b5d2c>] ret_from_fork+0x7c/0xb0
>>   [<815aaf00>] ? rest_init+0x80/0x80
>> ---[ end Kernel panic - not syncing: timer doesn't work through 
>> Interrupt-remapped IO-APIC
>>
>>
>> This panic seems to be related to unflushed cache. I confirmed this
>> problem was fixed by the following patch.
>>
>> --- a/drivers/iommu/intel_irq_remapping.c
>> +++ b/drivers/iommu/intel_irq_remapping.c
>> @@ -200,8 +200,13 @@ static int modify_irte(int irq, struct irte 
>> *irte_modified)
>>  set_64bit(&irte->high, irte_modified->high);
>>   
>>   #ifdef CONFIG_CRASH_DUMP
>> -if (is_kdump_kernel())
>> +if (is_kdump_kernel()) {
>>  __iommu_update_old_irte(iommu, index);
>> +__iommu_flush_cache(iommu,
>> +iommu->ir_table->base_old_virt +
>> +index * sizeof(struct irte),
>> +sizeof(struct irte));
>> +}
>>   #endif
>>  __iommu_flush_cache(iommu, irte, sizeof(*irte));
>>   
>>
>> [2]
>> Some DMAR error messages are still found in 2nd kernel boot.
>>
>> dmar: DRHD: handling fault status reg 2
>> dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr ffded000
>> DMAR:[fault reason 01] Present bit in root entry is clear
>>
>> I confirmed your commit 1a2262 was already applied. Any idea?
>>
>> Thanks,
>> Takao Indoh
>>
>>
>> On 2014/12/22 18:15, Li, Zhen-Hua wrote:
>>> This patchset is an update of Bill Sumner's patchset, implements a fix for:
>>> If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
>>> when a panic happens, the kdump kernel will boot with these faults:
>>>
>>>   dmar: DRHD: handling fault status reg 102
>>>   dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
>>>   DMAR:[fault reason 01] Present bit in root entry is clear
>>>
>>>   dmar: DRHD: handling fault status reg 2
>>>   dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
>>>   INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
>>>
>>> On some systems, the interrupt remapping fault will also happen even if the
>>> intel_iommu is not set to on, because the interrupt remapping will be 
>>> enabled
>>> when x2apic is needed by the system.
>>>
>>> The cause of the DMA fault is described in Bill's original version, and the
>>> INTR-Remap fault is caused by a similar reason. In short, the initialization
>>> of vt-d drivers causes the in-flight DMA and interrupt requests get wrong
>>> response.

Re: [PATCH 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2014-12-25 Thread Takao Indoh
Hi Zhen-Hua,

I tested your patch and found two problems.

[1]
Kernel panic occurs during 2nd kernel boot.

..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
Kernel panic - not syncing: timer doesn't work through Interrupt-remapped 
IO-APIC
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.0 #25
Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015 Rev.3D81.3030 
02/10/2012
 0002 880036167d08 815b1c6a 
 817f7670 880036167d88 815b19f1 0008
 880036167d98 880036167d38 810a5d2f 880036167d98
Call Trace:
 [815b1c6a] dump_stack+0x48/0x5e
 [815b19f1] panic+0xbb/0x1fa
 [810a5d2f] ? vprintk_default+0x1f/0x30
 [814c6a6c] panic_if_irq_remap+0x1c/0x20
 [81b53985] check_timer+0x1e7/0x5ed
 [8129bd9d] ? radix_tree_lookup+0xd/0x10
 [81b5413b] setup_IO_APIC+0x261/0x292
 [81b50302] native_smp_prepare_cpus+0x214/0x25d
 [81b41c65] kernel_init_freeable+0x1dc/0x28c
 [815aaf00] ? rest_init+0x80/0x80
 [815aaf0e] kernel_init+0xe/0xf0
 [815b5d2c] ret_from_fork+0x7c/0xb0
 [815aaf00] ? rest_init+0x80/0x80
---[ end Kernel panic - not syncing: timer doesn't work through 
Interrupt-remapped IO-APIC


This panic seems to be related to unflushed cache. I confirmed this
problem was fixed by the following patch.

--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -200,8 +200,13 @@ static int modify_irte(int irq, struct irte *irte_modified)
set_64bit(&irte->high, irte_modified->high);
 
 #ifdef CONFIG_CRASH_DUMP
-   if (is_kdump_kernel())
+   if (is_kdump_kernel()) {
__iommu_update_old_irte(iommu, index);
+   __iommu_flush_cache(iommu,
+   iommu->ir_table->base_old_virt +
+   index * sizeof(struct irte),
+   sizeof(struct irte));
+   }
 #endif
__iommu_flush_cache(iommu, irte, sizeof(*irte));
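
The range handed to the extra flush is simply the byte offset of one entry
inside the old table. A minimal user-space sketch of that arithmetic, not
kernel code: irte_sketch, irte_offset and irte_in_table are hypothetical
names, and the 16-byte layout (two 64-bit words) is assumed to match
struct irte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct irte: two 64-bit words, 16 bytes. */
struct irte_sketch {
	uint64_t low;
	uint64_t high;
};

/* Byte offset of entry `index` from the start of the table. */
size_t irte_offset(size_t index)
{
	return index * sizeof(struct irte_sketch);
}

/* Address of entry `index`, i.e. the start of the range to flush. */
void *irte_in_table(void *table_base, size_t index)
{
	assert(table_base != NULL);
	return (char *)table_base + irte_offset(index);
}
```

The length of the flushed range is then sizeof(struct irte_sketch), which
mirrors the sizeof(struct irte) argument in the patch above.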
 

[2]
Some DMAR error messages are still found in 2nd kernel boot.

dmar: DRHD: handling fault status reg 2
dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr ffded000
DMAR:[fault reason 01] Present bit in root entry is clear

I confirmed your commit 1a2262 was already applied. Any idea?

Thanks,
Takao Indoh


On 2014/12/22 18:15, Li, Zhen-Hua wrote:
> This patchset is an update of Bill Sumner's patchset, implements a fix for:
> If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
> when a panic happens, the kdump kernel will boot with these faults:
> 
>  dmar: DRHD: handling fault status reg 102
>  dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
>  DMAR:[fault reason 01] Present bit in root entry is clear
> 
>  dmar: DRHD: handling fault status reg 2
>  dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
>  INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
> 
> On some systems, the interrupt remapping fault will also happen even if the
> intel_iommu is not set to on, because the interrupt remapping will be enabled
> when x2apic is needed by the system.
> 
> The cause of the DMA fault is described in Bill's original version, and the
> INTR-Remap fault is caused by a similar reason. In short, the initialization
> of vt-d drivers causes the in-flight DMA and interrupt requests to get the
> wrong response.
> 
> To fix this problem, we modify the behavior of the intel vt-d in the
> crashdump kernel:
> 
> For DMA Remapping:
> 1. Accept the vt-d hardware in an active state.
> 2. Do not disable and re-enable the translation, keep it enabled.
> 3. Use the old root entry table, do not rewrite the RTA register.
> 4. Malloc and use new context entry table and page table, copy data from the
> old ones that were used by the old kernel.
> 5. Use different portions of the iova address ranges for the device drivers
> in the crashdump kernel than the iova ranges that were in-use at the time
> of the panic.
> 6. After device driver is loaded, when it issues the first dma_map command,
> free the dmar_domain structure for this device, and generate a new one, so
> that the device can be assigned a new and empty page table.
> 7. When a new context entry table is generated, we also save its address to
> the old root entry table.
> 
> For Interrupt Remapping:
> 1. Accept the vt-d hardware in an active state.
> 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
> 3. Use the old interrupt remapping table, do not rewrite the IRTA register.
> 4. When ioapic entry is setup, the interrupt remapping table is changed, and
> the updated data will be stored to the old interrupt remapping table.
> 
> Advantages of this approach:
> 1. All manipulation of the IO-device is done by the Linux device-driver
> for that device.
> 2. This approach behaves in a manner very similar to operation without an
> active iommu.
> 3. Any activity between

Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-11-05 Thread Takao Indoh
(2014/11/06 11:11), Li, ZhenHua wrote:
> This patch does the same thing as you said in your mail.
> It should work, I have tested on my HP huge system.

Yep, I also confirmed it worked.

BTW, I found another problem. When I tested your patches with 3.17
kernel, iommu initialization failed with the following message.

IOMMU intel_iommu_in_crashdump = true
IOMMU Skip disabling iommu hardware translations
IOMMU Copying translate tables from panicked kernel
IOMMU: Copy translate tables failed
IOMMU: dmar init failed


I found that oldcopy() from physical address 0x15000 failed.
oldcopy() copies data using ioremap, and ioremap for the memory region
around 0x15000 does not work because it is already mapped to virtual
space.


static void __iomem *__ioremap_caller(resource_size_t phys_addr,
unsigned long size, unsigned long prot_val, void *caller)
{
(snip)
/*
 * Don't allow anybody to remap normal RAM that we're using..
 */
pfn  = phys_addr >> PAGE_SHIFT;
last_pfn = last_addr >> PAGE_SHIFT;
if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
  __ioremap_check_ram) == 1)
return NULL;
   


I'm not sure how we should handle this, but as far as I tested the
following fix works.

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3a9e7b8..8d2bd23 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4871,7 +4871,12 @@ static int oldcopy(void *to, void *from, int size)

pfn = ((unsigned long) from) >> VTD_PAGE_SHIFT;
offset = ((unsigned long) from) & (~VTD_PAGE_MASK);
-   ret = copy_oldmem_page(pfn, buf, csize, offset, userbuf);
+
+   if (page_is_ram(pfn)) {
+   memcpy(buf, pfn_to_kaddr(pfn) + offset, csize);
+   ret = size;
+   } else
+   ret = copy_oldmem_page(pfn, buf, csize, offset, userbuf);

return (int) ret;
 }
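
For reference, the pfn/offset split that oldcopy() performs before choosing
between the two copy paths can be sketched in plain user-space C. The names
below (SKETCH_PAGE_SHIFT, addr_to_pfn, addr_to_offset, pfn_offset_to_addr)
are hypothetical, and 4 KiB pages are assumed in place of VTD_PAGE_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB pages, standing in for VTD_PAGE_SHIFT / VTD_PAGE_MASK. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Page frame number: the address with the in-page bits shifted out. */
uint64_t addr_to_pfn(uint64_t phys)
{
	return phys >> SKETCH_PAGE_SHIFT;
}

/* Offset inside the page: only the low in-page bits. */
uint64_t addr_to_offset(uint64_t phys)
{
	return phys & ~SKETCH_PAGE_MASK;
}

/* Inverse of the split, used here only to check the arithmetic. */
uint64_t pfn_offset_to_addr(uint64_t pfn, uint64_t offset)
{
	assert(offset < SKETCH_PAGE_SIZE);
	return (pfn << SKETCH_PAGE_SHIFT) | offset;
}
```

With these assumptions the failing address 0x15000 splits into pfn 0x15 and
offset 0, so the whole page falls in the low RAM range that ioremap refuses
to map.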


Thanks,
Takao Indoh


> 
> On 11/06/2014 09:48 AM, Takao Indoh wrote:
>> (2014/11/06 10:35), Li, ZhenHua wrote:
>>> Yes, that's it. The function context_set_address_root does not set the
>>> address root correctly.
>>>
>>> I have created another patch for it, see
>>> https://lkml.org/lkml/2014/11/5/43
>>
>> Oh, ok. I'll try again with this patch, thank you.
>>
>> Thanks,
>> Takao Indoh
>>
>>>
>>> Thanks
>>> Zhenhua
>>>
>>> On 11/06/2014 09:31 AM, Takao Indoh wrote:
>>>> Hi Zhenhua, Baoquan,
>>>>
>>>> (2014/10/22 19:05), Baoquan He wrote:
>>>>> Hi Zhenhua,
>>>>>
>>>>> I tested your latest patch on 3.18.0-rc1+, there are still some dmar
>>>>> errors. I remember it worked well with Bill's original patchset.
>>>>
>>>> This should be a problem in copy_context_entry().
>>>>
>>>>> +static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 
>>>>> devfn,
>>>>> +   void *ppap, struct context_entry *ce)
>>>>> +{
>>>>> + int ret = 0;/* Integer Return Code */
>>>>> + u32 shift = 0;  /* bits to shift page_addr  */
>>>>> + u64 page_addr = 0;  /* Address of translated page */
>>>>> + struct dma_pte *pgt_old_phys;   /* Adr(page_table in the old kernel) */
>>>>> + struct dma_pte *pgt_new_phys;   /* Adr(page_table in the new kernel) */
>>>>> + u8  t;  /* Translation-type from context */
>>>>> + u8  aw; /* Address-width from context */
>>>>> + u32 aw_shift[8] = {
>>>>> + 12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
>>>>> + 12+9+9+9,   /* [001b] 39-bit AGAW (3-level page table) */
>>>>> + 12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
>>>>> + 12+9+9+9+9+9,   /* [011b] 57-bit AGAW (5-level page table) */
>>>>> + 12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
>>>>> + 0,  /* [101b] Reserved */
>>>>> + 0,  /* [110b] Reserved */
>>>>> + 0,  /* [111b] Reserved */
>>>>> + };
>>>>> +
>>>>> + struct domain_values_entry *dve = NULL;
>>>>> +
>>>>> +
>>>>> + if (!context_present(ce)) { /* If (context not present) */
>>>>> + ret = RET_CCE_NOT_PRESENT;  /* Skip it */
>>>&

Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-11-05 Thread Takao Indoh
(2014/11/06 10:35), Li, ZhenHua wrote:
> Yes, that's it. The function context_set_address_root does not set the
> address root correctly.
> 
> I have created another patch for it, see
>   https://lkml.org/lkml/2014/11/5/43

Oh, ok. I'll try again with this patch, thank you.

Thanks,
Takao Indoh

> 
> Thanks
> Zhenhua
> 
> On 11/06/2014 09:31 AM, Takao Indoh wrote:
>> Hi Zhenhua, Baoquan,
>>
>> (2014/10/22 19:05), Baoquan He wrote:
>>> Hi Zhenhua,
>>>
>>> I tested your latest patch on 3.18.0-rc1+, there are still some dmar
>>> errors. I remember it worked well with Bill's original patchset.
>>
>> This should be a problem in copy_context_entry().
>>
>>> +static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 
>>> devfn,
>>> + void *ppap, struct context_entry *ce)
>>> +{
>>> +   int ret = 0;/* Integer Return Code */
>>> +   u32 shift = 0;  /* bits to shift page_addr  */
>>> +   u64 page_addr = 0;  /* Address of translated page */
>>> +   struct dma_pte *pgt_old_phys;   /* Adr(page_table in the old kernel) */
>>> +   struct dma_pte *pgt_new_phys;   /* Adr(page_table in the new kernel) */
>>> +   u8  t;  /* Translation-type from context */
>>> +   u8  aw; /* Address-width from context */
>>> +   u32 aw_shift[8] = {
>>> +   12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
>>> +   12+9+9+9,   /* [001b] 39-bit AGAW (3-level page table) */
>>> +   12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
>>> +   12+9+9+9+9+9,   /* [011b] 57-bit AGAW (5-level page table) */
>>> +   12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
>>> +   0,  /* [101b] Reserved */
>>> +   0,  /* [110b] Reserved */
>>> +   0,  /* [111b] Reserved */
>>> +   };
>>> +
>>> +   struct domain_values_entry *dve = NULL;
>>> +
>>> +
>>> +   if (!context_present(ce)) { /* If (context not present) */
>>> +   ret = RET_CCE_NOT_PRESENT;  /* Skip it */
>>> +   goto exit;
>>> +   }
>>> +
>>> +   t = context_translation_type(ce);
>>> +
>>> +   /* If we have seen this domain-id before on this iommu,
>>> +* give this context the same page-tables and we are done.
>>> +*/
>>> +   list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
>>> +   if (dve->did == (int) context_domain_id(ce)) {
>>> +   switch (t) {
>>> +   case 0: /* page tables */
>>> +   case 1: /* page tables */
>>> +   context_set_address_root(ce,
>>> +   virt_to_phys(dve->pgd));
>>
>> Here, in context_set_address_root(), the new address is set like this:
>>
>>  context->lo |= value & VTD_PAGE_MASK;
>>
>> This is wrong, the logical disjunction of old address and new address
>> becomes invalid address.
>>
>> This should be like this.
>>
>>  case 1: /* page tables */
>>  ce->lo &= (~VTD_PAGE_MASK);
>>  context_set_address_root(ce,
>>  virt_to_phys(dve->pgd));
>>
>> And one more,
>>
>>> +   ret = RET_CCE_PREVIOUS_DID;
>>> +   break;
>>> +
>>> +   case 2: /* Pass through */
>>> +   if (dve->pgd == NULL)
>>> +   ret =  RET_CCE_PASS_THROUGH_2;
>>> +   else
>>> +   ret = RET_BADCOPY;
>>> +   break;
>>> +
>>> +   default: /* Bad value of 't'*/
>>> +   ret = RET_BADCOPY;
>>> +   break;
>>> +   }
>>> +   goto exit;
>>> +   }
>>> +   }
>> (snip)
>>> +   if (t == 0 || t == 1) { /* If (context has page tables) */
>>> +   aw = context_address_width(ce);
>>> +   shift = aw_shift[aw];
>>> +
>>> +   pgt_old_phys = (struc

Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-11-05 Thread Takao Indoh
Hi Zhenhua, Baoquan,

(2014/10/22 19:05), Baoquan He wrote:
> Hi Zhenhua,
> 
> I tested your latest patch on 3.18.0-rc1+, there are still some dmar
> errors. I remember it worked well with Bill's original patchset.

This should be a problem in copy_context_entry().

> +static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 devfn,
> +   void *ppap, struct context_entry *ce)
> +{
> + int ret = 0;/* Integer Return Code */
> + u32 shift = 0;  /* bits to shift page_addr  */
> + u64 page_addr = 0;  /* Address of translated page */
> + struct dma_pte *pgt_old_phys;   /* Adr(page_table in the old kernel) */
> + struct dma_pte *pgt_new_phys;   /* Adr(page_table in the new kernel) */
> + u8  t;  /* Translation-type from context */
> + u8  aw; /* Address-width from context */
> + u32 aw_shift[8] = {
> + 12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
> + 12+9+9+9,   /* [001b] 39-bit AGAW (3-level page table) */
> + 12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
> + 12+9+9+9+9+9,   /* [011b] 57-bit AGAW (5-level page table) */
> + 12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
> + 0,  /* [101b] Reserved */
> + 0,  /* [110b] Reserved */
> + 0,  /* [111b] Reserved */
> + };
> +
> + struct domain_values_entry *dve = NULL;
> +
> +
> + if (!context_present(ce)) { /* If (context not present) */
> + ret = RET_CCE_NOT_PRESENT;  /* Skip it */
> + goto exit;
> + }
> +
> + t = context_translation_type(ce);
> +
> + /* If we have seen this domain-id before on this iommu,
> +  * give this context the same page-tables and we are done.
> +  */
> + list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
> + if (dve->did == (int) context_domain_id(ce)) {
> + switch (t) {
> + case 0: /* page tables */
> + case 1: /* page tables */
> + context_set_address_root(ce,
> + virt_to_phys(dve->pgd));

Here, in context_set_address_root(), the new address is set like this:

context->lo |= value & VTD_PAGE_MASK;

This is wrong; the bitwise OR of the old address and the new address
becomes an invalid address.

This should be like this.

case 1: /* page tables */
ce->lo &= (~VTD_PAGE_MASK);
context_set_address_root(ce,
virt_to_phys(dve->pgd));
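
To make the failure concrete, here is a small standalone sketch (user-space
C, not the kernel code). The function names, the PAGE_MASK_4K constant and
the sample values are assumptions for illustration; only the "OR without
clearing" versus "clear, then set" pattern comes from the code above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB page mask, standing in for VTD_PAGE_MASK. */
#define PAGE_MASK_4K (~(uint64_t)0xfff)

/* Mimics the problematic update: ORs the new root into the old value. */
uint64_t set_root_or_only(uint64_t lo, uint64_t new_root)
{
	return lo | (new_root & PAGE_MASK_4K);
}

/* Clears the old address bits first, keeping only the low flag bits. */
uint64_t set_root_clear_first(uint64_t lo, uint64_t new_root)
{
	assert((new_root & ~PAGE_MASK_4K) == 0); /* root must be page aligned */
	lo &= ~PAGE_MASK_4K;
	return lo | (new_root & PAGE_MASK_4K);
}
```

For an old entry 0x12345001 (root 0x12345000 plus a low flag bit) and a new
root 0xabcde000, the first variant gives 0xbbfdf001, which is neither
address, while the second gives 0xabcde001.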

And one more,

> + ret = RET_CCE_PREVIOUS_DID;
> + break;
> +
> + case 2: /* Pass through */
> + if (dve->pgd == NULL)
> + ret =  RET_CCE_PASS_THROUGH_2;
> + else
> + ret = RET_BADCOPY;
> + break;
> +
> + default: /* Bad value of 't'*/
> + ret = RET_BADCOPY;
> + break;
> + }
> + goto exit;
> + }
> + }
(snip)
> + if (t == 0 || t == 1) { /* If (context has page tables) */
> + aw = context_address_width(ce);
> + shift = aw_shift[aw];
> +
> + pgt_old_phys = (struct dma_pte *)
> + (context_address_root(ce) << VTD_PAGE_SHIFT);
> +
> + ret = copy_page_table(&pgt_new_phys, pgt_old_phys,
> + shift-9, page_addr, iommu, bus, devfn, dve, ppap);
> +
> + if (ret)/* if (problem) bail out */
> + goto exit;
> +
> + context_set_address_root(ce, (unsigned long)(pgt_new_phys));

ditto.

Thanks,
Takao Indoh


> 
> 
> 0console [earlya[0.00] allocate tes of page_cg  'a ong[
> 0.00] tsc: Fast TSC calibration using PIT
> 0031] Calibrating delay loop (skipped), value calculated using timer
> frequency.. 5586.77 BogoMIPS (lpj=2793386)
> [0.010682] pid_max: default: 32768 minimum: 301
> [0.015317] ACPI: Core revision 20140828
> [0.044598] ACPI: All ACPI Tables successfully acquired
> [0.126450] Security Framework initialized
> [0.130569] SELinux:  Initializing.
> [0.135211] Dentry ca

Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-11-05 Thread Takao Indoh
(2014/11/06 10:35), Li, ZhenHua wrote:
 Yes, that's it. The function context_set_address_root does not set the
 address root correctly.
 
 I have created another patch for it, see
   https://lkml.org/lkml/2014/11/5/43

Oh, ok. I'll try again with this patch, thank you.

Thanks,
Takao Indoh

 
 Thanks
 Zhenhua
 
 On 11/06/2014 09:31 AM, Takao Indoh wrote:
 Hi Zhenhua, Baoquan,

 (2014/10/22 19:05), Baoquan He wrote:
 Hi Zhenhua,

 I tested your latest patch on 3.18.0-rc1+, there are still some dmar
 errors. I remember it worked well with Bill's original patchset.

 This should be a problem in copy_context_entry().

 +static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 
 devfn,
 + void *ppap, struct context_entry *ce)
 +{
 +   int ret = 0;/* Integer Return Code */
 +   u32 shift = 0;  /* bits to shift page_addr  */
 +   u64 page_addr = 0;  /* Address of translated page */
 +   struct dma_pte *pgt_old_phys;   /* Adr(page_table in the old kernel) */
 +   struct dma_pte *pgt_new_phys;   /* Adr(page_table in the new kernel) */
 +   u8  t;  /* Translation-type from context */
 +   u8  aw; /* Address-width from context */
 +   u32 aw_shift[8] = {
 +   12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
 +   12+9+9+9,   /* [001b] 39-bit AGAW (3-level page table) */
 +   12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
 +   12+9+9+9+9+9,   /* [011b] 57-bit AGAW (5-level page table) */
 +   12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
 +   0,  /* [101b] Reserved */
 +   0,  /* [110b] Reserved */
 +   0,  /* [111b] Reserved */
 +   };
 +
 +   struct domain_values_entry *dve = NULL;
 +
 +
 +   if (!context_present(ce)) { /* If (context not present) */
 +   ret = RET_CCE_NOT_PRESENT;  /* Skip it */
 +   goto exit;
 +   }
 +
 +   t = context_translation_type(ce);
 +
 +   /* If we have seen this domain-id before on this iommu,
 +* give this context the same page-tables and we are done.
 +*/
 +   list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
 +   if (dve->did == (int) context_domain_id(ce)) {
 +   switch (t) {
 +   case 0: /* page tables */
 +   case 1: /* page tables */
 +   context_set_address_root(ce,
 +   virt_to_phys(dve->pgd));

 Here, in context_set_address_root(), the new address is set like this:

  context->lo |= value & VTD_PAGE_MASK;

 This is wrong: ORing the new address into the old one, which is still
 present in the entry, produces an invalid address.

 The old address should be cleared first, like this:

  case 1: /* page tables */
  ce->lo &= (~VTD_PAGE_MASK);
  context_set_address_root(ce,
  virt_to_phys(dve->pgd));
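
 A minimal userspace sketch of the difference, assuming a 4 KiB VT-d page
 and a hypothetical struct demo_context_entry that holds only the low
 word. All demo_* and DEMO_* names are illustrative and are not the
 kernel's definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB page size; modelled on the kernel's VTD_PAGE_* macros. */
#define DEMO_VTD_PAGE_SHIFT 12
#define DEMO_VTD_PAGE_MASK  (~((UINT64_C(1) << DEMO_VTD_PAGE_SHIFT) - 1))

/* Hypothetical stand-in for the low 64 bits of a context entry. */
struct demo_context_entry {
	uint64_t lo;
};

/* OR-only update: bits of the old root address survive in the entry. */
static void demo_set_root_or_only(struct demo_context_entry *ce, uint64_t root)
{
	ce->lo |= root & DEMO_VTD_PAGE_MASK;
}

/* Masked update: clear the address bits first, keep the low flag bits. */
static void demo_set_root_masked(struct demo_context_entry *ce, uint64_t root)
{
	ce->lo &= ~DEMO_VTD_PAGE_MASK;
	ce->lo |= root & DEMO_VTD_PAGE_MASK;
}
```

 With an old root of 0x12345000 and a new root of 0xabcde000, the OR-only
 form leaves 0xbbfdf000 in the address field, which is neither address,
 while the masked form stores 0xabcde000 and keeps the low flag bits.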

 And one more,

 +   ret = RET_CCE_PREVIOUS_DID;
 +   break;
 +
 +   case 2: /* Pass through */
 +   if (dve->pgd == NULL)
 +   ret =  RET_CCE_PASS_THROUGH_2;
 +   else
 +   ret = RET_BADCOPY;
 +   break;
 +
 +   default: /* Bad value of 't'*/
 +   ret = RET_BADCOPY;
 +   break;
 +   }
 +   goto exit;
 +   }
 +   }
 (snip)
 +   if (t == 0 || t == 1) { /* If (context has page tables) */
 +   aw = context_address_width(ce);
 +   shift = aw_shift[aw];
 +
 +   pgt_old_phys = (struct dma_pte *)
 +   (context_address_root(ce) << VTD_PAGE_SHIFT);
 +
 +   ret = copy_page_table(pgt_new_phys, pgt_old_phys,
 +   shift-9, page_addr, iommu, bus, devfn, dve, ppap);
 +
 +   if (ret)/* if (problem) bail out */
 +   goto exit;
 +
 +   context_set_address_root(ce, (unsigned long)(pgt_new_phys));

 ditto.

 Thanks,
 Takao Indoh




 0console [earlya[0.00] allocate tes of page_cg  'a ong[
 0.00] tsc: Fast TSC calibration using PIT
 0031] Calibrating delay loop (skipped), value calculated using timer
 frequency.. 5586.77 BogoMIPS (lpj=2793386)
 [0.010682] pid_max: default: 32768 minimum: 301
 [0.015317] ACPI: Core revision 20140828
 [0.044598] ACPI: All ACPI Tables successfully acquired
 [0.126450] Security Framework initialized
 [0.130569] SELinux:  Initializing.
 [0.135211] Dentry cache hash table entries: 2097152 (order: 12,
 16777216 bytes)
 [0.145731] Inode-cache hash table entries: 1048576 (order: 11,
 8388608 bytes

Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-11-05 Thread Takao Indoh
(2014/11/06 11:11), Li, ZhenHua wrote:
 This patch does the same thing as you said in your mail.
 It should work, I have tested on my HP huge system.

Yep, I also confirmed it worked.

BTW, I found another problem. When I tested your patches with 3.17
kernel, iommu initialization failed with the following message.

IOMMU intel_iommu_in_crashdump = true
IOMMU Skip disabling iommu hardware translations
IOMMU Copying translate tables from panicked kernel
IOMMU: Copy translate tables failed
IOMMU: dmar init failed


I found that oldcopy() from physical address 0x15000 failed.
oldcopy() copies data using ioremap, and ioremap of the memory region
around 0x15000 does not work because that region is normal RAM which is
already mapped into virtual address space.

arch/x86/mm/ioremap.c
static void __iomem *__ioremap_caller(resource_size_t phys_addr,
unsigned long size, unsigned long prot_val, void *caller)
{
(snip)
/*
 * Don't allow anybody to remap normal RAM that we're using..
 */
pfn  = phys_addr >> PAGE_SHIFT;
last_pfn = last_addr >> PAGE_SHIFT;
if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
  __ioremap_check_ram) == 1)
return NULL;
   <== ioremap failed HERE!


I'm not sure how we should handle this, but as far as I tested the
following fix works.

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3a9e7b8..8d2bd23 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4871,7 +4871,12 @@ static int oldcopy(void *to, void *from, int size)

pfn = ((unsigned long) from) >> VTD_PAGE_SHIFT;
offset = ((unsigned long) from) & (~VTD_PAGE_MASK);
-   ret = copy_oldmem_page(pfn, buf, csize, offset, userbuf);
+
+   if (page_is_ram(pfn)) {
+   memcpy(buf, pfn_to_kaddr(pfn) + offset, csize);
+   ret = size;
+   } else
+   ret = copy_oldmem_page(pfn, buf, csize, offset, userbuf);

return (int) ret;
 }
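
The fallback in the diff above can be sketched in userspace as follows.
This is only an illustration under stated assumptions: demo_phys,
demo_is_ram, demo_copy_via_remap() and demo_oldcopy() are hypothetical
stand-ins for physical memory, page_is_ram(), the copy_oldmem_page() path
and oldcopy() respectively. None of it is kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_NR_PAGES   4

/* Hypothetical "physical memory" plus a per-page flag: 1 means the page
 * is RAM that the running kernel already has mapped. */
static unsigned char demo_phys[DEMO_NR_PAGES * DEMO_PAGE_SIZE];
static int demo_is_ram[DEMO_NR_PAGES] = { 0, 1, 0, 0 };

/* Stand-in for the ioremap-based path: it refuses in-use RAM pages. */
static long demo_copy_via_remap(unsigned long pfn, void *buf, size_t len,
				unsigned long offset)
{
	if (demo_is_ram[pfn])
		return -1;
	memcpy(buf, demo_phys + pfn * DEMO_PAGE_SIZE + offset, len);
	return (long)len;
}

/* Split the address into pfn and offset, copy directly when the page is
 * RAM, and fall back to the remap path otherwise. */
static long demo_oldcopy(void *to, unsigned long from, size_t len)
{
	unsigned long pfn = from >> DEMO_PAGE_SHIFT;
	unsigned long offset = from & (DEMO_PAGE_SIZE - 1);

	if (pfn >= DEMO_NR_PAGES || offset + len > DEMO_PAGE_SIZE)
		return -1;	/* outside the toy memory or crossing a page */

	if (demo_is_ram[pfn]) {
		memcpy(to, demo_phys + pfn * DEMO_PAGE_SIZE + offset, len);
		return (long)len;
	}
	return demo_copy_via_remap(pfn, to, len, offset);
}
```

The point of the sketch is only the branch: a page that the remap path
rejects because it is RAM is still readable through the existing mapping.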


Thanks,
Takao Indoh


 
 On 11/06/2014 09:48 AM, Takao Indoh wrote:
 (2014/11/06 10:35), Li, ZhenHua wrote:
 Yes, that's it. The function context_set_address_root does not set the
 address root correctly.

 I have created another patch for it, see
 https://lkml.org/lkml/2014/11/5/43

 Oh, ok. I'll try again with this patch, thank you.

 Thanks,
 Takao Indoh


 Thanks
 Zhenhua

 On 11/06/2014 09:31 AM, Takao Indoh wrote:
 Hi Zhenhua, Baoquan,

 (2014/10/22 19:05), Baoquan He wrote:
 Hi Zhenhua,

 I tested your latest patch on 3.18.0-rc1+, there are still some dmar
 errors. I remember it worked well with Bill's original patchset.

 This should be a problem in copy_context_entry().

 +static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 
 devfn,
 +   void *ppap, struct context_entry *ce)
 +{
 + int ret = 0;/* Integer Return Code */
 + u32 shift = 0;  /* bits to shift page_addr  */
 + u64 page_addr = 0;  /* Address of translated page */
 + struct dma_pte *pgt_old_phys;   /* Adr(page_table in the old kernel) */
 + struct dma_pte *pgt_new_phys;   /* Adr(page_table in the new kernel) */
 + u8  t;  /* Translation-type from context */
 + u8  aw; /* Address-width from context */
 + u32 aw_shift[8] = {
 + 12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
 + 12+9+9+9,   /* [001b] 39-bit AGAW (3-level page table) */
 + 12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
 + 12+9+9+9+9+9,   /* [011b] 57-bit AGAW (5-level page table) */
 + 12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
 + 0,  /* [101b] Reserved */
 + 0,  /* [110b] Reserved */
 + 0,  /* [111b] Reserved */
 + };
 +
 + struct domain_values_entry *dve = NULL;
 +
 +
 + if (!context_present(ce)) { /* If (context not present) */
 + ret = RET_CCE_NOT_PRESENT;  /* Skip it */
 + goto exit;
 + }
 +
 + t = context_translation_type(ce);
 +
 + /* If we have seen this domain-id before on this iommu,
 +  * give this context the same page-tables and we are done.
 +  */
 + list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
 + if (dve->did == (int) context_domain_id(ce)) {
 + switch (t) {
 + case 0: /* page tables */
 + case 1: /* page tables */
 + context_set_address_root(ce,
 + virt_to_phys(dve->pgd));

 Here, in context_set_address_root(), the new address is set like this:

context->lo |= value & VTD_PAGE_MASK;

 This is wrong: ORing the new address into the old one, which is still
 present in the entry, produces an invalid address.

 The old address should be cleared first, like this:

case 1: /* page tables */
ce->lo &= (~VTD_PAGE_MASK

Re: [PATCH 1/1] pci: fix dmar fault for kdump kernel

2014-10-21 Thread Takao Indoh
Hi ZhenHua,

(2014/10/20 11:19), Li, ZhenHua wrote:
> Hi  Takao Indoh,
> 
> According to this discussion
>   https://lkml.org/lkml/2014/10/17/107
> 
> It seems that we can not do the resetting on the first kernel.  It can
> only be called during kdump kernel boots.

Sounds like that. Do you know any example cases which cannot be fixed by
Bill's patch?

Thanks,
Takao Indoh


> 
> Thanks
> Zhenhua
> On 10/15/2014 04:14 PM, Takao Indoh wrote:
>> (2014/10/14 18:34), Li, ZhenHua wrote:
>>> I tested on the latest stable version 3.17, it works well.
>>>
>>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>>> there may be some unexpected DMA requests on this adapter, which will
>>>> cause DMA Remapping faults like:
>>>>dmar: DRHD: handling fault status reg 102
>>>>dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>>DMAR:[fault reason 01] Present bit in root entry is clear
>>>>
>>>> This bug may happen on *any* PCI device.
>>>> Analysis for this bug:
>>>>
>>>> The present bit is set in this function:
>>>>
>>>> static struct context_entry * device_to_context_entry(
>>>>struct intel_iommu *iommu, u8 bus, u8 devfn)
>>>> {
>>>>..
>>>>set_root_present(root);
>>>>..
>>>> }
>>>>
>>>> Calling tree:
>>>>device driver
>>>>intel_alloc_coherent
>>>>__intel_map_single
>>>>domain_context_mapping
>>>>domain_context_mapping_one
>>>>device_to_context_entry
>>>>
>>>> This means, the present bit in root entry will not be set until the device
>>>> driver is loaded.
>>>>
>>>> But in the kdump kernel, hardware devices are not aware that control has
>>>> transferred to the second kernel, and those drivers must initialize again.
>>>> Consequently there may be unexpected DMA requests from devices activity
>>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>>> second kernel.
>>>>
>>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>>> the device itself does not work.
>>>>
>>>> A patch for this bug that has been sent before:
>>>> https://lkml.org/lkml/2014/9/30/55
>>>> As in discussion, this bug may happen on *any* device, so we need to reset 
>>>> all
>>>> pci devices.
>>>>
>>>> There was an original version(Takao Indoh) that resets the pcie devices:
>>>> https://lkml.org/lkml/2013/5/14/9
>>
>> As far as I can remember, the original patch was nacked by
>> the following reasons:
>>
>> 1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
>>  so there would still be a window where ongoing DMA could cause an
>>  IOMMU error.
>>
>> 2) Basically Bjorn is thinking device reset should be done in the
>>  1st kernel before jumping into 2nd kernel.
>>
>> And Bill Sumner proposed another idea.
>> http://comments.gmane.org/gmane.linux.kernel.iommu/4828
>> I don't know the current status of this patch, but I think Jerry Hoemann
>> is working on this.
>>
>> Thanks,
>> Takao Indoh
>>
>>
>>>>
>>>> Update of this new version, comparing with Takao Indoh's version:
>>>>Add support for legacy PCI devices.
>>>>Use pci_try_reset_bus instead of do_downstream_device_reset in 
>>>> original version
>>>>
>>>> Randy Wright corrects some misunderstanding in this description.
>>>>
>>>> Signed-off-by: Li, Zhen-Hua 
>>>> Signed-off-by: Takao Indoh 
>>>> Signed-off-by: Randy Wright 
>>>> ---
>>>> drivers/pci/pci.c | 84 
>>>> +++
>>>> 1 file changed, 84 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 2c9ac70..8cb146c 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -23,6 +23,7 @

Re: [PATCH 1/1] pci: fix dmar fault for kdump kernel

2014-10-15 Thread Takao Indoh
(2014/10/14 18:34), Li, ZhenHua wrote:
> I tested on the latest stable version 3.17, it works well.
> 
> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>> On a HP system with Intel vt-d supported and many PCI devices on it,
>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>> there may be some unexpected DMA requests on this adapter, which will
>> cause DMA Remapping faults like:
>>  dmar: DRHD: handling fault status reg 102
>>  dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>  DMAR:[fault reason 01] Present bit in root entry is clear
>>
>> This bug may happen on *any* PCI device.
>> Analysis for this bug:
>>
>> The present bit is set in this function:
>>
>> static struct context_entry * device_to_context_entry(
>>  struct intel_iommu *iommu, u8 bus, u8 devfn)
>> {
>>  ..
>>  set_root_present(root);
>>  ..
>> }
>>
>> Calling tree:
>>  device driver
>>  intel_alloc_coherent
>>  __intel_map_single
>>  domain_context_mapping
>>  domain_context_mapping_one
>>  device_to_context_entry
>>
>> This means, the present bit in root entry will not be set until the device
>> driver is loaded.
>>
>> But in the kdump kernel, hardware devices are not aware that control has
>> transferred to the second kernel, and those drivers must initialize again.
>> Consequently there may be unexpected DMA requests from devices activity
>> initiated in the first kernel leading to the DMA Remapping errors in the
>> second kernel.
>>
>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>> the device itself does not work.
>>
>> A patch for this bug that has been sent before:
>> https://lkml.org/lkml/2014/9/30/55
>> As in discussion, this bug may happen on *any* device, so we need to reset 
>> all
>> pci devices.
>>
>> There was an original version(Takao Indoh) that resets the pcie devices:
>> https://lkml.org/lkml/2013/5/14/9

As far as I can remember, the original patch was nacked for
the following reasons:

1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
   so there would still be a window where ongoing DMA could cause an
   IOMMU error.

2) Basically Bjorn is thinking device reset should be done in the
   1st kernel before jumping into 2nd kernel.

And Bill Sumner proposed another idea.
http://comments.gmane.org/gmane.linux.kernel.iommu/4828
I don't know the current status of this patch, but I think Jerry Hoemann
is working on this.

Thanks,
Takao Indoh


>>
>> Update of this new version, comparing with Takao Indoh's version:
>>  Add support for legacy PCI devices.
>>  Use pci_try_reset_bus instead of do_downstream_device_reset in original 
>> version
>>
>> Randy Wright corrects some misunderstanding in this description.
>>
>> Signed-off-by: Li, Zhen-Hua 
>> Signed-off-by: Takao Indoh 
>> Signed-off-by: Randy Wright 
>> ---
>>   drivers/pci/pci.c | 84 
>> +++
>>   1 file changed, 84 insertions(+)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 2c9ac70..8cb146c 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -23,6 +23,7 @@
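>>   #include <linux/device.h>
>>   #include <linux/pm_runtime.h>
>>   #include <linux/pci_hotplug.h>
>> +#include <linux/crash_dump.h>
>>   #include <asm-generic/pci-bridge.h>
>>   #include <asm/setup.h>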
>>   #include "pci.h"
>> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>   }
>>   EXPORT_SYMBOL(pci_fixup_cardbus);
>>
>> +/*
>> + * Return true if dev is PCI root port or downstream port whose child is PCI
>> + * endpoint except VGA device.
>> + */
>> +static int __pci_dev_need_reset(struct pci_dev *dev)
>> +{
>> +struct pci_bus *subordinate;
>> +struct pci_dev *child;
>> +
>> +if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>> +return 0;
>> +
>> +if (pci_is_pcie(dev)) {
>> +if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> +(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>> +return 0;
>> +}
>> +
>> +subordinate = dev->subordinate;
>> +list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +/* Don't reset switch, bridge, VGA device */
>> +if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
>> +((child->class >> 16) =

Re: [PATCH 1/1] pci: fix dmar fault for kdump kernel

2014-10-15 Thread Takao Indoh
(2014/10/14 18:34), Li, ZhenHua wrote:
 I tested on the latest stable version 3.17, it works well.
 
 On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
 On a HP system with Intel vt-d supported and many PCI devices on it,
 when kernel crashed and the kdump kernel boots with intel_iommu=on,
 there may be some unexpected DMA requests on this adapter, which will
 cause DMA Remapping faults like:
  dmar: DRHD: handling fault status reg 102
  dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
  DMAR:[fault reason 01] Present bit in root entry is clear

 This bug may happen on *any* PCI device.
 Analysis for this bug:

 The present bit is set in this function:

 static struct context_entry * device_to_context_entry(
  struct intel_iommu *iommu, u8 bus, u8 devfn)
 {
  ..
  set_root_present(root);
  ..
 }

 Calling tree:
  device driver
  intel_alloc_coherent
  __intel_map_single
  domain_context_mapping
  domain_context_mapping_one
  device_to_context_entry

 This means, the present bit in root entry will not be set until the device
 driver is loaded.

 But in the kdump kernel, hardware devices are not aware that control has
 transferred to the second kernel, and those drivers must initialize again.
 Consequently there may be unexpected DMA requests from devices activity
 initiated in the first kernel leading to the DMA Remapping errors in the
 second kernel.

 To fix this DMAR fault, we need to reset the bus that this device on. Reset
 the device itself does not work.

 A patch for this bug that has been sent before:
 https://lkml.org/lkml/2014/9/30/55
 As in discussion, this bug may happen on *any* device, so we need to reset 
 all
 pci devices.

 There was an original version(Takao Indoh) that resets the pcie devices:
 https://lkml.org/lkml/2013/5/14/9

As far as I can remember, the original patch was nacked for
the following reasons:

1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
   so there would still be a window where ongoing DMA could cause an
   IOMMU error.

2) Basically Bjorn is thinking device reset should be done in the
   1st kernel before jumping into 2nd kernel.

And Bill Sumner proposed another idea.
http://comments.gmane.org/gmane.linux.kernel.iommu/4828
I don't know the current status of this patch, but I think Jerry Hoemann
is working on this.

Thanks,
Takao Indoh



 Update of this new version, comparing with Takao Indoh's version:
  Add support for legacy PCI devices.
  Use pci_try_reset_bus instead of do_downstream_device_reset in original 
 version

 Randy Wright corrects some misunderstanding in this description.

 Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
 Signed-off-by: Takao Indoh indou.ta...@jp.fujitsu.com
 Signed-off-by: Randy Wright rwri...@hp.com
 ---
   drivers/pci/pci.c | 84 
 +++
   1 file changed, 84 insertions(+)

 diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
 index 2c9ac70..8cb146c 100644
 --- a/drivers/pci/pci.c
 +++ b/drivers/pci/pci.c
 @@ -23,6 +23,7 @@
   #include <linux/device.h>
   #include <linux/pm_runtime.h>
   #include <linux/pci_hotplug.h>
 +#include <linux/crash_dump.h>
   #include <asm-generic/pci-bridge.h>
   #include <asm/setup.h>
   #include "pci.h"
 @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
   }
   EXPORT_SYMBOL(pci_fixup_cardbus);

 +/*
 + * Return true if dev is PCI root port or downstream port whose child is PCI
 + * endpoint except VGA device.
 + */
 +static int __pci_dev_need_reset(struct pci_dev *dev)
 +{
 +struct pci_bus *subordinate;
 +struct pci_dev *child;
 +
 +if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
 +return 0;
 +
 +if (pci_is_pcie(dev)) {
 +if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
 +(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
 +return 0;
 +}
 +
 +subordinate = dev->subordinate;
 +list_for_each_entry(child, &subordinate->devices, bus_list) {
 +/* Don't reset switch, bridge, VGA device */
 +if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
 +((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
 +((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
 +return 0;
 +
 +if (pci_is_pcie(child)) {
 +if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
 +(pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
 +return 0;
 +}
 +}
 +
 +return 1;
 +}
 +
 +struct pci_dev_reset_entry {
 +struct list_head list;
 +struct pci_dev *dev;
 +};
 +int __init pci_reset_endpoints(void)
 +{
 +struct pci_dev *dev = NULL;
 +struct pci_dev_reset_entry *pdev_entry, *tmp;
 +struct pci_bus *subordinate = NULL;
 +int has_it;
 +
 +LIST_HEAD(pdev_list);
 +
 +if (likely(!is_kdump_kernel

[PATCH v2] ipmi: Clear drvdata when interface is removed

2014-09-09 Thread Takao Indoh
This patch fixes a bug in hotmod removal.

After an ipmi interface is removed using hotmod, a kernel panic occurs on
rmmod ipmi_si. For example, try this:

 # echo "remove,"`cat /proc/ipmi/0/params` > \
 /sys/module/ipmi_si/parameters/hotmod
 # rmmod ipmi_si

Then, rmmod fails with the following messages.

[ cut here ]
WARNING: CPU: 12 PID: 10819 at /mnt/repos/linux/lib/list_debug.c:53
__list_del_entry+0x63/0xd0()
(snip)
CPU: 12 PID: 10819 Comm: rmmod Not tainted 3.17.0-rc1 #19
Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015
Rev.3D81.3030 02/10/2012
 0009 88022d547d40 81575778 88022d547d88
 88022d547d78 8104ec5d 88023908cdb0 a06fa4e0
 8800bac20860  02046090 88022d547dd8
Call Trace:
 [<81575778>] dump_stack+0x45/0x56
 [<8104ec5d>] warn_slowpath_common+0x7d/0xa0
 [<8104eccc>] warn_slowpath_fmt+0x4c/0x50
 [<811f60bf>] ? __kernfs_remove+0xdf/0x220
 [<81291213>] __list_del_entry+0x63/0xd0
 [<8129128d>] list_del+0xd/0x30
 [<a06f285a>] cleanup_one_si+0x2a/0x230 [ipmi_si]
 [<a06f2f05>] ipmi_pnp_remove+0x15/0x20 [ipmi_si]
 [<8131c7d4>] pnp_device_remove+0x24/0x40
 [<8137175f>] __device_release_driver+0x7f/0xf0
 [<81372100>] driver_detach+0xb0/0xc0
 [<81371415>] bus_remove_driver+0x55/0xd0
 [<8137283c>] driver_unregister+0x2c/0x50
 [<8131ca02>] pnp_unregister_driver+0x12/0x20
 [<a06f347c>] cleanup_ipmi_si+0xbc/0xf0 [ipmi_si]
 [<810c33f2>] SyS_delete_module+0x132/0x1c0
 [<81002ab9>] ? do_notify_resume+0x59/0x80
 [<8157c45a>] ? int_signal+0x12/0x17
 [<8157c1d2>] system_call_fastpath+0x16/0x1b
---[ end trace 70b4377268f85c23 ]---

list_del() in cleanup_one_si() fails because the smi_info was already
removed from the list by the hotmod removal.

When an ipmi interface is removed by hotmod, smi_info is removed by
cleanup_one_si(), but it is still set in drvdata. Therefore, when ipmi_si
is unloaded with rmmod, ipmi_pnp_remove() tries to remove it again and fails.

With this patch, the pointer to smi_info in drvdata is cleared in
cleanup_one_si(), so that it is not accessed again on rmmod.
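
The double removal and the effect of clearing drvdata can be sketched in
userspace as follows. This is an illustration only: struct demo_device,
struct demo_smi_info, demo_cleanup_one() and demo_bus_remove() are
hypothetical stand-ins for the device, smi_info, cleanup_one_si() and
ipmi_pnp_remove(), and the on_list flag stands in for list membership.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device that carries a driver-private pointer. */
struct demo_device {
	void *drvdata;
};

/* Hypothetical interface state; on_list is 1 while it is linked. */
struct demo_smi_info {
	struct demo_device *dev;
	int on_list;
};

/* Counts attempts to unlink an entry that was already unlinked. */
static int demo_bad_unlinks;

/* Models the cleanup step; clear_drvdata selects the patched behaviour. */
static void demo_cleanup_one(struct demo_smi_info *info, int clear_drvdata)
{
	if (!info)
		return;

	if (clear_drvdata && info->dev)
		info->dev->drvdata = NULL;

	if (!info->on_list)
		demo_bad_unlinks++;	/* the case list debugging warns about */
	info->on_list = 0;
}

/* Models the bus remove callback: clean up whatever drvdata points to. */
static void demo_bus_remove(struct demo_device *dev, int clear_drvdata)
{
	demo_cleanup_one(dev->drvdata, clear_drvdata);
}
```

Without the clearing step, the second call reaches the already removed
entry through the stale drvdata pointer. With it, the remove callback
sees NULL and returns early.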

changelog:
v2:
- Clear drvdata in cleanup_one_si
- Change subject

v1:
https://lkml.org/lkml/2014/9/8/741


Signed-off-by: Takao Indoh 
---
 drivers/char/ipmi/ipmi_si_intf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 5d66568..01dcbdb 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -3655,6 +3655,9 @@ static void cleanup_one_si(struct smi_info *to_clean)
if (!to_clean)
return;
 
+   if (to_clean->dev)
+   dev_set_drvdata(to_clean->dev, NULL);
+
list_del(&to_clean->link);
 
/* Tell the driver that we are shutting down. */
-- 
1.8.3.1




Re: [RESEND PATCH] ipmi: Clear drvdata when interface is removed by hotmod

2014-09-09 Thread Takao Indoh
(2014/09/10 3:17), Corey Minyard wrote:
> Ok, I can see the problem.  Instead of the change you have, can you add
> something like:
> 
> if (!info)
>  return;
> if (info->dev)
>  dev_set_drvdata(info->dev, NULL);
> 
> to the top of cleanup_one_si() instead?  I think that would be a cleaner
> and more general.

Ok, agreed.

  > if (!info)
  >  return;

This is already done at the top of cleanup_one_si(), so I'll just
add this:

  > if (info->dev)
  >      dev_set_drvdata(info->dev, NULL);

Thanks,
Takao Indoh


> 
> -corey
> 
> 
> On 09/08/2014 07:19 PM, Takao Indoh wrote:
>> Add another email of maintainer just in case.
>>
>>
>> This patch fixes a bug on hotmod removing.
>>
>> After ipmi interface is removed using hotmod, kernel panic occurs when
>> rmmod ipmi_si. For example, try this:
>>
>>   # echo "remove,"`cat /proc/ipmi/0/params` > \
>>   /sys/module/ipmi_si/parameters/hotmod
>>   # rmmod ipmi_si
>>
>> Then, rmmod fails with the following messages.
>>
>> [ cut here ]
>> WARNING: CPU: 12 PID: 10819 at /mnt/repos/linux/lib/list_debug.c:53
>> __list_del_entry+0x63/0xd0()
>> (snip)
>> CPU: 12 PID: 10819 Comm: rmmod Not tainted 3.17.0-rc1 #19
>> Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015
>> Rev.3D81.3030 02/10/2012
>>   0009 88022d547d40 81575778 88022d547d88
>>   88022d547d78 8104ec5d 88023908cdb0 a06fa4e0
>>   8800bac20860  02046090 88022d547dd8
>> Call Trace:
>>   [<81575778>] dump_stack+0x45/0x56
>>   [<8104ec5d>] warn_slowpath_common+0x7d/0xa0
>>   [<8104eccc>] warn_slowpath_fmt+0x4c/0x50
>>   [<811f60bf>] ? __kernfs_remove+0xdf/0x220
>>   [<81291213>] __list_del_entry+0x63/0xd0
>>   [<8129128d>] list_del+0xd/0x30
>>   [<a06f285a>] cleanup_one_si+0x2a/0x230 [ipmi_si]
>>   [<a06f2f05>] ipmi_pnp_remove+0x15/0x20 [ipmi_si]
>>   [<8131c7d4>] pnp_device_remove+0x24/0x40
>>   [<8137175f>] __device_release_driver+0x7f/0xf0
>>   [<81372100>] driver_detach+0xb0/0xc0
>>   [<81371415>] bus_remove_driver+0x55/0xd0
>>   [<8137283c>] driver_unregister+0x2c/0x50
>>   [<8131ca02>] pnp_unregister_driver+0x12/0x20
>>   [<a06f347c>] cleanup_ipmi_si+0xbc/0xf0 [ipmi_si]
>>   [<810c33f2>] SyS_delete_module+0x132/0x1c0
>>   [<81002ab9>] ? do_notify_resume+0x59/0x80
>>   [<8157c45a>] ? int_signal+0x12/0x17
>>   [<8157c1d2>] system_call_fastpath+0x16/0x1b
>> ---[ end trace 70b4377268f85c23 ]---
>>
>> list_del in cleanup_one_si() fails because the smi_info is already
>> removed when hotmod removing.
>>
>> When ipmi interface is removed by hotmod, smi_info is removed by
>> cleanup_one_si(), but it is still set in drvdata. Therefore when rmmod
>> ipmi_si, ipmi_pnp_remove tries to remove it again and fails.
>>
>> By this patch, a pointer to smi_info in drvdata is cleared when hotmod
>> removing so that it will be not accessed when rmmod.
>>
>> Signed-off-by: Takao Indoh 
>> ---
>>   drivers/char/ipmi/ipmi_si_intf.c | 27 +--
>>   1 file changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/char/ipmi/ipmi_si_intf.c 
>> b/drivers/char/ipmi/ipmi_si_intf.c
>> index 5d66568..20a3739 100644
>> --- a/drivers/char/ipmi/ipmi_si_intf.c
>> +++ b/drivers/char/ipmi/ipmi_si_intf.c
>> @@ -1904,8 +1904,14 @@ static int hotmod_handler(const char *val, struct 
>> kernel_param *kp)
>>  continue;
>>  if (e->si_type != si_type)
>>  continue;
>> -if (e->io.addr_data == addr)
>> -cleanup_one_si(e);
>> +if (e->io.addr_data != addr)
>> +continue;
>> +
>> +/* Clear driver data */
>> +if (e->dev)
>> +dev_set_drvdata(e->dev, NULL);
>> +
>> +cleanup_one_si(e);
>>  }
>>  mutex_unlock(&smi_infos_lock);
>>  }
>> @@ -2316,7 +2322,8 @@ static void ipmi_pnp_remove(struct pnp_dev *dev)
>>   {
>>  struct smi_info *info = pnp_get_drvdata(dev);
>>   
>> -cleanup_one_si(info);
>> +if (info)
>> +cleanup_one_si(info);
>>   }
>>   
>>   static const struct pnp_device_id pnp_dev_table[] = {
>> @@ -2621,7 +2628,8 @@ static int ipmi_pci_probe(struct pci_dev *pdev,
>>   static void ipmi_pc

Re: [RESEND PATCH] ipmi: Clear drvdata when interface is removed by hotmod

2014-09-09 Thread Takao Indoh
(2014/09/10 3:17), Corey Minyard wrote:
 Ok, I can see the problem.  Instead of the change you have, can you add
 something like:
 
 if (!info)
  return;
 if (info->dev)
  dev_set_drvdata(info->dev, NULL);
 
 to the top of cleanup_one_si() instead?  I think that would be a cleaner
 and more general.

Ok, agreed.

  > if (!info)
  >  return;

This is already done at the top of cleanup_one_si(), so I'll just
add this:

  > if (info->dev)
  >      dev_set_drvdata(info->dev, NULL);

Thanks,
Takao Indoh


 
 -corey
 
 

[PATCH v2] ipmi: Clear drvdata when interface is removed

2014-09-09 Thread Takao Indoh
This patch fixes a bug on hotmod removing.

After an ipmi interface is removed using hotmod, a kernel panic occurs on
rmmod ipmi_si. For example, try this:

 # echo "remove,"`cat /proc/ipmi/0/params` > \
 /sys/module/ipmi_si/parameters/hotmod
 # rmmod ipmi_si

Then, rmmod fails with the following messages.

[ cut here ]
WARNING: CPU: 12 PID: 10819 at /mnt/repos/linux/lib/list_debug.c:53
__list_del_entry+0x63/0xd0()
(snip)
CPU: 12 PID: 10819 Comm: rmmod Not tainted 3.17.0-rc1 #19
Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015
Rev.3D81.3030 02/10/2012
 0009 88022d547d40 81575778 88022d547d88
 88022d547d78 8104ec5d 88023908cdb0 a06fa4e0
 8800bac20860  02046090 88022d547dd8
Call Trace:
 [81575778] dump_stack+0x45/0x56
 [8104ec5d] warn_slowpath_common+0x7d/0xa0
 [8104eccc] warn_slowpath_fmt+0x4c/0x50
 [811f60bf] ? __kernfs_remove+0xdf/0x220
 [81291213] __list_del_entry+0x63/0xd0
 [8129128d] list_del+0xd/0x30
 [a06f285a] cleanup_one_si+0x2a/0x230 [ipmi_si]
 [a06f2f05] ipmi_pnp_remove+0x15/0x20 [ipmi_si]
 [8131c7d4] pnp_device_remove+0x24/0x40
 [8137175f] __device_release_driver+0x7f/0xf0
 [81372100] driver_detach+0xb0/0xc0
 [81371415] bus_remove_driver+0x55/0xd0
 [8137283c] driver_unregister+0x2c/0x50
 [8131ca02] pnp_unregister_driver+0x12/0x20
 [a06f347c] cleanup_ipmi_si+0xbc/0xf0 [ipmi_si]
 [810c33f2] SyS_delete_module+0x132/0x1c0
 [81002ab9] ? do_notify_resume+0x59/0x80
 [8157c45a] ? int_signal+0x12/0x17
 [8157c1d2] system_call_fastpath+0x16/0x1b
---[ end trace 70b4377268f85c23 ]---

list_del in cleanup_one_si() fails because the smi_info was already
removed from the list during the hotmod removal.

When an ipmi interface is removed by hotmod, its smi_info is removed by
cleanup_one_si(), but it is still set in drvdata. Therefore, on rmmod
ipmi_si, ipmi_pnp_remove() tries to remove it again and fails.

With this patch, the pointer to smi_info in drvdata is cleared during
hotmod removal so that it is not accessed on rmmod.

changelog:
v2:
- Clear drvdata in cleanup_one_si
- Change subject

v1:
https://lkml.org/lkml/2014/9/8/741


Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 drivers/char/ipmi/ipmi_si_intf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 5d66568..01dcbdb 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -3655,6 +3655,9 @@ static void cleanup_one_si(struct smi_info *to_clean)
 	if (!to_clean)
 		return;
 
+	if (to_clean->dev)
+		dev_set_drvdata(to_clean->dev, NULL);
+
 	list_del(&to_clean->link);
 
 	/* Tell the driver that we are shutting down. */
-- 
1.8.3.1
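
As a rough userspace illustration of the v2 change (not kernel code), the
sketch below models drvdata as a plain pointer and list membership as a
flag. The struct layouts, the on_list flag, the cleanups_run counter and
bus_remove() are assumptions made for this example; only the ordering
inside cleanup_one_si() mirrors the patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for struct device and struct smi_info. */
struct device {
	void *drvdata;
};

struct smi_info {
	struct device *dev;
	int on_list;	/* 1 while linked into the (imaginary) smi_infos list */
};

static int cleanups_run;	/* counts how many real teardowns happened */

static void cleanup_one_si(struct smi_info *to_clean)
{
	if (!to_clean)
		return;

	/* The v2 change: drop the back-pointer before tearing down. */
	if (to_clean->dev)
		to_clean->dev->drvdata = NULL;

	/* Tearing down the same entry twice is the bug being avoided. */
	assert(to_clean->on_list);
	to_clean->on_list = 0;
	cleanups_run++;
}

/* Models a bus remove callback such as ipmi_pnp_remove(). */
static void bus_remove(struct device *dev)
{
	cleanup_one_si(dev->drvdata);
}
```

Calling cleanup_one_si() directly (the hotmod path) and then bus_remove()
(the rmmod path) runs the teardown only once, because the second call
sees a NULL drvdata and returns early.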


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RESEND PATCH] ipmi: Clear drvdata when interface is removed by hotmod

2014-09-08 Thread Takao Indoh
Adding another maintainer email address just in case.


This patch fixes a bug on hotmod removing.

After an ipmi interface is removed using hotmod, a kernel panic occurs on
rmmod ipmi_si. For example, try this:

 # echo "remove,"`cat /proc/ipmi/0/params` > \
 /sys/module/ipmi_si/parameters/hotmod
 # rmmod ipmi_si

Then, rmmod fails with the following messages.

[ cut here ]
WARNING: CPU: 12 PID: 10819 at /mnt/repos/linux/lib/list_debug.c:53
__list_del_entry+0x63/0xd0()
(snip)
CPU: 12 PID: 10819 Comm: rmmod Not tainted 3.17.0-rc1 #19
Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015
Rev.3D81.3030 02/10/2012
 0009 88022d547d40 81575778 88022d547d88
 88022d547d78 8104ec5d 88023908cdb0 a06fa4e0
 8800bac20860  02046090 88022d547dd8
Call Trace:
 [] dump_stack+0x45/0x56
 [] warn_slowpath_common+0x7d/0xa0
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? __kernfs_remove+0xdf/0x220
 [] __list_del_entry+0x63/0xd0
 [] list_del+0xd/0x30
 [] cleanup_one_si+0x2a/0x230 [ipmi_si]
 [] ipmi_pnp_remove+0x15/0x20 [ipmi_si]
 [] pnp_device_remove+0x24/0x40
 [] __device_release_driver+0x7f/0xf0
 [] driver_detach+0xb0/0xc0
 [] bus_remove_driver+0x55/0xd0
 [] driver_unregister+0x2c/0x50
 [] pnp_unregister_driver+0x12/0x20
 [] cleanup_ipmi_si+0xbc/0xf0 [ipmi_si]
 [] SyS_delete_module+0x132/0x1c0
 [] ? do_notify_resume+0x59/0x80
 [] ? int_signal+0x12/0x17
 [] system_call_fastpath+0x16/0x1b
---[ end trace 70b4377268f85c23 ]---

list_del in cleanup_one_si() fails because the smi_info was already
removed from the list during the hotmod removal.

When an ipmi interface is removed by hotmod, its smi_info is removed by
cleanup_one_si(), but it is still set in drvdata. Therefore, on rmmod
ipmi_si, ipmi_pnp_remove() tries to remove it again and fails.

With this patch, the pointer to smi_info in drvdata is cleared during
hotmod removal so that it is not accessed on rmmod.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 drivers/char/ipmi/ipmi_si_intf.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 5d66568..20a3739 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -1904,8 +1904,14 @@ static int hotmod_handler(const char *val, struct 
kernel_param *kp)
continue;
if (e->si_type != si_type)
continue;
-   if (e->io.addr_data == addr)
-   cleanup_one_si(e);
+   if (e->io.addr_data != addr)
+   continue;
+
+   /* Clear driver data */
+   if (e->dev)
+   dev_set_drvdata(e->dev, NULL);
+
+   cleanup_one_si(e);
}
mutex_unlock(&smi_infos_lock);
}
@@ -2316,7 +2322,8 @@ static void ipmi_pnp_remove(struct pnp_dev *dev)
 {
struct smi_info *info = pnp_get_drvdata(dev);
 
-   cleanup_one_si(info);
+   if (info)
+   cleanup_one_si(info);
 }
 
 static const struct pnp_device_id pnp_dev_table[] = {
@@ -2621,7 +2628,8 @@ static int ipmi_pci_probe(struct pci_dev *pdev,
 static void ipmi_pci_remove(struct pci_dev *pdev)
 {
struct smi_info *info = pci_get_drvdata(pdev);
-   cleanup_one_si(info);
+   if (info)
+   cleanup_one_si(info);
pci_disable_device(pdev);
 }
 
@@ -2729,7 +2737,10 @@ static int ipmi_probe(struct platform_device *dev)
 static int ipmi_remove(struct platform_device *dev)
 {
 #ifdef CONFIG_OF
-   cleanup_one_si(dev_get_drvdata(&dev->dev));
+   struct smi_info *info = dev_get_drvdata(&dev->dev);
+
+   if (info)
+   cleanup_one_si(info);
 #endif
return 0;
 }
@@ -2796,7 +2807,11 @@ static int ipmi_parisc_probe(struct parisc_device *dev)
 
 static int ipmi_parisc_remove(struct parisc_device *dev)
 {
-   cleanup_one_si(dev_get_drvdata(&dev->dev));
+   struct smi_info *info = dev_get_drvdata(&dev->dev);
+
+   if (info)
+   cleanup_one_si(info);
+
return 0;
 }
 
-- 
1.8.3.1
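
To illustrate why the second list_del() in the trace above is rejected,
here is a minimal userspace sketch of a doubly linked list whose delete
routine poisons the removed entry's pointers and refuses to delete an
entry that is already poisoned. This is only loosely modelled on the
kernel's debug list checks; list_del_checked() and the poison1/poison2
sentinels are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Arbitrary sentinel objects standing in for poison pointer values. */
static struct list_head poison1, poison2;

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Returns false (instead of warning) if the entry was already deleted. */
static bool list_del_checked(struct list_head *entry)
{
	if (entry->next == &poison1 || entry->prev == &poison2)
		return false;

	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = &poison1;
	entry->prev = &poison2;
	return true;
}
```

The first delete unlinks the entry and leaves the list empty; a second
delete of the same entry is detected and returns false, which corresponds
to the situation where rmmod reaches cleanup_one_si() for an smi_info that
hotmod has already removed.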



[PATCH] ipmi: Clear drvdata when interface is removed by hotmod

2014-09-04 Thread Takao Indoh
This patch fixes a bug on hotmod removing.

After an ipmi interface is removed using hotmod, a kernel panic occurs on
rmmod ipmi_si. For example, try this:

 # echo "remove,"`cat /proc/ipmi/0/params` > \
 /sys/module/ipmi_si/parameters/hotmod
 # rmmod ipmi_si

Then, rmmod fails with the following messages.

[ cut here ]
WARNING: CPU: 12 PID: 10819 at /mnt/repos/linux/lib/list_debug.c:53
__list_del_entry+0x63/0xd0()
(snip)
CPU: 12 PID: 10819 Comm: rmmod Not tainted 3.17.0-rc1 #19
Hardware name: FUJITSU-SV PRIMERGY BX920 S2/D3030, BIOS 080015
Rev.3D81.3030 02/10/2012
 0009 88022d547d40 81575778 88022d547d88
 88022d547d78 8104ec5d 88023908cdb0 a06fa4e0
 8800bac20860  02046090 88022d547dd8
Call Trace:
 [] dump_stack+0x45/0x56
 [] warn_slowpath_common+0x7d/0xa0
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? __kernfs_remove+0xdf/0x220
 [] __list_del_entry+0x63/0xd0
 [] list_del+0xd/0x30
 [] cleanup_one_si+0x2a/0x230 [ipmi_si]
 [] ipmi_pnp_remove+0x15/0x20 [ipmi_si]
 [] pnp_device_remove+0x24/0x40
 [] __device_release_driver+0x7f/0xf0
 [] driver_detach+0xb0/0xc0
 [] bus_remove_driver+0x55/0xd0
 [] driver_unregister+0x2c/0x50
 [] pnp_unregister_driver+0x12/0x20
 [] cleanup_ipmi_si+0xbc/0xf0 [ipmi_si]
 [] SyS_delete_module+0x132/0x1c0
 [] ? do_notify_resume+0x59/0x80
 [] ? int_signal+0x12/0x17
 [] system_call_fastpath+0x16/0x1b
---[ end trace 70b4377268f85c23 ]---

list_del in cleanup_one_si() fails because the smi_info was already
removed from the list during the hotmod removal.

When an ipmi interface is removed by hotmod, its smi_info is removed by
cleanup_one_si(), but it is still set in drvdata. Therefore, on rmmod
ipmi_si, ipmi_pnp_remove() tries to remove it again and fails.

With this patch, the pointer to smi_info in drvdata is cleared during
hotmod removal so that it is not accessed on rmmod.

Signed-off-by: Takao Indoh <indou.ta...@jp.fujitsu.com>
---
 drivers/char/ipmi/ipmi_si_intf.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 5d66568..20a3739 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -1904,8 +1904,14 @@ static int hotmod_handler(const char *val, struct 
kernel_param *kp)
continue;
if (e->si_type != si_type)
continue;
-   if (e->io.addr_data == addr)
-   cleanup_one_si(e);
+   if (e->io.addr_data != addr)
+   continue;
+
+   /* Clear driver data */
+   if (e->dev)
+   dev_set_drvdata(e->dev, NULL);
+
+   cleanup_one_si(e);
}
mutex_unlock(&smi_infos_lock);
}
@@ -2316,7 +2322,8 @@ static void ipmi_pnp_remove(struct pnp_dev *dev)
 {
struct smi_info *info = pnp_get_drvdata(dev);
 
-   cleanup_one_si(info);
+   if (info)
+   cleanup_one_si(info);
 }
 
 static const struct pnp_device_id pnp_dev_table[] = {
@@ -2621,7 +2628,8 @@ static int ipmi_pci_probe(struct pci_dev *pdev,
 static void ipmi_pci_remove(struct pci_dev *pdev)
 {
struct smi_info *info = pci_get_drvdata(pdev);
-   cleanup_one_si(info);
+   if (info)
+   cleanup_one_si(info);
pci_disable_device(pdev);
 }
 
@@ -2729,7 +2737,10 @@ static int ipmi_probe(struct platform_device *dev)
 static int ipmi_remove(struct platform_device *dev)
 {
 #ifdef CONFIG_OF
-   cleanup_one_si(dev_get_drvdata(&dev->dev));
+   struct smi_info *info = dev_get_drvdata(&dev->dev);
+
+   if (info)
+   cleanup_one_si(info);
 #endif
return 0;
 }
@@ -2796,7 +2807,11 @@ static int ipmi_parisc_probe(struct parisc_device *dev)
 
 static int ipmi_parisc_remove(struct parisc_device *dev)
 {
-   cleanup_one_si(dev_get_drvdata(&dev->dev));
+   struct smi_info *info = dev_get_drvdata(&dev->dev);
+
+   if (info)
+   cleanup_one_si(info);
+
return 0;
 }
 
-- 
1.8.3.1
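
The approach taken in this first version can be sketched in userspace as
follows: the hotmod path clears drvdata itself, and each bus remove
callback checks for NULL before cleaning up. Everything here is a
simplified model; the cleaned flag, the double_cleanups counter and the
clear_drvdata switch (which selects pre-patch or patched behaviour) are
assumptions added for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct device {
	void *drvdata;
};

struct smi_info {
	struct device *dev;
	bool cleaned;	/* set once the interface has been torn down */
};

static int double_cleanups;	/* counts teardowns of an already cleaned entry */

static void cleanup_one_si(struct smi_info *info)
{
	if (info->cleaned) {
		double_cleanups++;	/* the kernel would warn in list_del() here */
		return;
	}
	info->cleaned = true;
}

/* Models the hotmod removal path, with and without the patch. */
static void hotmod_remove(struct smi_info *e, bool clear_drvdata)
{
	if (clear_drvdata && e->dev)
		e->dev->drvdata = NULL;
	cleanup_one_si(e);
}

/* Models a patched remove callback such as ipmi_pnp_remove(). */
static void bus_remove(struct device *dev)
{
	struct smi_info *info = dev->drvdata;

	if (info)
		cleanup_one_si(info);
}
```

Without clearing drvdata, the remove callback still finds the stale
pointer and cleans up a second time; with clearing, the NULL check makes
the callback a no-op.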



Re: ftrace/kprobes: Warning when insmod two modules

2014-04-24 Thread Takao Indoh
(2014/04/23 11:37), Masami Hiramatsu wrote:
> (2014/04/23 10:56), Steven Rostedt wrote:
>> On Wed, 23 Apr 2014 10:26:00 +0900
>> Masami Hiramatsu <masami.hiramatsu...@hitachi.com> wrote:
>>
>>
>>> Agreed. That should be done in a protected (critical) region,
>>> and the region must be protected by correct lock. It seems that
>>> the ftrace_lock is not a correct one.
>>
>> The setting of RO to RW done by ftrace before doing the normal
>> modification is under the ftrace_lock mutex. Why wouldn't that be the
>> correct lock?
> 
> Hmm, Ok. I checked that currently ftrace is the only user of
> set_all_modules_text_rw(), so until another user appears,
> ftrace_lock mutex can work.  (and also, we need a comment
> on the top of such functions, about by what it is protected. )
> 
>> The issue today is with the loading of a module and ftrace
>> expecting its code to be RW. Here's the current race:
>>
>>
>>  CPU 1                                   CPU 2
>>  -----                                   -----
>>  load_module()
>>   module->state = MODULE_STATE_COMING
>>
>>                                          register_ftrace_function()
>>                                           mutex_lock(&ftrace_lock);
>>                                           ftrace_startup()
>>                                            update_ftrace_function();
>>                                             ftrace_arch_code_modify_prepare()
>>                                              set_all_module_text_rw();
>>                                             <enables-ftrace>
>>                                             ftrace_arch_code_modify_post_process()
>>                                              set_all_module_text_ro();
>>
>>                                          [ here all module text is set to RO,
>>                                            including the module that is
>>                                            loading!! ]
>>
>>  blocking_notifier_call_chain(MODULE_STATE_COMING);
>>   ftrace_init_module()
>>
>>    [ tries to modify code, but it's RO, and fails! ]
>>
>> One solution is to add a way to set a single module text to ro and rw,
>> and then we can encapsulate ftrace_init_module() under ftrace_lock
>> mutex and have the ftrace_init_module() set the text to RW and then
>> back to RO, and this will keep ftrace from having issues with the
>> loaded module.
> 
> It sounds like a nicer solution, with fewer side effects.
> 
>> Now, if text poke does something similar, we need to make another mutex
>> that covers modifying text. Don't we have one already?
> 
> We have the text_mutex already :).
> 
>> The worry I have here, and why I still prefer the simple split state of
>> MODULE_STATE_COMING, is that once you add another mutex, we now have to
>> fight mutex ordering. Not to mention where else things might do this :-p
> 
> I see, however, we should take care of it, at least at the comment level.

Ok, I'll do this. Something like this, right?

static void ftrace_init_module(struct module *mod,
   unsigned long *start, unsigned long *end)
{
	if (ftrace_disabled || start == end)
		return;

	/*
	 * Need ftrace_lock here to prevent someone from changing the module
	 * text to RO by set_all_modules_text_ro(). Currently ftrace is the
	 * only user of set_all_modules_text_ro(), so until another user
	 * appears, ftrace_lock mutex can work.
	 */
	mutex_lock(&ftrace_lock);

	set_one_module_text_rw(mod);
	ftrace_process_locs(mod, start, end);
	set_one_module_text_ro(mod);

	mutex_unlock(&ftrace_lock);
}
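
The ordering proposed above (lock, set RW, patch, set RO, unlock) can be
modelled in userspace roughly as follows. This is a single-threaded sketch
under stated assumptions: the lock is a plain flag rather than a mutex,
struct module only tracks its protection state, and patch_site(),
lock_ftrace(), unlock_ftrace() and init_module_sites() are hypothetical
names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a module: only the text protection state matters. */
struct module {
	bool text_rw;		/* true while the text is writable */
	int patched_sites;	/* call sites rewritten so far */
};

static bool ftrace_lock_held;	/* stands in for the ftrace_lock mutex */

static void lock_ftrace(void)
{
	assert(!ftrace_lock_held);	/* single-threaded model: no contention */
	ftrace_lock_held = true;
}

static void unlock_ftrace(void)
{
	assert(ftrace_lock_held);
	ftrace_lock_held = false;
}

/* Models one text write; it fails if the text is read-only. */
static bool patch_site(struct module *mod)
{
	if (!mod->text_rw)
		return false;
	mod->patched_sites++;
	return true;
}

/* Mirrors the proposed ordering: lock, RW, patch, RO, unlock. */
static bool init_module_sites(struct module *mod, int nr_sites)
{
	bool ok = true;

	lock_ftrace();
	mod->text_rw = true;
	for (int i = 0; i < nr_sites; i++)
		ok = patch_site(mod) && ok;
	mod->text_rw = false;
	unlock_ftrace();
	return ok;
}
```

Because the protection change and the writes happen inside the same
locked region, no other holder of the lock can flip the text back to RO
between the two steps, and the module ends up read-only again afterwards.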

Thanks,
Takao Indoh


Re: ftrace/kprobes: Warning when insmod two modules

2014-04-22 Thread Takao Indoh
(2014/04/22 16:28), Masami Hiramatsu wrote:
> (2014/04/22 14:29), Takao Indoh wrote:
>> (2014/04/22 12:51), Rusty Russell wrote:
>>> Steven Rostedt  writes:
>>>> On Mon, 24 Mar 2014 20:26:05 +0900
>>>> Masami Hiramatsu  wrote:
>>>>
>>>>
>>>>> Thank you for reporting with this pretty backtrace :)
>>>>> Steven, I think this is not the kprobe bug but ftrace (and perhaps, 
>>>>> module).
>>>>
>>>> Looks to be more of a module issue than a ftrace issue.
>>>>
>>>>>
>>>>> If the ftrace can set loading module text read only before the module 
>>>>> subsystem
>>>>> expected, I think it should be protected by the module subsystem itself
>>>>> (e.g. set_all_modules_text_ro(rw) skips the modules which is 
>>>>> MODULE_STATE_COMING)
>>>>>
>>>>
>>>> Does this patch fix it?
>>>>
>>>> In-review-off-by: Steven Rostedt 
>>>
>>> Sorry, was on paternity leave.
>>>
>>> I'm always nervous about adding more states, since every place which
>>> examines the state has to be audited.
>>>
>>> We set the mod->state to MOD_STATE_COMING in complete_formation;
>>> why don't we set NX there instead?  It also makes more sense to
>>> set NX before we hit parse_args() which can execute code in the module.
>>>
>>> In fact, we should probably call the notifier there too, so people
>>> can breakpoint/tracepoint/etc parameter calls.
>>>
>>> Of course, this means that we set NX before the notifier; does anything
>>> break?
>>
>> This does not work. ftrace_process_locs() is called from the notifier,
>> and it tries to change its text like this.
>>
>> load_module
>>blocking_notifier_call_chain
>>  ftrace_module_notify_enter
>>ftrace_init_module
>>  ftrace_process_locs
>>sort
>>  ftrace_swap_ips
>>
>> But the text is already RO, so it causes panic. We need to call notifier
>> before setting it RO. Or should we unset RO temporarily in
>> ftrace_process_locs()?
> 
> Perhaps, IMHO, ftrace needs to change the module RW in ftrace_init_module and
> makes it RO after modifying the module text.

Hmm..., I think the same problem occurs if we set module RW in
ftrace_init_module().


init_module
  load_module
    complete_formation
      set_section_ro_nx <-- (1)
      set_section_ro_nx <-- (2)
      blocking_notifier_call_chain
        ftrace_module_notify_enter
          ftrace_init_module <-- (3)
            ftrace_process_locs
              mutex_lock(&ftrace_lock) <-- (4)
              ftrace_update_code
                __ftrace_replace_code
                  ftrace_make_nop
                    ftrace_modify_code_direct
                      do_ftrace_mod_code
                        probe_kernel_write <-- (5)


The text of module B is set to RO at (1) and (2) by Rusty's patch. And
even if we change it to RW at (3), it is set to RO again by another module
while module B is waiting at (4).

So, we need to set the module to RW somewhere after getting ftrace_lock, maybe
in ftrace_update_code()?
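
The interleaving described above can be replayed deterministically in a
small userspace model. The sketch assumes that the other CPU's ftrace
update runs to completion while module B waits for the lock at (4);
other_cpu_ftrace_update(), load_module_b() and the rw_after_lock switch
are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct module {
	bool text_rw;	/* true while the module text is writable */
};

/* What the other CPU does while it holds ftrace_lock. */
static void other_cpu_ftrace_update(struct module *all[], int n)
{
	for (int i = 0; i < n; i++)
		all[i]->text_rw = true;		/* set_all_modules_text_rw() */
	/* ... code patching would happen here ... */
	for (int i = 0; i < n; i++)
		all[i]->text_rw = false;	/* set_all_modules_text_ro() */
}

/* Returns whether the write at step (5) would succeed for module B. */
static bool load_module_b(struct module *b, bool rw_after_lock)
{
	struct module *all[] = { b };
	bool ok;

	if (!rw_after_lock)
		b->text_rw = true;	/* (3): RW before taking the lock */

	/* (4): B waits for ftrace_lock; the other CPU finishes first. */
	other_cpu_ftrace_update(all, 1);

	/* B now holds the lock. */
	if (rw_after_lock)
		b->text_rw = true;

	ok = b->text_rw;	/* (5): the write needs RW text */
	b->text_rw = false;
	return ok;
}
```

With the RW change made before the lock is taken, the other CPU's
set-all-RO pass undoes it and the write fails; making the change after
the lock is acquired avoids that window.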

Thanks,
Takao Indoh
