from:"Steven Rostedt"

Re: [PATCH 6/7] qemu: Implement virtio-pstore device

2016-07-28 Thread Steven Rostedt

On Thu, 28 Jul 2016 14:39:53 +0900
Namhyung Kim  wrote:

> Well, I dont' know.  As you know, the kernel oops dump is already sent
> to serial device but it's rather slow.  As I wrote in the cover
> letter, enabling ftrace_dump_on_oops makes it even worse..  Also
> pstore saves the (compressed) binary data so I thought it'd be better
> to have a dedicated IO channel.

BTW, I agree with this. It is better to have a quick way to grab the
ftrace buffers when a crash happens, as serial is excruciatingly slow.
Although, currently I still use kexec/kdump, but as Namhyung said, it
depends on crash being up to date. I tend to be sending in updates
every time I have to use it.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions

2016-08-08 Thread Steven Rostedt


Hmm, I'm guessing this patch got lost.

-- Steve


On Wed, 25 May 2016 13:47:26 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
> after enabling function tracer. I asked him to bisect the functions within
> available_filter_functions, which he did and it came down to three:
> 
>  _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
> 
> It was found that this is only an issue when noreplace-paravirt is added to
> the kernel command line.
> 
> This means that those functions are most likely called within critical
> sections of the funtion tracer, and must not be traced.
> 
> In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
> longer an issue. But both _paravirt_ident_{32,64}() causes the following
> splat when they are traced:
> 
>  mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070)
>  mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054)
>  NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
>  Modules linked in: e1000e
>  CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 
> 05/07/2012
>  task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000
>  RIP: 0010:[]  [] 
> queued_spin_lock_slowpath+0x118/0x1a0
>  RSP: 0018:8800d4aefb90  EFLAGS: 0246
>  RAX:  RBX:  RCX: 88011eb16d40
>  RDX: 82485760 RSI: 1f288820 RDI: ea008030
>  RBP: 8800d4aefb90 R08: 000c R09: 
>  R10: 821c8e0e R11:  R12: 88200fb8
>  R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0
>  FS:  7f7a4e3d7840() GS:88011eb0() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0
>  Stack:
>   8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b
>   8800d4aefcb0 8800d4aefbd0 811bf4cb 0002
>   0015 8800d2276050 8000c0fd8867 ea008030
>  Call Trace:
>   [] _raw_spin_lock+0x27/0x30
>   [] handle_pte_fault+0x13db/0x16b0
>   [] ? function_trace_call+0x15b/0x180
>   [] ? handle_pte_fault+0x5/0x16b0
>   [] handle_mm_fault+0x312/0x670
>   [] ? find_vma+0x68/0x70
>   [] __do_page_fault+0x1b1/0x4e0
>   [] do_page_fault+0x22/0x30
>   [] page_fault+0x28/0x30
>   [] ? copy_user_enhanced_fast_string+0x5/0x10
>   [] ? seq_read+0x305/0x370
>   [] __vfs_read+0x28/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] vfs_read+0x86/0x130
>   [] SyS_read+0x46/0xa0
>   [] entry_SYSCALL_64_fastpath+0x1e/0xa8
>  Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14
>  c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7
>  4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
> 
> Reported-by: Łukasz Daniluk <lukasz.dani...@intel.com>
> Signed-off-by: Steven Rostedt <rost...@goodmis.org>
> 
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index f08ac28b8136..f975d226be6e 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n"
>   ".popsection");
>  
>  /* identity function, which can be inlined */
> -u32 _paravirt_ident_32(u32 x)
> +u32 notrace _paravirt_ident_32(u32 x)
>  {
>   return x;
>  }
>  
> -u64 _paravirt_ident_64(u64 x)
> +u64 notrace _paravirt_ident_64(u64 x)
>  {
>   return x;
>  }

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions

2016-09-02 Thread Steven Rostedt


I just spent half a day bisecting function tracing because I tripped
over this again. I thought this was merged, but I guess it was missed
again.

Can someone please pull this in. And mark it for stable, it goes
probably as far back as 2.6.32.

Thanks!

-- Steve

On Wed, 25 May 2016 13:47:26 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
> after enabling function tracer. I asked him to bisect the functions within
> available_filter_functions, which he did and it came down to three:
> 
>  _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
> 
> It was found that this is only an issue when noreplace-paravirt is added to
> the kernel command line.
> 
> This means that those functions are most likely called within critical
> sections of the funtion tracer, and must not be traced.
> 
> In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
> longer an issue. But both _paravirt_ident_{32,64}() causes the following
> splat when they are traced:
> 
>  mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070)
>  mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054)
>  NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
>  Modules linked in: e1000e
>  CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 
> 05/07/2012
>  task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000
>  RIP: 0010:[]  [] 
> queued_spin_lock_slowpath+0x118/0x1a0
>  RSP: 0018:8800d4aefb90  EFLAGS: 0246
>  RAX:  RBX:  RCX: 88011eb16d40
>  RDX: 82485760 RSI: 1f288820 RDI: ea008030
>  RBP: 8800d4aefb90 R08: 000c R09: 
>  R10: 821c8e0e R11:  R12: 88200fb8
>  R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0
>  FS:  7f7a4e3d7840() GS:88011eb0() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0
>  Stack:
>   8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b
>   8800d4aefcb0 8800d4aefbd0 811bf4cb 0002
>   0015 8800d2276050 8000c0fd8867 ea008030
>  Call Trace:
>   [] _raw_spin_lock+0x27/0x30
>   [] handle_pte_fault+0x13db/0x16b0
>   [] ? function_trace_call+0x15b/0x180
>   [] ? handle_pte_fault+0x5/0x16b0
>   [] handle_mm_fault+0x312/0x670
>   [] ? find_vma+0x68/0x70
>   [] __do_page_fault+0x1b1/0x4e0
>   [] do_page_fault+0x22/0x30
>   [] page_fault+0x28/0x30
>   [] ? copy_user_enhanced_fast_string+0x5/0x10
>   [] ? seq_read+0x305/0x370
>   [] __vfs_read+0x28/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] vfs_read+0x86/0x130
>   [] SyS_read+0x46/0xa0
>   [] entry_SYSCALL_64_fastpath+0x1e/0xa8
>  Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14
>  c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7
>  4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
> 
> Reported-by: Łukasz Daniluk <lukasz.dani...@intel.com>
> Signed-off-by: Steven Rostedt <rost...@goodmis.org>
> 
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index f08ac28b8136..f975d226be6e 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n"
>   ".popsection");
>  
>  /* identity function, which can be inlined */
> -u32 _paravirt_ident_32(u32 x)
> +u32 notrace _paravirt_ident_32(u32 x)
>  {
>   return x;
>  }
>  
> -u64 _paravirt_ident_64(u64 x)
> +u64 notrace _paravirt_ident_64(u64 x)
>  {
>   return x;
>  }

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v3 21/27] x86/ftrace: Adapt function tracing for PIE support

2018-05-24 Thread Steven Rostedt

On Thu, 24 May 2018 13:40:24 +0200
Petr Mladek  wrote:

> On Wed 2018-05-23 12:54:15, Thomas Garnier wrote:
> > When using -fPIE/PIC with function tracing, the compiler generates a
> > call through the GOT (call *__fentry__@GOTPCREL). This instruction
> > takes 6 bytes instead of 5 on the usual relative call.
> > 
> > If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop
> > so ftrace can handle the previous 5-bytes as before.
> > 
> > Position Independent Executable (PIE) support will allow to extended the
> > KASLR randomization range below the -2G memory limit.
> > 
> > Signed-off-by: Thomas Garnier 
> > ---
> >  arch/x86/include/asm/ftrace.h   |  6 +++--
> >  arch/x86/include/asm/sections.h |  4 
> >  arch/x86/kernel/ftrace.c| 42 +++--
> >  3 files changed, 48 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > index c18ed65287d5..8f2decce38d8 100644
> > --- a/arch/x86/include/asm/ftrace.h
> > +++ b/arch/x86/include/asm/ftrace.h
> > @@ -25,9 +25,11 @@ extern void __fentry__(void);
> >  static inline unsigned long ftrace_call_adjust(unsigned long addr)
> >  {
> > /*
> > -* addr is the address of the mcount call instruction.
> > -* recordmcount does the necessary offset calculation.
> > +* addr is the address of the mcount call instruction. PIE has always a
> > +* byte added to the start of the function.
> >  */
> > +   if (IS_ENABLED(CONFIG_X86_PIE))
> > +   addr -= 1;  
> 
> This seems to modify the address even for modules that are _not_ compiled with
> PIE, see below.

Can one load a module not compiled for PIE in a kernel with PIE?

> 
> > return addr;
> >  }
> >  
> > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > index 01ebcb6f263e..73b3c30cb7a3 100644
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -135,6 +135,44 @@ ftrace_modify_code_direct(unsigned long ip, unsigned 
> > const char *old_code,
> > return 0;
> >  }
> >  
> > +/* Bytes before call GOT offset */
> > +const unsigned char got_call_preinsn[] = { 0xff, 0x15 };
> > +
> > +static int
> > +ftrace_modify_initial_code(unsigned long ip, unsigned const char *old_code,
> > +  unsigned const char *new_code)
> > +{
> > +   unsigned char replaced[MCOUNT_INSN_SIZE + 1];
> > +
> > +   ftrace_expected = old_code;
> > +
> > +   /*
> > +* If PIE is not enabled or no GOT call was found, default to the
> > +* original approach to code modification.
> > +*/
> > +   if (!IS_ENABLED(CONFIG_X86_PIE) ||
> > +   probe_kernel_read(replaced, (void *)ip, sizeof(replaced)) ||
> > +   memcmp(replaced, got_call_preinsn, sizeof(got_call_preinsn)))
> > +   return ftrace_modify_code_direct(ip, old_code, new_code);  
> 
> And this looks like an attempt to handle modules compiled without
> PIE. Does it works with the right ip in that case?

I'm guessing the || is for the "or no GOT call was found", but it
doesn't explain why no GOT would be found.

> 
> I wonder if a better solution would be to update
> scripts/recordmcount.c to store the incremented location into the module.

If recordmcount.c can handle this, then I think that's the preferred
approach. Thanks!

-- Steve

> 
> IMPORTANT: I have only vague picture about how this all works. It is
> possible that I am completely wrong. The code might be correct,
> especially if you tested this situation.
> 
> Best Regards,
> Petr

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: net-next boot error

2018-07-26 Thread Steven Rostedt



[ Added Thomas Gleixner ]


On Thu, 26 Jul 2018 11:34:39 +0200
Dmitry Vyukov  wrote:

> On Thu, Jul 26, 2018 at 11:29 AM, syzbot
>  wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:dc66fe43b7eb rds: send: Fix dead code in rds_sendmsg
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=127874c840
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=f34ce142a9f5f0e8
> > dashboard link: https://syzkaller.appspot.com/bug?extid=604f8271211546f5b3c7
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+604f8271211546f5b...@syzkaller.appspotmail.com
> >
> > possible deadlock in static_key_slow_incsd 0:0:1:0: [sda] Attached SCSI disk
> > MACsec IEEE 802.1AE
> > tun: Universal TUN/TAP device driver, 1.6
> >
> > 
> > WARNING: possible recursive locking detected  
> 
> +Tetsuo, perhaps this boot lockdep problem then disables lockdep for
> actual testing. I think lockdep should respect panic_on_warn.
> 
> 
> > 4.18.0-rc6+ #141 Not tainted
> > 
> > swapper/0/1 is trying to acquire lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> >
> > but task is already holding lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > include/linux/cpu.h:126 [inline]
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: init_vqs+0xe1a/0x1520
> > drivers/net/virtio_net.c:2777

Here init_vqs() does:

get_online_cpus();
virtnet_set_affinity(vi);
put_online_cpus();

Which disables cpu hotplug and calls virtnet_set_affinity()

Note, get_online_cpus() is no longer recursive.

> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >CPU0
> >
> >   lock(cpu_hotplug_lock.rw_sem);
> >   lock(cpu_hotplug_lock.rw_sem);
> >
> >  *** DEADLOCK ***
> >
> >  May be due to missing lock nesting notation
> >
> > 3 locks held by swapper/0/1:
> >  #0: (ptrval) (>mutex){}, at: device_lock
> > include/linux/device.h:1134 [inline]
> >  #0: (ptrval) (>mutex){}, at: __driver_attach+0x15f/0x2f0
> > drivers/base/dd.c:820
> >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > include/linux/cpu.h:126 [inline]
> >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > init_vqs+0xe1a/0x1520 drivers/net/virtio_net.c:2777
> >  #2: (ptrval) (xps_map_mutex){+.+.}, at:
> > __netif_set_xps_queue+0x243/0x23f0 net/core/dev.c:2278
> >
> > stack backtrace:
> > CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6+ #141
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
> >  print_deadlock_bug kernel/locking/lockdep.c:1765 [inline]
> >  check_deadlock kernel/locking/lockdep.c:1809 [inline]
> >  validate_chain kernel/locking/lockdep.c:2405 [inline]
> >  __lock_acquire.cold.65+0x1fb/0x486 kernel/locking/lockdep.c:3435
> >  lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
> >  percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 [inline]
> >  percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
> >  cpus_read_lock+0x43/0xa0 kernel/cpu.c:289
> >  static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> >  __netif_set_xps_queue+0xaac/0x23f0 net/core/dev.c:2320


__netif_set_xps_queue() calls static_key_slow_inc() which will also do
a get_online_cpus() which will trigger this bug.

There's a static_key_slow_inc_cpuslocked() version that should be used
when get_online_cpus() is already taken, but I see
__netif_set_xps_queue() is called from several places, and I doubt it
is always called with get_online_cpus() held. Thus just using the
cpuslocked() version is probably not sufficient of a fix.

I don't know the code enough to offer other suggestions.

-- Steve


> >  netif_set_xps_queue+0x26/0x30 net/core/dev.c:2455
> >  virtnet_set_affinity+0x2ba/0x4b0 drivers/net/virtio_net.c:1944
> >  init_vqs+0xe22/0x1520 drivers/net/virtio_net.c:2778
> >  virtnet_probe+0x1092/0x2260 drivers/net/virtio_net.c:3016
> >  virtio_dev_probe+0x592/0x942 drivers/virtio/virtio.c:245
> >  really_probe drivers/base/dd.c:446 [inline]
> >  driver_probe_device+0x6ad/0x970 drivers/base/dd.c:588
> >  __driver_attach+0x28b/0x2f0 drivers/base/dd.c:822
> >  bus_for_each_dev+0x15d/0x1f0 drivers/base/bus.c:311
> >  driver_attach+0x3d/0x50 drivers/base/dd.c:841
> >  bus_add_driver+0x4b2/0x600 drivers/base/bus.c:667
> >  driver_register+0x1c8/0x320 drivers/base/driver.c:170
> >  register_virtio_driver+0x79/0xd0

[PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Steven Rostedt

This is taken from the work I did on lguest64.

When killing a guest, we read the guest stack to do a nice back trace of
the guest and send it via printk to the host.

So instead of just getting an error message from the lguest launcher of:

lguest: bad read address 537012178 len 1

I also get in my dmesg:

called from  [c0405f30] show_trace_log_lvl+0x1a/0x2f
 [c04069aa] show_trace+0x12/0x14
 [c0406a03] dump_stack+0x16/0x18
 [f8c746de] lguest_dump_lg_regs+0x22/0x13c [lg]
 [f8c7131b] lgread+0x59/0x90 [lg]
 [f8c715bd] run_guest+0x26b/0x406 [lg]
 [f8c739be] read+0x73/0x7d [lg]
 [c04825e9] vfs_read+0xad/0x161
 [c0482a75] sys_read+0x3d/0x61
 [c0404f34] syscall_call+0x7/0xb
 ===
[f8c7131b] lgread+0x59/0x90 [lg]
Printing LG 0 regs cr3: 021eb000
EIP: 0061:  [e00227d2]
ESP: 0069:c236fe3c  EFLAGS: 00010202
EAX: 0004 EBX: e001fb20 ECX: 0008 EDX: 03f2
ESI: e001ee00 EDI: e001fb60 EBP: c236fea0
 CR2: 1278000  lguest_data-cr2: 80011380
errcode: 0   trapnum: d
Stack Dump:
 [c1042b7a] trace_hardirqs_on+0x125/0x149
 [c123b0ea] wait_for_completion+0x90/0x98
 [c123bddc] __mutex_unlock_slowpath+0x129/0x13e
 [c1048769] unlock_cpu_hotplug+0x62/0x64
 [c104b5b6] sys_init_module+0x14e3/0x162c
 [c1081d44] do_sync_read+0xc2/0xff
 [c1004f85] restore_nocheck+0x12/0x15
 [c1004f34] syscall_call+0x7/0xb


TODO:

  - Clean up a little (still has stuff from lguest64 in it).
  - Perhaps make a config option or runtime switch to turn it off.
  - Send to the launcher the dump instead of printk.
  - make modules work too.

Also I need to change the %u of the bad read print to a %x, because
seeing 0x200227d2 is better than seeing 537012178 for addresses.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]

Index: linux-2.6.21-rc5-mm2/drivers/lguest/Makefile
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/Makefile
+++ linux-2.6.21-rc5-mm2/drivers/lguest/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_LGUEST_GUEST) += lguest.o l
 # Host requires the other files, which can be a module.
 obj-$(CONFIG_LGUEST)   += lg.o
 lg-objs := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
-   segments.o io.o lguest_user.o hypervisor.o
+   segments.o io.o lguest_user.o hypervisor.o lguest_debug.o
Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -210,6 +210,28 @@ int lguest_address_ok(const struct lgues
 }
 
 /* Just like get_user, but don't let guest access lguest binary. */
+u8 lgread_u8(struct lguest *lg, u32 addr)
+{
+   u8 val = 0;
+
+   /* Don't let them access lguest binary */
+   if (!lguest_address_ok(lg, addr)
+   || get_user(val, (u32 __user *)addr) != 0)
+   kill_guest(lg, bad read address %u, addr);
+   return val;
+}
+
+u16 lgread_u16(struct lguest *lg, u32 addr)
+{
+   u16 val = 0;
+
+   /* Don't let them access lguest binary */
+   if (!lguest_address_ok(lg, addr)
+   || get_user(val, (u32 __user *)addr) != 0)
+   kill_guest(lg, bad read address %u, addr);
+   return val;
+}
+
 u32 lgread_u32(struct lguest *lg, u32 addr)
 {
u32 val = 0;
Index: linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/lg.h
+++ linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
@@ -176,6 +176,8 @@ extern struct mutex lguest_lock;
 /* core.c: */
 /* Entry points in hypervisor */
 const unsigned long *__lguest_default_idt_entries(void);
+u8 lgread_u8(struct lguest *lg, u32 addr);
+u16 lgread_u16(struct lguest *lg, u32 addr);
 u32 lgread_u32(struct lguest *lg, u32 addr);
 void lgwrite_u32(struct lguest *lg, u32 val, u32 addr);
 void lgread(struct lguest *lg, void *buf, u32 addr, unsigned bytes);
@@ -238,6 +240,7 @@ int hypercall(struct lguest *info, struc
 #define kill_guest(lg, fmt...) \
 do {   \
if (!(lg)-dead) {  \
+   lguest_dump_lg_regs(lg);\
(lg)-dead = kasprintf(GFP_ATOMIC, fmt);\
if (!(lg)-dead)\
(lg)-dead = (void *)1; \
@@ -248,5 +251,11 @@ static inline unsigned long guest_pa(str
 {
return vaddr - lg-page_offset;
 }
+
+/* lguest_debug.c */
+void lguest_print_address(struct lguest *lg, unsigned long address);
+void lguest_dump_trace(struct lguest *lg, struct lguest_regs *regs);
+void lguest_dump_lg_regs(struct lguest *lg);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _LGUEST_H */
Index: linux-2.6.21-rc5-mm2/drivers/lguest/lguest.c

[PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Steven Rostedt

Currently the lguest32 error messages from bad reads and writes prints a
decimal integer for addresses. This is pretty annoying. So this patch
changes those to be hex outputs.

This is applied on top of my debug patch.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]

Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -220,7 +220,7 @@ u8 lgread_u8(struct lguest *lg, u32 addr
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, bad read address %u, addr);
+   kill_guest(lg, bad read address %x, addr);
return val;
 }
 
@@ -231,7 +231,7 @@ u16 lgread_u16(struct lguest *lg, u32 ad
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, bad read address %u, addr);
+   kill_guest(lg, bad read address %x, addr);
return val;
 }
 
@@ -242,7 +242,7 @@ u32 lgread_u32(struct lguest *lg, u32 ad
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, bad read address %u, addr);
+   kill_guest(lg, bad read address %x, addr);
return val;
 }
 
@@ -250,7 +250,7 @@ void lgwrite_u32(struct lguest *lg, u32 
 {
if (!lguest_address_ok(lg, addr)
|| put_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, bad write address %u, addr);
+   kill_guest(lg, bad write address %x, addr);
 }
 
 void lgread(struct lguest *lg, void *b, u32 addr, unsigned bytes)
@@ -259,7 +259,7 @@ void lgread(struct lguest *lg, void *b, 
|| copy_from_user(b, (void __user *)addr, bytes) != 0) {
/* copy_from_user should do this, but as we rely on it... */
memset(b, 0, bytes);
-   kill_guest(lg, bad read address %u len %u, addr, bytes);
+   kill_guest(lg, bad read address %x len %u, addr, bytes);
}
 }
 
@@ -268,7 +268,7 @@ void lgwrite(struct lguest *lg, u32 addr
if (addr + bytes  addr
|| !lguest_address_ok(lg, addr+bytes)
|| copy_to_user((void __user *)addr, b, bytes) != 0)
-   kill_guest(lg, bad write address %u len %u, addr, bytes);
+   kill_guest(lg, bad write address %x len %u, addr, bytes);
 }
 
 static void set_ts(unsigned int guest_ts)


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Steven Rostedt

On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:

  (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
  for %#x (or 0x%x)?  I'm easily confused.
 
 How about %p for pointers?

But that would require casting the numbers to pointers.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Steven Rostedt

On Thu, 2007-04-05 at 12:54 +1000, Rusty Russell wrote:

 
   This is a cool idea, but there are two issues with this patch.  The
 first is that it's 500 lines of code: that's around +10% on lguest's
 total code size!  The second is that it conflicts with the medium-term
 plan to allow any user to run up lguests: this is why lg.ko never
 printk()s about problems with the guest.

Not much I can do about the size, but it's in the debug section so
hopefully it's not considered too bad :)

 
 While it is useful for cases where a guest dies mysteriously before it
 brings up the console, three alternatives come to mind:
 
 1) Modify early_printk so Guests can use it.
 2) Have a separate tool(-set?) for this kind of post-mortem.  Then you
 just have to implement guest suspend! 8)
 3) Put this in a CONFIG_LGUEST_DEBUG.
 
 Note that options 1 or 2 make you do more work, but are probably better
 in the long term.  I'm happy for #3 to sit as a patch in the tree for
 the duration, tho!

OK, I'll make a #3 patch to send, but the #1 looks best. Not to mention
that I still need to make it so that the console can read it.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH] Lguest launcher, child starving parent

2007-04-05 Thread Steven Rostedt

Glauber noticed long delays between hitting a key, and seeing data come
up on the virtual console.  Looking into this, I found that the
wake_parent routine that reads from all devices was actually starving
out the parent after sending the parent a signal to wake up.

The thing is, the child which takes the console input is recognized by
the scheduler as an interactive process.  The parent, doesn't do so
much, so it is recognized more as a CPU hog. So the child easily gets a
higher priority than the parent.

So when the child receives data from the console, it sends a signal to
the parent and then does another select.  The problem is that the select
doesn't actually read from the device, and will return immediately since
their is still data pending until it is read.  But it's the parent that
reads the data.  So the child actually starves the parent from reading
the data by spinning and waiting for it to read the data.

The fix I implemented was to have the child wait for a response from the
parent before going on. Since there was already communication between
the parent and child via a pipe, I used that. This time, the data
returned by the pipe is either everything went fine (fd == -1) or the
device should be ignored (fd = 0).

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]

Index: linux-2.6-lguest/Documentation/lguest/lguest.c
===
--- linux-2.6-lguest.orig/Documentation/lguest/lguest.c 2007-04-05 
16:13:08.0 -0400
+++ linux-2.6-lguest/Documentation/lguest/lguest.c  2007-04-05 
16:16:31.0 -0400
@@ -328,15 +328,15 @@ static void wake_parent(int pipefd, stru
 
for (;;) {
fd_set rfds = devices-infds;
+   int ignorefd;
 
select(devices-max_infd+1, rfds, NULL, NULL, NULL);
-   if (FD_ISSET(pipefd, rfds)) {
-   int ignorefd;
-   if (read(pipefd, ignorefd, sizeof(ignorefd)) == 0)
-   exit(0);
-   FD_CLR(ignorefd, devices-infds);
-   }
kill(getppid(), SIGUSR1);
+   /* wait for parent response */
+   if (read(pipefd, ignorefd, sizeof(ignorefd)) == 0)
+   exit(0);
+   if (ignorefd = 0)
+   FD_CLR(ignorefd, devices-infds);
}
 }
 
@@ -594,6 +594,7 @@ static void handle_output(int fd, unsign
 static void handle_input(int fd, int childfd, struct device_list *devices)
 {
struct timeval poll = { .tv_sec = 0, .tv_usec = 0 };
+   int fdok = -1;
 
for (;;) {
struct device *i;
@@ -608,7 +609,9 @@ static void handle_input(int fd, int chi
FD_CLR(i-fd, devices-infds);
/* Tell child to ignore it too... */
write(childfd, i-fd, sizeof(i-fd));
-   }
+   } else
+   /* Tell child to continue */
+   write(childfd, fdok, sizeof(fdok));
}
}
}


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: lguest over qemu

2007-07-12 Thread Steven Rostedt



   (Note that lguest doesn't support NMIs, but Steven has code for NMI
 support for lguest-x86-64 which could be ported across).

Rusty,

About that.  Is there a way to get a NMI only stack in i386? In x86_64
it's part of the TSS. So I can always know I have a good stack for the
NMI. I'm not sure i368 has the same thing. Or do we always have a good
stack whenever we are in ring0?

Oh, and btw, I've just rewrote all of the Lguest64 page table handling.
I'm just going over one design change that is really bothering me. In
x86_64 we can have 2M or 4k pages (like the PSE in i386). But since 4K
pages are used by the shadow page tables, I have to map them like that.
But this means that I can have the same guest address as both a PMD and a
PTE. Which is breaking some of my code. I'm working on a fix as I write
this.

But to get you up-to-date to where I'm at. I've implemented a way to have
the HV mapped uniquely for all VCPUs.  So there's a HV text section (the
same for all VCPUs), a HV VCPU Data section (readonly in all rings with
guest cr3), and a HV VCPU Scratch Pad section (read/write in rings
0,1,and2).  So now the guest kernel runs in ring 1. With this change, I
already implemented a syscall trampoline that no longer needs to switch to
the host, as well as iretq by the guest kernel goes directly to the guest
user space (or kernel).  The next version of lguest64 will be much cleaner
and faster

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt


Hi Andi,

Thanks for all the comments, it's greatly appreciated.

On Wed, 8 Aug 2007, Andi Kleen wrote:


  +#define SYSRETQ\
  +   movq%gs:pda_oldrsp,%rsp;\
  +   swapgs; \
  +   sysretq;

 When the macro does more than sysret it should have a different
 name

Noted. Do you have a better idea? Something like SETSTACK_SWAPGS_SYSRETQ?



*/
  .globl int_ret_from_sys_call
   int_ret_from_sys_call:
  -   cli
  +   DISABLE_INTERRUPTS(CLBR_ANY)

 ANY? There are certainly some registers alive at this point like rax

Glauber will need to address this, this is his code ;-)


   retint_restore_args:
  -   cli
  +   DISABLE_INTERRUPTS(CLBR_ANY)

 Similar.


  /*
   * The iretq could re-enable interrupts:
   */
  @@ -566,10 +587,14 @@ retint_restore_args:
   restore_args:
  RESTORE_ARGS 0,8,0
   iret_label:
  -   iretq
  +#ifdef CONFIG_PARAVIRT
  +   INTERRUPT_RETURN
  +ENTRY(native_iret)

 ENTRY adds alignment. Why do you need that export anyways?

The paravirt ops struct points to it.


  +#endif
  +1: iretq
 
  .section __ex_table,a
  -   .quad iret_label,bad_iret
  +   .quad 1b, bad_iret

 iret_label seems more expressive to me than 1

The reason for this change is because of the added:
#ifdef CONFIG_PARAVIRT
INTERRUPT_RETURN
ENTRY(native_iret)
#endif

If we are paravirt ops, we need the iretq in the exception table, not the
paravit ops function call.  Since that function call may simply call the
native_iretq, and if we take a fault at the iretq, it wont be in the
exception table.


  +   ENABLE_INTERRUPTS(CLBR_NONE)

 In many of the CLBR_NONEs there are actually some registers free;
 but it might be safer to keep it this way. But if some client can get
 significantly better code with one or two free registers it might
 be worthwhile to investigate.

Glauber, that comment is for you.


  -   swapgs
  +   SWAPGS_NOSTACK

 There's still stack here

OK, bad name then. How about SWAPGS_UNTRUSTED_STACK?

From earlier in the file where SWAPGS_NOSTACK is declared we have:


/* Currently paravirt can't handle swapgs nicely when we
 * don't have a stack.  So we either find a way around these
 * or just fault and emulate if a guest tries to call swapgs
 * directly.
 *
 * Either way, this is a good way to document that we don't
 * have a reliable stack.
 */
#define SWAPGS_NOSTACK  swapgs


   paranoid_restore\trace:
  RESTORE_ALL 8
  -   iretq
  +   INTERRUPT_RETURN

 I suspect Xen will need much more changes anyways because of its
 ring 3 guest. Are these changes sufficient for lguest?

Probably not, but this part of the code I don't fully understand. But just
doing this doesn't break lguest.  But perhaps it only did not break it yet
;-)


Thanks,

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

2007-08-08 Thread Steven Rostedt


On Wed, 8 Aug 2007, Andi Kleen wrote:

 Strictly you would at least need a !X86_VSMP dependency, but
 with the vsmp change i requested that will be unnecessary

 Is this really synced with the latest version of the i386 code?

Glauber started the paravirt ops 64 a second time around, from scratch
using the PVOPS of i386 as a base. But since we couldn't just take a the
PVOPS patch from i386 and apply it to x86_64, this was mainly done by
looking at i386 code and massaging it for x86_64. Somethings may have
slipped (and we may have been looking at different versions of PVOPS).
It's not easy trying to keep up with a moving target ;-)

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt


On Wed, 8 Aug 2007, Andi Kleen wrote:


  Probably not, but this part of the code I don't fully understand.

 I would suggest to defer all this until at least one example to test it
 (except vsmp which is too simple) is around.

Who uses that code? NMIs and debug regs?  Lguest only has the host handle
the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who
else uses that part of the code?

We can leave the paranoid check change out for this series.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt


--
On Wed, 8 Aug 2007, Andi Kleen wrote:

 Steven Rostedt [EMAIL PROTECTED] writes:

  On Wed, 8 Aug 2007, Andi Kleen wrote:
 
  
Probably not, but this part of the code I don't fully understand.
  
   I would suggest to defer all this until at least one example to test it
   (except vsmp which is too simple) is around.
 
  Who uses that code? NMIs and debug regs?  Lguest only has the host handle
  the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who
  else uses that part of the code?

 I'm not sure I understand your question. You're asking who uses entry.S?
 Answer would be everybody. If you asked something else please reformulate.

When I said this part of the code I don't fully understand I was not
talking about entry.S.  I understand entry.S very well, but the comment
was originally on the paranoid_restore code. Which I thought had to deal
with NMIs and such that I didn't worry about that I simply did the
default.

  paranoid_restore\trace:
   RESTORE_ALL 8
 - iretq
 + INTERRUPT_RETURN

I suspect Xen will need much more changes anyways because of its
ring 3 guest. Are these changes sufficient for lguest?


The above was what I was replying to.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt


--
On Wed, 8 Aug 2007, Andi Kleen wrote:

 On Wed, Aug 08, 2007 at 09:47:05AM -0400, Steven Rostedt wrote:
 
  [...]
  asm volatile (pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;
/* The stack we pushed is off by 8, due to the
  previous pushq */
addq $8, %%rsp

 Weird stack inbalance, i'm surprised that works. %6 must be doing strange
 things.

Heh, no it's very subtle. We do the push %rsp after the push %2. So on
return from the call, the stack that was pushed, is really 8 bytes off.
Because we didn't save the stack pointer from the entry of the asm, we
saved it after doing that push %2. :-)

We get back to this call via a retq from the HV switcher code. So that rsp
that was pushed will be the stack.



  /* Need to read manual, does rdmsr clear
   * the top 32 bits of rax? */
  xor %rax, %rax
 
  movl$MSR_GS_BASE,%ecx
  rdmsr


 This seems incredibly slow. Since GS changes are controlled in
 the kernel why don't you just cache them in the per cpu area?
 The only case where user can reload it is using segment
 selector changes, which can be also handled.

 Also why do you need the guest gs value anyways?

I did a swapgs here, so we have the HV GS value. I don't know of an easy
way to read the GS value to put it into the RSP.


 You could just use your own stack and let the guest
 switch to its own.

Well, it is quite complex, since we can't save anything in the writable
section that the guest can't also write to. Remember, we are in ring 1 for
the guest kernel, so we have no protection. All the host related stuff is
in read only sections when we are using the guest cr3.

What I would like to do is have the GS point to the read only section, and
load the RSP from a value there. But first we need to store the old value
of RSP.  The problem is, where do you store it. Hmm, I guess I can create
my own big offset from the read only section :)

OK this gives me an idea, I *can* point the kernel GS to the vcpu read
only section, and have a way to store the value with the GS offset. The
read only section is always mapped before the read write section, so it
should be easy to calculate the diff!

Thanks for the comment, you guided me to this great idea!

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 2/7] Added generic lg.h in lguest directory.

2007-08-08 Thread Steven Rostedt

Add a generic lg.h file to call the architecture specific one.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 drivers/lguest/lg.h |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
new file mode 100644
index 000..4c4356e
--- /dev/null
+++ b/drivers/lguest/lg.h
@@ -0,0 +1,3 @@
+#ifdef CONFIG_X86_32
+#include i386/lg.h
+#endif
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 5/7] Change lguest launcher to use asm generic include

2007-08-08 Thread Steven Rostedt

Have the lguest launcher include e820.h via asm/e820.h instead of explicitly
saying i386.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 Documentation/lguest/lguest.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index f791840..eb7a01c 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -46,7 +46,7 @@ typedef uint32_t u32;
 typedef uint16_t u16;
 typedef uint8_t u8;
 #include ../../include/linux/lguest_launcher.h
-#include ../../include/asm-i386/e820.h
+#include ../../include/asm/e820.h
 /*:*/
 
 #define PAGE_PRESENT 0x7   /* Present, RW, Execute */
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 7/7] Move lguest_dma_info into generic lg.h

2007-08-08 Thread Steven Rostedt

The lguest_dma_info is also generic across architectures.
Move it to the generic lg.h

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 drivers/lguest/i386/lg.h |   11 ---
 drivers/lguest/lg.h  |   11 +++
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
index c5ea14c..ae83bd2 100644
--- a/drivers/lguest/i386/lg.h
+++ b/drivers/lguest/i386/lg.h
@@ -49,17 +49,6 @@ int init_pagetables(struct page **switcher_page, unsigned 
int pages);
 #define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00})
 #define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300})
 
-struct lguest_dma_info
-{
-   struct list_head list;
-   union futex_key key;
-   unsigned long dmas;
-   u16 next_dma;
-   u16 num_dmas;
-   u16 guestid;
-   u8 interrupt;   /* 0 when not registered */
-};
-
 /*H:310 The page-table code owes a great debt of gratitude to Andi Kleen.  He
  * reviewed the original code which used u32 for all page table entries, and
  * insisted that it would be far clearer with explicit typing.  I thought it
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
index 3147bf6..6057e3c 100644
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -12,6 +12,17 @@ void release_all_dma(struct lguest *lg);
 unsigned long get_dma_buffer(struct lguest *lg, unsigned long key,
 unsigned long *interrupt);
 
+struct lguest_dma_info
+{
+   struct list_head list;
+   union futex_key key;
+   unsigned long dmas;
+   u16 next_dma;
+   u16 num_dmas;
+   u16 guestid;
+   u8 interrupt;   /* 0 when not registered */
+};
+
 #ifdef CONFIG_X86_32
 #include i386/lg.h
 #endif
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 1/7] Move lg.h to the i386 specific lguest directory

2007-08-08 Thread Steven Rostedt

Make a i386 directory in the lguest directory, and move the lg.h into
it. This will clear the way for other archs to have their own lg.h

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 drivers/lguest/i386/lg.h |  300 ++
 drivers/lguest/lg.h  |  300 --
 2 files changed, 300 insertions(+), 300 deletions(-)

diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
new file mode 100644
index 000..64f0abe
--- /dev/null
+++ b/drivers/lguest/i386/lg.h
@@ -0,0 +1,300 @@
+#ifndef _LGUEST_H
+#define _LGUEST_H
+
+#include asm/desc.h
+
+#define GDT_ENTRY_LGUEST_CS10
+#define GDT_ENTRY_LGUEST_DS11
+#define LGUEST_CS  (GDT_ENTRY_LGUEST_CS * 8)
+#define LGUEST_DS  (GDT_ENTRY_LGUEST_DS * 8)
+
+#ifndef __ASSEMBLY__
+#include linux/types.h
+#include linux/init.h
+#include linux/stringify.h
+#include linux/binfmts.h
+#include linux/futex.h
+#include linux/lguest.h
+#include linux/lguest_launcher.h
+#include linux/wait.h
+#include linux/err.h
+#include asm/semaphore.h
+#include irq_vectors.h
+
+#define GUEST_PL 1
+
+struct lguest_regs
+{
+   /* Manually saved part. */
+   unsigned long ebx, ecx, edx;
+   unsigned long esi, edi, ebp;
+   unsigned long gs;
+   unsigned long eax;
+   unsigned long fs, ds, es;
+   unsigned long trapnum, errcode;
+   /* Trap pushed part */
+   unsigned long eip;
+   unsigned long cs;
+   unsigned long eflags;
+   unsigned long esp;
+   unsigned long ss;
+};
+
+void free_pagetables(void);
+int init_pagetables(struct page **switcher_page, unsigned int pages);
+
+/* Full 4G segment descriptors, suitable for CS and DS. */
+#define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00})
+#define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300})
+
+struct lguest_dma_info
+{
+   struct list_head list;
+   union futex_key key;
+   unsigned long dmas;
+   u16 next_dma;
+   u16 num_dmas;
+   u16 guestid;
+   u8 interrupt;   /* 0 when not registered */
+};
+
+/*H:310 The page-table code owes a great debt of gratitude to Andi Kleen.  He
+ * reviewed the original code which used u32 for all page table entries, and
+ * insisted that it would be far clearer with explicit typing.  I thought it
+ * was overkill, but he was right: it is much clearer than it was before.
+ *
+ * We have separate types for the Guest's ptes  pgds and the shadow ptes 
+ * pgds.  There's already a Linux type for these (pte_t and pgd_t) but they
+ * change depending on kernel config options (PAE). */
+
+/* Each entry is identical: lower 12 bits of flags and upper 20 bits for the
+ * page frame number (0 == first physical page, etc).  They are different
+ * types so the compiler will warn us if we mix them improperly. */
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} spgd_t;
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} spte_t;
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} gpgd_t;
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} gpte_t;
+
+/* We have two convenient macros to convert a raw value as handed to us by
+ * the Guest into the correct Guest PGD or PTE type. */
+#define mkgpte(_val) ((gpte_t){.raw.val = _val})
+#define mkgpgd(_val) ((gpgd_t){.raw.val = _val})
+/*:*/
+
+struct pgdir
+{
+   unsigned long cr3;
+   spgd_t *pgdir;
+};
+
+/* This is a guest-specific page (mapped ro) into the guest. */
+struct lguest_ro_state
+{
+   /* Host information we need to restore when we switch back. */
+   u32 host_cr3;
+   struct Xgt_desc_struct host_idt_desc;
+   struct Xgt_desc_struct host_gdt_desc;
+   u32 host_sp;
+
+   /* Fields which are used when guest is running. */
+   struct Xgt_desc_struct guest_idt_desc;
+   struct Xgt_desc_struct guest_gdt_desc;
+   struct i386_hw_tss guest_tss;
+   struct desc_struct guest_idt[IDT_ENTRIES];
+   struct desc_struct guest_gdt[GDT_ENTRIES];
+};
+
+/* We have two pages shared with guests, per cpu.  */
+struct lguest_pages
+{
+   /* This is the stack page mapped rw in guest */
+   char spare[PAGE_SIZE - sizeof(struct lguest_regs)];
+   struct lguest_regs regs;
+
+   /* This is the host state  guest descriptor page, ro in guest */
+   struct lguest_ro_state state;
+} __attribute__((aligned(PAGE_SIZE)));
+
+#define CHANGED_IDT1
+#define CHANGED_GDT2
+#define CHANGED_GDT_TLS4 /* Actually a subset of CHANGED_GDT */
+#define CHANGED_ALL3
+
+/* The private info the thread maintains about the guest. */
+struct lguest
+{
+   /* At end of a page shared mapped over lguest_pages in guest

[PATCH 4/7] Moved the io struct up to the generic lg.h

2007-08-08 Thread Steven Rostedt

Move the io struct into the lg.h file since the io.c is generic to other
architectures.

Also added a proper ifdef for the generic lg.h.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 drivers/lguest/i386/lg.h |   11 ++-
 drivers/lguest/lg.h  |   16 
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
index 64f0abe..c5ea14c 100644
--- a/drivers/lguest/i386/lg.h
+++ b/drivers/lguest/i386/lg.h
@@ -20,6 +20,8 @@
 #include linux/err.h
 #include asm/semaphore.h
 #include irq_vectors.h
+/* some files include this, some include the parent */
+#include ../lg.h
 
 #define GUEST_PL 1
 
@@ -245,15 +247,6 @@ void pin_page(struct lguest *lg, unsigned long vaddr);
 int lguest_device_init(void);
 void lguest_device_remove(void);
 
-/* io.c: */
-void lguest_io_init(void);
-int bind_dma(struct lguest *lg,
-unsigned long key, unsigned long udma, u16 numdmas, u8 interrupt);
-void send_dma(struct lguest *info, unsigned long key, unsigned long udma);
-void release_all_dma(struct lguest *lg);
-unsigned long get_dma_buffer(struct lguest *lg, unsigned long key,
-unsigned long *interrupt);
-
 /* hypercalls.c: */
 void do_hypercalls(struct lguest *lg);
 void write_timestamp(struct lguest *lg);
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
index 4c4356e..3147bf6 100644
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -1,3 +1,19 @@
+#ifndef _LGUEST_LG_GENERIC
+#define _LGUEST_LG_GENERIC
+
+struct lguest;
+
+/* io.c: */
+void lguest_io_init(void);
+int bind_dma(struct lguest *lg,
+unsigned long key, unsigned long udma, u16 numdmas, u8 interrupt);
+void send_dma(struct lguest *info, unsigned long key, unsigned long udma);
+void release_all_dma(struct lguest *lg);
+unsigned long get_dma_buffer(struct lguest *lg, unsigned long key,
+unsigned long *interrupt);
+
 #ifdef CONFIG_X86_32
 #include i386/lg.h
 #endif
+
+#endif /* _LGUEST_LG_GENERIC */
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 6/7] Remove __pa() use in hvc_lguest

2007-08-08 Thread Steven Rostedt

The hvc_lguest uses __pa in the const initialization.
In some architectures, __pa() is not constant so this fails in compiles.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 drivers/char/hvc_lguest.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/char/hvc_lguest.c b/drivers/char/hvc_lguest.c
index feeccba..c7d7d4c 100644
--- a/drivers/char/hvc_lguest.c
+++ b/drivers/char/hvc_lguest.c
@@ -42,7 +42,6 @@
  * use of physical address for the buffer itself. */
 static char inbuf[256];
 static struct lguest_dma cons_input = { .used_len = 0,
-   .addr[0] = __pa(inbuf),
.len[0] = sizeof(inbuf),
.len[1] = 0 };
 
@@ -114,6 +113,13 @@ static struct hv_ops lguest_cons = {
  * (0), and the struct hv_ops containing the put_chars() function. */
 static int __init cons_init(void)
 {
+   /*
+* We can't initialize using __pa in const declarations,
+* since __pa(inbuf) does not evaluate into a constant on
+* all architectures (namely x86_64).
+*/
+   cons_input.addr[0] = __pa(inbuf);
+
if (strcmp(paravirt_ops.name, lguest) != 0)
return 0;
 
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 2/7] Added generic lg.h in lguest directory.

2007-08-08 Thread Steven Rostedt

--
On Thu, 9 Aug 2007, Stephen Rothwell wrote:

 On Wed, 08 Aug 2007 20:32:13 -0400 Steven Rostedt [EMAIL PROTECTED]
 wrote:
 
  Add a generic lg.h file to call the architecture specific one.

 Please combine this with the previous patch so that the tree will build
 at each step.

Yeah, I wanted to do that, but git got confused when I did that, and did
what I didn't want. That was, created a new file in the i386 directory,
and diffed the one in this directory.

I'm really want to keep git knowing the history of the file on moving. I
could just simply cat the two files instead. Would that be acceptible?

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 0/5 -v2] Modify lguest32 to make room for lguest64 (version 2)

2007-08-08 Thread Steven Rostedt

[
  Changes since last version.
  - Move lg.h to include/asm instead (suggested by Rusty Russel)
  - All steps of the series compiles (suggested by Stephen Rothwell)
  - Better ifdef header naming (suggested by Stephen Rothwell)
  - Added Andi Kleen to CC (forgot to on V1)
]

Hi all,

I've been working on lguest64 and in order to do this, I had to move
a lot of the i386 specific out of the way.  Well, the lguest64 port
is still not ready to display, but before Rusty makes too many changes
I would like this in upstream so I don't have to keep repeating my
changes :-)


So this patch series moves lguest32 out of the way for other archs.

-- Steve

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 4/5 -v2] Added generic lg.h in the include/linux directory

2007-08-08 Thread Steven Rostedt

Add a generic lg.h that can be included from lguest files.
This file will hold the data that can be shared across archs.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 arch/i386/kernel/asm-offsets.c|2 +-
 drivers/lguest/core.c |2 +-
 drivers/lguest/hypercalls.c   |2 +-
 drivers/lguest/interrupts_and_traps.c |2 +-
 drivers/lguest/io.c   |2 +-
 drivers/lguest/lguest_user.c  |2 +-
 drivers/lguest/page_tables.c  |2 +-
 drivers/lguest/segments.c |2 +-
 include/asm-i386/lg.h |   24 ++--
 include/linux/lg.h|   30 ++
 10 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/arch/i386/kernel/asm-offsets.c b/arch/i386/kernel/asm-offsets.c
index e426e51..d6d17e5 100644
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -21,7 +21,7 @@
 
 #ifdef CONFIG_LGUEST_GUEST
 #include linux/lguest.h
-#include asm/lg.h
+#include linux/lg.h
 #endif
 
 #define DEFINE(sym, val) \
diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
index e5f441d..5bcd237 100644
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -19,7 +19,7 @@
 #include asm/highmem.h
 #include asm/asm-offsets.h
 #include asm/i387.h
-#include asm/lg.h
+#include linux/lg.h
 
 /* Found in switcher.S */
 extern char start_switcher_text[], end_switcher_text[], switch_to_guest[];
diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c
index 0ddb133..3c0d41e 100644
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -26,7 +26,7 @@
 #include asm/page.h
 #include asm/pgtable.h
 #include irq_vectors.h
-#include asm/lg.h
+#include linux/lg.h
 
 /*H:120 This is the core hypercall routine: where the Guest gets what it
  * wants.  Or gets killed.  Or, in the case of LHCALL_CRASH, both.
diff --git a/drivers/lguest/interrupts_and_traps.c 
b/drivers/lguest/interrupts_and_traps.c
index 899e681..3daf3e2 100644
--- a/drivers/lguest/interrupts_and_traps.c
+++ b/drivers/lguest/interrupts_and_traps.c
@@ -12,7 +12,7 @@
  * them first, so we also have a way of reflecting them into the Guest as if
  * they had been delivered to it directly. :*/
 #include linux/uaccess.h
-#include asm/lg.h
+#include linux/lg.h
 
 /* The address of the interrupt handler is split into two bits: */
 static unsigned long idt_address(u32 lo, u32 hi)
diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c
index 091a46d..d196121 100644
--- a/drivers/lguest/io.c
+++ b/drivers/lguest/io.c
@@ -25,7 +25,7 @@
 #include linux/mm.h
 #include linux/highmem.h
 #include linux/uaccess.h
-#include asm/lg.h
+#include linux/lg.h
 
 /*L:300
  * I/O
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
index a29d85b..2f477df 100644
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -7,7 +7,7 @@
 #include linux/uaccess.h
 #include linux/miscdevice.h
 #include linux/fs.h
-#include asm/lg.h
+#include linux/lg.h
 
 /*L:030 setup_regs() doesn't really belong in this file, but it gives us an
  * early glimpse deeper into the Host so it's worth having here.
diff --git a/drivers/lguest/page_tables.c b/drivers/lguest/page_tables.c
index db3dd8b..812d56a 100644
--- a/drivers/lguest/page_tables.c
+++ b/drivers/lguest/page_tables.c
@@ -13,7 +13,7 @@
 #include linux/random.h
 #include linux/percpu.h
 #include asm/tlbflush.h
-#include asm/lg.h
+#include linux/lg.h
 
 /*M:008 We hold reference to pages, which prevents them from being swapped.
  * It'd be nice to have a callback in the struct mm_struct when Linux wants
diff --git a/drivers/lguest/segments.c b/drivers/lguest/segments.c
index 1d16d5d..351b13f 100644
--- a/drivers/lguest/segments.c
+++ b/drivers/lguest/segments.c
@@ -9,7 +9,7 @@
  * In these modern times, the segment handling code consists of simple sanity
  * checks, and the worst you'll experience reading this code is butterfly-rash
  * from frolicking through its parklike serenity. :*/
-#include asm/lg.h
+#include linux/lg.h
 
 /*H:600
  * We've almost completed the Host; there's just one file to go!
diff --git a/include/asm-i386/lg.h b/include/asm-i386/lg.h
index 64f0abe..c8e01c3 100644
--- a/include/asm-i386/lg.h
+++ b/include/asm-i386/lg.h
@@ -1,5 +1,5 @@
-#ifndef _LGUEST_H
-#define _LGUEST_H
+#ifndef _I386_LGUEST_H
+#define _I386_LGUEST_H
 
 #include asm/desc.h
 
@@ -47,17 +47,6 @@ int init_pagetables(struct page **switcher_page, unsigned 
int pages);
 #define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00})
 #define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300})
 
-struct lguest_dma_info
-{
-   struct list_head list;
-   union futex_key key;
-   unsigned long dmas;
-   u16 next_dma;
-   u16 num_dmas;
-   u16 guestid;
-   u8 interrupt;   /* 0 when not registered */
-};
-
 /*H:310 The page-table code owes a great debt of gratitude to Andi Kleen.  He

[PATCH 2/5 -v2] Change lguest launcher to use asm generic include

2007-08-08 Thread Steven Rostedt

Have the lguest launcher include e820.h via asm/e820.h instead of explicitly
saying i386.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 Documentation/lguest/lguest.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index f791840..eb7a01c 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -46,7 +46,7 @@ typedef uint32_t u32;
 typedef uint16_t u16;
 typedef uint8_t u8;
 #include ../../include/linux/lguest_launcher.h
-#include ../../include/asm-i386/e820.h
+#include ../../include/asm/e820.h
 /*:*/
 
 #define PAGE_PRESENT 0x7   /* Present, RW, Execute */
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-09 Thread Steven Rostedt

--
On Wed, 8 Aug 2007, Jeremy Fitzhardinge wrote:

 Steven Rostedt wrote:
  /*
   * x86 arch doesn't have an easy way to find out where
   * gs is located. So we need to read the MSR. But first
   * we need to save off the rcx, rax and rdx.
 
 Why don't you store it in gs?  movq %gs:my_gs_base, %rax?

Because it can't be trusted.  After the swapgs, we are pointing to the RW
section of the HV. But by running the guest kernel in ring 1, we have no
protection from the guest writing into that area too. So we can't put
anything in that section and expect it to be safe after jumping to guest
code.  The only trusted values, is jumping in after getting there by an
interrupt, where the hardware places the values onto the stack.

But with Andi's comments, I realized I can point the gs pointer to the
RO area. And make a constant offset that will point up into the RW area,
so we could save the stack and replace it.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

2007-08-09 Thread Steven Rostedt

--
On Thu, 9 Aug 2007, Andi Kleen wrote:

  This has to match the normal C calling convention though, doesn't it?

 Native cli/sti/save/restore_flags are all only assembly and can be easily
 (in fact more easily than in C) written as pure assembler functions. Then
 you can use whatever calling convention you want.

I agree.
Should we make a paravirt_ops_asm.S file that can implement these native
funcions, and so we can get rid of the C functions only doing asm?


 While some paravirt implementations may have more complicated implementations
 i guess it's still a reasonable requirement to make them simple enough
 in pure assembler. If not they can use a trampoline, but that's hopefully
 not needed.

It works for lguest64.  I'm sure it should be no problem with other HVs.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH] input: Fix interrupt enable in i8042_ctr when enabling interrupt fails

2007-09-10 Thread Steven Rostedt


--
On Mon, 10 Sep 2007, Markus Armbruster wrote:

 I believe this possible, but unlikely (perhaps not so unlikely on
 virtual machines).  Scenarios involve enable succeeding the first
 time, failing the second time, and succeeding the third time.  I can
 provide details, but the point I'd like to make is not that this is
 broken (although it is, strictly speaking), but that it is not
 obviously correct where it easily could be: just clear the interrupt
 enable bits when writing them to the hardware failed, like the old
 code did.


I also want to stress that this is more of a clean up for technically
correct code than a bug fix.  This bug probably would never happen on
baremetal unless it was running on broken hardware.

  BUT!!!

With more and more systems going to a virtual environment, having a bug or
some other anomaly can trigger the error that this patch prevents. The
patch will also trigger a print in the case of running on a buggy virtual
machine, which would help out the developers of that virtual machine to
fix their code.

Please apply.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 7/24] consolidate msr.h

2007-11-19 Thread Steven Rostedt


On Mon, 19 Nov 2007, Bastian Blank wrote:

 On Fri, Nov 09, 2007 at 04:42:48PM -0200, Glauber de Oliveira Costa wrote:
  -   wrmsrl(MSR_CSTAR, ia32_cstar_target);
  +   wrmsrl(MSR_CSTAR, (u64)ia32_cstar_target);

 Hmm, why do you add explicit casts? The compiler should convert that
 correctly on its own.

  +static inline void wrmsrl(unsigned int msr, unsigned long long val)

 Hmm, long long is 64 bit on all x86, but why not use explicit u64 to
 show that?

(quick reply)

With PVOPS on it gives compiler warnings without that explict cast.
Without looking at the code, IIRC with non-PVOPS it is a macro directly
into asm, so it didn't matter what the cast was. But with PVOPS as a
function, it gave compiler warnings.

Take it out and try compiling it for both i386 and x86_64. One of them
gave warnings. But maybe it's not a problem now.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/6] Staging: hv: Unify hyper-v device abstractions

2011-02-28 Thread Steven Rostedt

Hi!

Just an FYI,

When sending out patch series, could you send out a [PATCH 0/6] first,
and have all other patches a reply to that patch. Both git and quilt
have ways to do this.

The reason is that the linux kernel mailing list gets over 600 emails a
day, and for those of us that skim through, we like to hit [make thread
as read] when we are not interested in the thread or patch set. When
people send out a set of patches that are all individual, this means I
need to hit this command for every patch.

Also do not think I'm just picking on you. I've complained to several
people about doing this. One just recently sent out 21 patches.

Thanks,

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] Re: linux-next: Tree for July 25 (xen)

2011-08-08 Thread Steven Rostedt

On Sat, 2011-08-06 at 11:22 -0700, Jeremy Fitzhardinge wrote:
 On 08/05/2011 06:58 PM, Konrad Rzeszutek Wilk wrote:
  On Fri, Aug 05, 2011 at 11:14:23PM +0200, Ingo Molnar wrote:
  * Ingo Molnar mi...@elte.hu wrote:
 
  * Randy Dunlap rdun...@xenotime.net wrote:
 
  On 08/04/11 15:40, Konrad Rzeszutek Wilk wrote:
  On Thu, Aug 04, 2011 at 06:32:59PM -0400, Konrad Rzeszutek Wilk wrote:
  These build failures are still triggering upstream:
 
   arch/x86/xen/trace.c:44:2: error: array index in initializer not 
  of integer type
   arch/x86/xen/trace.c:44:2: error: (near initialization for 
  ‘xen_hypercall_names’)
   arch/x86/xen/trace.c:45:1: error: ‘__HYPERVISOR_arch_4’ undeclared 
  here (not in a function)
   arch/x86/xen/trace.c:45:2: error: array index in initializer not 
  of integer type
   arch/x86/xen/trace.c:45:2: error: (near initialization for 
  ‘xen_hypercall_names’)
  Oh, that I haven't seen. Can you send me the .config for that please.
  You can't be trying very hard then.  I see lots of these (but no,
  Ah, I am getting it now. Thanks for reporting it.
  This should do the trick:
  Acked-by: Randy Dunlap rdun...@xenotime.net
  That patch did the trick here too:
 
  Acked-by: Ingo Molnar mi...@elte.hu
  Except that i'm still seeing the occasional build failure - see the 
  error log below. Config attached.
  Much appreciate for the report. I believe this fix (which I think hit
  linux-next yesterday?) should do that:
 
  commit 1e9ea2656b656edd3c8de98675bbc0340211b5bd
  Author: Jeremy Fitzhardinge jer...@goop.org
  Date:   Wed Aug 3 09:43:44 2011 -0700
 
   xen/tracing: it looks like we wanted CONFIG_FTRACE
  
  Apparently we wanted CONFIG_FTRACE rather the CONFIG_FUNCTION_TRACER.
  
  Reported-by: Sander Eikelenboom li...@eikelenboom.it
  Tested-by: Sander Eikelenboom li...@eikelenboom.it
  Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
  Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 
  diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
  index 45e94ac..3326204 100644
  --- a/arch/x86/xen/Makefile
  +++ b/arch/x86/xen/Makefile
  @@ -15,7 +15,7 @@ obj-y := enlighten.o setup.o multicalls.o 
  mmu.o irq.o \
  grant-table.o suspend.o platform-pci-unplug.o \
  p2m.o
   
  -obj-$(CONFIG_FUNCTION_TRACER) += trace.o
  +obj-$(CONFIG_FTRACE) += trace.o
   
 
 I'm not sure this is correct either.  Maybe it should be
 CONFIG_TRACEPOINTS?  Steven?

Actually, I believe the correct answer is:

CONFIG_EVENT_TRACING

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 2/6] virtio/console: Add a failback for unstealable pipe buffer

2012-08-09 Thread Steven Rostedt

On Thu, 2012-08-09 at 18:24 +0900, Masami Hiramatsu wrote:

 Yeah, it is really easy to fix that.
 But out of curiosity, would that be really a problem?
 I guess that host can access any guest page if need. If that
 is right, is that really insecure to leak randomly allocated
 unused page to the host?

Yeah, it's like protecting userspace pages from the kernel ;-)

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/5] trace-cmd: Use TRACE_DIR envrionment variable if defined

2012-08-22 Thread Steven Rostedt

On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
 From: Masami Hiramatsu masami.hiramatsu...@hitachi.com
 
 Use TRACE_DIR environment variable for setting

TRACING_DIR would be better, as we are searching for /debug/tracing and
not /debug/trace. Perhaps DEBUG_TRACING_DIR would be even better as to
be less of a generic term.

-- Steve

 debugfs/tracing directory if defined. This is
 for controlling guest(or remote) ftrace.
 
 Signed-off-by: Masami Hiramatsu masami.hiramatsu...@hitachi.com
 Signed-off-by: Yoshihiro YUNOMAE yoshihiro.yunomae...@hitachi.com
 ---
 
  trace-util.c |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)
 
 diff --git a/trace-util.c b/trace-util.c
 index e128188..d5a3eb4 100644
 --- a/trace-util.c
 +++ b/trace-util.c
 @@ -311,6 +311,15 @@ char *tracecmd_find_tracing_dir(void)
   char type[100];
   FILE *fp;
   
 + tracing_dir = getenv(TRACE_DIR);
 + if (tracing_dir) {
 + tracing_dir = strdup(tracing_dir);
 + if (!tracing_dir)
 + die(malloc);
 + warning(Use environmental tracing directory: %s\n, 
 tracing_dir);
 + return tracing_dir;
 + }
 +
   if ((fp = fopen(/proc/mounts,r)) == NULL) {
   warning(Can't open /proc/mounts for read);
   return NULL;
 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs

2012-08-22 Thread Steven Rostedt

On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
 From: Masami Hiramatsu masami.hiramatsu...@hitachi.com
 
 Count debugfs/tracing/per_cpu/cpu* to determine the
 number of CPUs.

I'm curious, do you find that sysconf doesn't return the # of CPUs the
system has? I've had boxes where the per_cpu/cpu* had more cpus than the
box actually holds. But this was a bug in the kernel, not the tool. This
change log needs to have rational instead of just explaining what the
patch does.

Thanks,

-- Steve

 
 Signed-off-by: Masami Hiramatsu masami.hiramatsu...@hitachi.com
 Signed-off-by: Yoshihiro YUNOMAE yoshihiro.yunomae...@hitachi.com
 ---


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/5] trace-cmd: Support trace-agent of virtio-trace

2012-08-22 Thread Steven Rostedt

On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
 Add read path and control path to use trace-agent of virtio-trace.
 When we use trace-agent, trace-cmd will be used as follows:
   # AGENT_READ_DIR=/tmp/virtio-trace/tracing \
 AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
 TRACING_DIR=/tmp/virtio-trace/debugfs/tracing \\

Ha! You used TRACING_DIR but patch one introduces TRACE_DIR. Lets
change this to DEBUG_TRACING_DIR instead anyway.

Also, I don't like the generic environment variables. Perhaps
VIRTIO_TRACE_DIR, or AGENT_TRACE_DIR and AGENT_TRACE_CTL. Lets try to
keep the environment namespace sparse.

 trace-cmd record -e sched:*
 Here, AGENT_READ_DIR is the path for a reading directory of virtio-trace,
 AGENT_CTL is a control path of trace-agent, and TRACING_DIR is a debugfs path
 of a guest.
 
 Signed-off-by: Yoshihiro YUNOMAE yoshihiro.yunomae...@hitachi.com
 ---
 
  trace-cmd.h  |1 +
  trace-recorder.c |   57 
 +-
  trace-util.c |   18 +
  3 files changed, 75 insertions(+), 1 deletions(-)
 
 diff --git a/trace-cmd.h b/trace-cmd.h
 index f904dc5..75506ed 100644
 --- a/trace-cmd.h
 +++ b/trace-cmd.h
 @@ -72,6 +72,7 @@ static inline int tracecmd_host_bigendian(void)
  }
  
  char *tracecmd_find_tracing_dir(void);
 +char *guest_agent_tracing_read_dir(void);
  
  /* --- Opening and Reading the trace.dat file --- */
  
 diff --git a/trace-recorder.c b/trace-recorder.c
 index 215affc..3b750e9 100644
 --- a/trace-recorder.c
 +++ b/trace-recorder.c
 @@ -33,6 +33,7 @@
  #include unistd.h
  #include ctype.h
  #include errno.h
 +#include stdbool.h
  
  #include trace-cmd.h
  
 @@ -43,6 +44,8 @@ struct tracecmd_recorder {
   int page_size;
   int cpu;
   int stop;
 + int ctl_fd;
 + boolagent_existing;

Thanks for the reminder. I need to convert a lot to use 'bool' instead.

  };
  
  void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
 @@ -59,11 +62,29 @@ void tracecmd_free_recorder(struct tracecmd_recorder 
 *recorder)
   free(recorder);
  }
  
 +static char *use_trace_agent_dir(char *ctl_path,
 + struct tracecmd_recorder *recorder)
 +{
 + ctl_path = strdup(ctl_path);
 + if (!ctl_path)
 + die(malloc);
 + warning(Use environmental control path: %s\n, ctl_path);

s/Use/Using/

-- Steve

 +
 + recorder-ctl_fd = open(ctl_path, O_WRONLY);
 + if (recorder-ctl_fd  0)
 + return NULL;
 +
 + recorder-agent_existing = true;
 +
 + return guest_agent_tracing_read_dir();
 +}
 +


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs

2012-08-23 Thread Steven Rostedt

On Thu, 2012-08-23 at 12:00 +0900, Masami Hiramatsu wrote:
 (2012/08/23 11:01), Masami Hiramatsu wrote:
  (2012/08/22 22:41), Steven Rostedt wrote:
  On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
  From: Masami Hiramatsu masami.hiramatsu...@hitachi.com
 
  Count debugfs/tracing/per_cpu/cpu* to determine the
  number of CPUs.
 
  I'm curious, do you find that sysconf doesn't return the # of CPUs the
  system has?
  
  No, sysconf returns the number of hosts CPUs, not guests.
  
  I've had boxes where the per_cpu/cpu* had more cpus than the
  box actually holds. But this was a bug in the kernel, not the tool. This
  change log needs to have rational instead of just explaining what the
  patch does.
  
  Ah, I see. Hmm, then this should be enabled by a command line
  option or an environment variable.
 
 Oops, I misunderstood. I'll add more comment for why this
 should be tried instead of sysconf.

And now that I understand why you are doing this, why not only do this
if the TRACE_AGENT or DEBUG_TRACING_DIR is defined. That is, if we are
doing it against a bare metal system, then sysconf should suffice, but
if we are tracing against a guest, then it should use the tracing
directory to determine the buffers.

We could add options to override this, but I would think the default
should just Do The Right Thing(tm).

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: Readonly GDT

2013-04-09 Thread Steven Rostedt

On Tue, 2013-04-09 at 17:43 -0700, H. Peter Anvin wrote:
 OK, thinking about the GDT here.
 
 The GDT is quite small -- 256 bytes on i386, 128 bytes on x86-64.  As
 such, we probably don't want to allocate a full page to it for only
 that.  This means that in order to create a readonly mapping we have to
 pack GDTs from different CPUs together in the same pages, *or* we
 tolerate that other things on the same page gets reflected in the same
 mapping.

What about grouping via nodes?

 
 However, the packing solution has the advantage of reducing address
 space consumption which matters on 32 bits: even on i386 we can easily
 burn a megabyte of address space for 4096 processors, but burning 16
 megabytes starts to hurt.

Having 4096 32 bit processors, you deserve what you get. ;-)

-- Steve

 
 It would be important to measure the performance impact on task switch,
 though.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] tracing/events: Add bounce tracing to swiotbl-xen

2013-08-22 Thread Steven Rostedt

On Thu, 22 Aug 2013 22:47:28 +0100
Zoltan Kiss zoltan.k...@citrix.com wrote:


  /*
   * Used to do a quick range check in swiotlb_tbl_unmap_single and
   * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by 
 this
 @@ -358,6 +362,9 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
 struct page *page,
   /*
* Oh well, have to allocate and map a bounce buffer.
*/
 +
 + trace_bounced(dev, dev_addr, size, swiotlb_force);

Please use a more specific name. bounce is too generic. I know
tracepoints are grouped by systems, but its easier for tools to just
state a tracepoint name than the system:event pair.

trace_xen_bounced()
 or
trace_swiotlb_bounced()

Something other than just bounced.

Thanks!

-- Steve

 +
   map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
   if (map == SWIOTLB_MAP_ERROR)
   return DMA_ERROR_CODE;
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCHv2] tracing/events: Add bounce tracing to swiotbl

2013-09-25 Thread Steven Rostedt

On Wed, 25 Sep 2013 13:56:49 -0400
Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:

 .. snip..
  +  TP_printk(dev_name: %s dma_mask=%llx dev_addr=%llx 
  +  size=%zu swiotlb_force=%x,
  +  __get_str(dev_name),
  +  __entry-dma_mask,
  +  (unsigned long long)__entry-dev_addr,
  +  __entry-size,
  +  __entry-swiotlb_force)
 
 Would it make sense to do something like this:
 
   __entry-swiotlb_force ? swiotlb_force : )

I think that's fine. I do believe that the libtraceevents can parse the
?: syntax.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH -tip RFC 0/2] kprobes: introduce NOKPROBE_SYMBOL() and prohibit probing on .entry.text

2013-11-11 Thread Steven Rostedt

On Tue, 12 Nov 2013 02:18:53 +0900
Masami Hiramatsu masami.hiramatsu...@hitachi.com wrote:
 
  After that we can convert all the rest, probably as part of this series.
 
 OK, I'll do. :)
 BTW, converting all the __kprobes involves many archs, which
 kprobes ported. In that case, which mailing-list would better me
 to post the series, linux-arch?

I would add linux-arch.

Note, you may need to support both ways for the current time being, as
new __kprobes are being added (I've seen several in patches flying by
in LKML).

But perhaps at a later -rc we convert the rest and discontinue them.
That way, linux-next will break all new __kprobes, and that should get
them fixed before we enter 3.14-rc.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH -tip RFC v2 01/22] kprobes: Prohibit probing on .entry.text code

2013-11-15 Thread Steven Rostedt

On Fri, 15 Nov 2013 04:53:18 +
Masami Hiramatsu masami.hiramatsu...@hitachi.com wrote:

 .entry.text is a code area which is used for interrupt/syscall
 entries, and there are many sensitive codes.
 Thus, it is better to prohibit probing on all of such codes
 instead of a part of that.
 Since some symbols are already registered on kprobe blacklist,
 this also removes them from the blacklist.

This change only works with x86. On other archs, I get this:

kernel/built-in.o: In function `register_kprobe':
(.kprobes.text+0x9f4): undefined reference to `__entry_text_start'
kernel/built-in.o: In function `register_kprobe':
(.kprobes.text+0x9f8): undefined reference to `__entry_text_end'
make[1]: *** [vmlinux] Error 1
make: *** [sub-make] Error 2

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH -tip v3 00/23] kprobes: introduce NOKPROBE_SYMBOL() and general cleaning of kprobe blacklist

2013-11-20 Thread Steven Rostedt

On Wed, 20 Nov 2013 12:36:00 -0500
Frank Ch. Eigler f...@redhat.com wrote:

 Hi -
 
   Does this new blacklist cover enough that the kernel now survives a 
   broadly wildcarded perf-probe, e.g. over e.g. all of its kallsyms?
  
  That's generally the purpose of the annotations - if it doesn't then 
  that's a bug.
 
 AFAIK, no kernel since kprobes was introduced has ever stood up to
 that test.  perf probe lacks the wildcarding powers of systemtap, so
 one needs to resort to something like:
 
 # cat /proc/kallsyms | grep ' [tT] ' | while read addr type symbol; do
perf probe $symbol
 done

I'm curious to why one would do that. IIUC, perf now has function
tracing support.

-- Steve

 
 then wait for a few hours for that to finish. Then, or while the loop
 is still running, run
 
 # perf record -e 'probe:*' -aR sleep 1
 
 to take a kernel down.
 
 
 - FChE

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions

2016-05-25 Thread Steven Rostedt


Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
after enabling function tracer. I asked him to bisect the functions within
available_filter_functions, which he did and it came down to three:

 _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()

It was found that this is only an issue when noreplace-paravirt is added to
the kernel command line.

This means that those functions are most likely called within critical
sections of the funtion tracer, and must not be traced.

In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
longer an issue. But both _paravirt_ident_{32,64}() causes the following
splat when they are traced:

 mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054)
 mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070)
 mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054)
 mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054)
 NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
 Modules linked in: e1000e
 CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 
05/07/2012
 task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000
 RIP: 0010:[]  [] 
queued_spin_lock_slowpath+0x118/0x1a0
 RSP: 0018:8800d4aefb90  EFLAGS: 0246
 RAX:  RBX:  RCX: 88011eb16d40
 RDX: 82485760 RSI: 1f288820 RDI: ea008030
 RBP: 8800d4aefb90 R08: 000c R09: 
 R10: 821c8e0e R11:  R12: 88200fb8
 R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0
 FS:  7f7a4e3d7840() GS:88011eb0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0
 Stack:
  8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b
  8800d4aefcb0 8800d4aefbd0 811bf4cb 0002
  0015 8800d2276050 8000c0fd8867 ea008030
 Call Trace:
  [] _raw_spin_lock+0x27/0x30
  [] handle_pte_fault+0x13db/0x16b0
  [] ? function_trace_call+0x15b/0x180
  [] ? handle_pte_fault+0x5/0x16b0
  [] handle_mm_fault+0x312/0x670
  [] ? find_vma+0x68/0x70
  [] __do_page_fault+0x1b1/0x4e0
  [] do_page_fault+0x22/0x30
  [] page_fault+0x28/0x30
  [] ? copy_user_enhanced_fast_string+0x5/0x10
  [] ? seq_read+0x305/0x370
  [] __vfs_read+0x28/0xe0
  [] ? __vfs_read+0x5/0xe0
  [] ? __vfs_read+0x5/0xe0
  [] vfs_read+0x86/0x130
  [] SyS_read+0x46/0xa0
  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
 Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14
 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7
 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b

Reported-by: Łukasz Daniluk <lukasz.dani...@intel.com>
Signed-off-by: Steven Rostedt <rost...@goodmis.org>

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index f08ac28b8136..f975d226be6e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n"
  ".popsection");
 
 /* identity function, which can be inlined */
-u32 _paravirt_ident_32(u32 x)
+u32 notrace _paravirt_ident_32(u32 x)
 {
return x;
 }
 
-u64 _paravirt_ident_64(u64 x)
+u64 notrace _paravirt_ident_64(u64 x)
 {
return x;
 }
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-05 Thread Steven Rostedt

On Wed,  4 Oct 2017 14:19:56 -0700
Thomas Garnier  wrote:

> When using -fPIE/PIC with function tracing, the compiler generates a
> call through the GOT (call *__fentry__@GOTPCREL). This instruction
> takes 6 bytes instead of 5 on the usual relative call.
> 
> With this change, function tracing supports 6 bytes on traceable
> function and can still replace relative calls on the ftrace assembly
> functions.
> 
> Position Independent Executable (PIE) support will allow to extended the
> KASLR randomization range below the -2G memory limit.

Question: This 6 bytes is only the initial call that gcc creates. When
function tracing is enabled, the calls are back to the normal call to
the ftrace trampoline?

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-05 Thread Steven Rostedt

On Thu, 5 Oct 2017 09:01:14 -0700
Thomas Garnier <thgar...@google.com> wrote:

> On Thu, Oct 5, 2017 at 6:06 AM, Steven Rostedt <rost...@goodmis.org> wrote:
> > On Wed,  4 Oct 2017 14:19:56 -0700
> > Thomas Garnier <thgar...@google.com> wrote:
> >  
> >> When using -fPIE/PIC with function tracing, the compiler generates a
> >> call through the GOT (call *__fentry__@GOTPCREL). This instruction
> >> takes 6 bytes instead of 5 on the usual relative call.
> >>
> >> With this change, function tracing supports 6 bytes on traceable
> >> function and can still replace relative calls on the ftrace assembly
> >> functions.
> >>
> >> Position Independent Executable (PIE) support will allow to extended the
> >> KASLR randomization range below the -2G memory limit.  
> >
> > Question: This 6 bytes is only the initial call that gcc creates. When
> > function tracing is enabled, the calls are back to the normal call to
> > the ftrace trampoline?  
> 
> That is correct.
> 

Then I think a better idea is to simply nop them out at compile time,
and have the code that updates them to nops to know about it.

See scripts/recordmcount.c

Could we simply add a 5 byte nop followed by a 1 byte nop, and treat it
the same as if it didn't exist? This code can be a little complex, and
can cause really nasty side effects if things go wrong. I would like to
keep from adding more variables to the changes here.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH] x86/vmware: Do not trace vmware_sched_clock()

2018-11-09 Thread Steven Rostedt

From: Steven Rostedt (VMware) 

When running function tracing on a Linux guest running on VMware
Workstation, the guest would crash. This is due to tracing of the
sched_clock internal call of the VMware vmware_sched_clock(), which
causes an infinite recursion within the tracing code (clock calls must
not be traced).

Make vmware_sched_clock() not traced by ftrace.

Cc: sta...@vger.kernel.org
Fixes: 80e9a4f21fd7c ("x86/vmware: Add paravirt sched clock")
Reported-by: GwanYeong Kim 
Signed-off-by: Steven Rostedt (VMware) 
---
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index d9ab49bed8af..0eda91f8eeac 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -77,7 +77,7 @@ static __init int setup_vmw_sched_clock(char *s)
 }
 early_param("no-vmw-sched-clock", setup_vmw_sched_clock);
 
-static unsigned long long vmware_sched_clock(void)
+static unsigned long long notrace vmware_sched_clock(void)
 {
unsigned long long ns;
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: netconsole deadlock with virtnet

2020-11-23 Thread Steven Rostedt

On Mon, 23 Nov 2020 13:08:55 +0200
Leon Romanovsky  wrote:


>  [   10.028024] Chain exists of:
>  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while
holding target_list_lock (and possibly console_owner).


>  [   10.028028]
>  [   10.028028]  Possible interrupt unsafe locking scenario:
>  [   10.028029]
>  [   10.028029]CPU0CPU1
>  [   10.028030]
>  [   10.028030]   lock(_xmit_ETHER#2);
>  [   10.028032]local_irq_disable();
>  [   10.028032]lock(console_owner);
>  [   10.028034]lock(target_list_lock);
>  [   10.028035]   
>  [   10.028035] lock(console_owner);
>  [   10.028036]
>  [   10.028037]  *** DEADLOCK ***
>  [   10.028037]



>  [   10.028107] the dependencies between the lock to be acquired
>  [   10.028107]  and HARDIRQ-irq-unsafe lock:
>  [   10.028108] -> (_xmit_ETHER#2){+.-.}-{2:2} ops: 217 {
>  [   10.028110]HARDIRQ-ON-W at:
>  [   10.028111]__lock_acquire+0x8bc/0x1a94
>  [   10.028111]lock_acquire.part.0+0x170/0x360
>  [   10.028112]lock_acquire+0x68/0x8c
>  [   10.028113]_raw_spin_trylock+0x80/0xd0
>  [   10.028113]virtnet_poll+0xac/0x360

xmit_lock is taken in virtnet_poll() (via virtnet_poll_cleantx()).

This is called from the softirq, and interrupts are not disabled.

>  [   10.028114]net_rx_action+0x1b0/0x4e0
>  [   10.028115]__do_softirq+0x1f4/0x638
>  [   10.028115]do_softirq+0xb8/0xcc
>  [   10.028116]__local_bh_enable_ip+0x18c/0x200
>  [   10.028116]virtnet_napi_enable+0xc0/0xd4
>  [   10.028117]virtnet_open+0x98/0x1c0
>  [   10.028118]__dev_open+0x12c/0x200
>  [   10.028118]__dev_change_flags+0x1a0/0x220
>  [   10.028119]dev_change_flags+0x2c/0x70
>  [   10.028119]do_setlink+0x214/0xe20
>  [   10.028120]__rtnl_newlink+0x514/0x820
>  [   10.028120]rtnl_newlink+0x58/0x84
>  [   10.028121]rtnetlink_rcv_msg+0x184/0x4b4
>  [   10.028122]netlink_rcv_skb+0x60/0x124
>  [   10.028122]rtnetlink_rcv+0x20/0x30
>  [   10.028123]netlink_unicast+0x1b4/0x270
>  [   10.028124]netlink_sendmsg+0x1f0/0x400
>  [   10.028124]sock_sendmsg+0x5c/0x70
>  [   10.028125]sys_sendmsg+0x24c/0x280
>  [   10.028125]___sys_sendmsg+0x88/0xd0
>  [   10.028126]__sys_sendmsg+0x70/0xd0
>  [   10.028127]__arm64_sys_sendmsg+0x2c/0x40
>  [   10.028128]el0_svc_common.constprop.0+0x84/0x200
>  [   10.028128]do_el0_svc+0x2c/0x90
>  [   10.028129]el0_svc+0x18/0x50
>  [   10.028129]el0_sync_handler+0xe0/0x350
>  [   10.028130]el0_sync+0x158/0x180

[..]

>  [   10.028171]  ... key  at: [] 
> netdev_xmit_lock_key+0x10/0x390
>  [   10.028171]  ... acquired at:
>  [   10.028172]__lock_acquire+0x134c/0x1a94
>  [   10.028172]lock_acquire.part.0+0x170/0x360
>  [   10.028173]lock_acquire+0x68/0x8c
>  [   10.028173]_raw_spin_lock+0x64/0x90
>  [   10.028174]virtnet_poll_tx+0x84/0x120
>  [   10.028174]netpoll_poll_dev+0x12c/0x350
>  [   10.028175]netpoll_send_skb+0x39c/0x400
>  [   10.028175]netpoll_send_udp+0x2b8/0x440
>  [   10.028176]write_msg+0xfc/0x120 [netconsole]
>  [   10.028176]console_unlock+0x3ec/0x6a4

The above shows the problem. We have:

console_unlock() (which holds the console_owner lock)
write_msg() (which holds the target_list_lock)

Then we write_msg() calls:

netpoll_send_udp() {
  netpoll_send_skb() {
netpoll_poll_dev() {
  virtnet_poll_tx() (which takes the xmit_lock!)

  DEADLOCK!


In netpoll_send_skb() I see this:

/* tickle device maybe there is some cleanup */
netpoll_poll_dev(np->dev);

Which looks to me that it will call some code that should only be used in
softirq context. It's called with locks held that are taken in interrupt
context, and any locks that are taken in netpoll_poll_dev() must always be
taken with interrupts disabled. That is, if xmit_lock is taken within
netpoll_poll_dev(), then it must always be taken with interrupts disabled.
Otherwise you can have the deadlock that lockdep reported.

-- Steve




>  [   10.028177]

Re: netconsole deadlock with virtnet

2020-11-24 Thread Steven Rostedt

On Tue, 24 Nov 2020 11:22:03 +0800
Jason Wang  wrote:

> Btw, have a quick search, there are several other drivers that uses tx 
> lock in the tx NAPI.

tx NAPI is not the issue. The issue is that write_msg() (in netconsole.c)
calls this polling logic with the target_list_lock held.

Are those other drivers called by netconsole? If not, then this is unique
to virtio-net.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: netconsole deadlock with virtnet

2020-11-23 Thread Steven Rostedt

On Mon, 23 Nov 2020 10:52:52 -0800
Jakub Kicinski  wrote:

> On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:
> > On Mon, 23 Nov 2020 13:08:55 +0200
> > Leon Romanovsky  wrote:
> > 
> >   
> > >  [   10.028024] Chain exists of:
> > >  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2
> > 
> > Note, the problem is that we have a location that grabs the xmit_lock while
> > holding target_list_lock (and possibly console_owner).  
> 
> Well, it try_locks the xmit_lock. Does lockdep understand try-locks?
> 
> (not that I condone the shenanigans that are going on here)

Does it?

virtnet_poll_tx() {
__netif_tx_lock() {
spin_lock(>_xmit_lock);

That looks like we can have:


CPU0CPU1

   lock(xmit_lock)

lock(console)
lock(target_list_lock)
__netif_tx_lock()
lock(xmit_lock);

[BLOCKED]

   
   lock(console)

   [BLOCKED]



 DEADLOCK.


So where is the trylock here?

Perhaps you need the trylock in virtnet_poll_tx()?

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: netconsole deadlock with virtnet

2020-11-18 Thread Steven Rostedt

[ Adding netdev as perhaps someone there knows ]

On Wed, 18 Nov 2020 12:09:59 +0800
Jason Wang  wrote:

> > This CPU0 lock(_xmit_ETHER#2) -> hard IRQ -> lock(console_owner) is
> > basically
> > soft IRQ -> lock(_xmit_ETHER#2) -> hard IRQ -> printk()
> >
> > Then CPU1 spins on xmit, which is owned by CPU0, CPU0 spins on
> > console_owner, which is owned by CPU1?  

It still looks to me that the target_list_lock is taken in IRQ, (which can
be the case because printk calls write_msg() which takes that lock). And
someplace there's a:

lock(target_list_lock)
lock(xmit_lock)

which means you can remove the console lock from this scenario completely,
and you still have a possible deadlock between target_list_lock and
xmit_lock.

> 
> 
> If this is true, it looks not a virtio-net specific issue but somewhere 
> else.
> 
> I think all network driver will synchronize through bh instead of hardirq.

I think the issue is where target_list_lock is held when we take xmit_lock.
Is there anywhere in netconsole.c that can end up taking xmit_lock while
holding the target_list_lock? If so, that's the problem. As
target_list_lock is something that can be taken in IRQ context, which means
*any* other lock that is taking while holding the target_list_lock must
also protect against interrupts from happening while it they are held.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: netconsole deadlock with virtnet

2020-11-17 Thread Steven Rostedt

On Tue, 17 Nov 2020 12:23:41 +0200
Leon Romanovsky  wrote:

> Hi,
> 
> Approximately two weeks ago, our regression team started to experience those
> netconsole splats. The tested code is Linus's master (-rc4) + netdev net-next
> + netdev net-rc.
> 
> Such splats are random and we can't bisect because there is no stable 
> reproducer.
> 
> Any idea, what is the root cause?
> 
> [   21.141117] netpoll: netconsole: local port 
> [   21.141947] netpoll: netconsole: local IPv4 address 10.236.251.7
> [   21.143138] netpoll: netconsole: interface 'eth1'
> [   21.143988] netpoll: netconsole: remote port 62517
> [   21.145127] netpoll: netconsole: remote IPv4 address 10.236.251.1
> [   21.146206] netpoll: netconsole: remote ethernet address 68:05:ca:aa:99:49
> [   21.149464] =
> [   21.149466] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> [   21.149467] 5.10.0-rc4_for_upstream_min_debug_2020_11_16_13_03 #1 Not 
> tainted
> [   21.149468] -
> [   21.149469] modprobe/596 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [   21.149471] 000194b4e298 (_xmit_ETHER#2){+.-.}-{2:2}, at: 
> virtnet_poll_tx+0x84/0x120
> [   21.149478] and this task is already holding:
> [   21.149478] 893d4018 (target_list_lock){}-{2:2}, at: 
> write_msg+0x6c/0x120 [netconsole]
> [   21.149483] which would create a new lock dependency:
> [   21.149484]  (target_list_lock){}-{2:2} -> (_xmit_ETHER#2){+.-.}-{2:2}
> [   21.149491] but this new dependency connects a HARDIRQ-irq-safe lock:
> [   21.149492]  (console_owner){-.-.}-{0:0}
> [   21.149496] ... which became HARDIRQ-irq-safe at:
> [   21.149496]   __lock_acquire+0xa78/0x1a94
> [   21.149498]   lock_acquire.part.0+0x170/0x360
> [   21.149498]   lock_acquire+0x68/0x8c
> [   21.149500]   console_unlock+0x1e8/0x6a4
> [   21.149500]   vprintk_emit+0x1c4/0x3c4
> [   21.149501]   vprintk_default+0x40/0x4c
> [   21.149502]   vprintk_func+0x10c/0x220
> [   21.149503]   printk+0x68/0x90
> [   21.149504]   crng_fast_load+0x1bc/0x1c0
> [   21.149505]   add_interrupt_randomness+0x280/0x290
> [   21.149506]   handle_irq_event+0x80/0x120
> [   21.149507]   handle_fasteoi_irq+0xac/0x200
> [   21.149508]   __handle_domain_irq+0x84/0xf0
> [   21.149509]   gic_handle_irq+0xd4/0x320
> [   21.149510]   el1_irq+0xd0/0x180
> [   21.149511]   arch_cpu_idle+0x24/0x44
> [   21.149512]   default_idle_call+0x48/0xa0
> [   21.149513]   do_idle+0x260/0x300
> [   21.149514]   cpu_startup_entry+0x30/0x6c
> [   21.149515]   rest_init+0x1b4/0x288
> [   21.149515]   arch_call_rest_init+0x18/0x24
> [   21.149516]   start_kernel+0x5cc/0x608
> [   21.149518] to a HARDIRQ-irq-unsafe lock:
> [   21.149519]  (_xmit_ETHER#2){+.-.}-{2:2}
> [   21.149523] ... which became HARDIRQ-irq-unsafe at:
> [   21.149524] ...  __lock_acquire+0x8bc/0x1a94
> [   21.149525]   lock_acquire.part.0+0x170/0x360
> [   21.149526]   lock_acquire+0x68/0x8c
> [   21.149527]   _raw_spin_trylock+0x80/0xd0
> [   21.149527]   virtnet_poll+0xac/0x360
> [   21.149528]   net_rx_action+0x1b0/0x4e0
> [   21.149529]   __do_softirq+0x1f4/0x638
> [   21.149530]   do_softirq+0xb8/0xcc
> [   21.149531]   __local_bh_enable_ip+0x18c/0x200
> [   21.149532]   virtnet_napi_enable+0xc0/0xd4
> [   21.149533]   virtnet_open+0x98/0x1c0
> [   21.149534]   __dev_open+0x12c/0x200
> [   21.149535]   __dev_change_flags+0x1a0/0x220
> [   21.149536]   dev_change_flags+0x2c/0x70
> [   21.149536]   do_setlink+0x214/0xe20
> [   21.149537]   __rtnl_newlink+0x514/0x820
> [   21.149538]   rtnl_newlink+0x58/0x84
> [   21.149539]   rtnetlink_rcv_msg+0x184/0x4b4
> [   21.149540]   netlink_rcv_skb+0x60/0x124
> [   21.149541]   rtnetlink_rcv+0x20/0x30
> [   21.149542]   netlink_unicast+0x1b4/0x270
> [   21.149543]   netlink_sendmsg+0x1f0/0x400
> [   21.149544]   sock_sendmsg+0x5c/0x70
> [   21.149545]   sys_sendmsg+0x24c/0x280
> [   21.149545]   ___sys_sendmsg+0x88/0xd0
> [   21.149546]   __sys_sendmsg+0x70/0xd0
> [   21.149547]   __arm64_sys_sendmsg+0x2c/0x40
> [   21.149548]   el0_svc_common.constprop.0+0x84/0x200
> [   21.149549]   do_el0_svc+0x2c/0x90
> [   21.149550]   el0_svc+0x18/0x50
> [   21.149551]   el0_sync_handler+0xe0/0x350
> [   21.149552]   el0_sync+0x158/0x180
> [   21.149553] other info that might help us debug this:
> [   21.149555] Chain exists of:
> [   21.149556]   console_owner --> target_list_lock --> _xmit_ETHER#2

So somewhere we have target_list_lock taken while holding the console_owner
lock (which can happen in interrupt). And what this is saying,
target_list_lock is held somewhere where _xmit_ETHER#2 is taken (which I'm
guessing from the back traces is the xmit_lock taken in virtnet_poll_tx().

Thus, you can have a deadlock with three CPUs (I need to update the lockdep
output to make it proper one of these days).

CPU0CPU1CPU2

Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-13 Thread Steven Rostedt

On Thu, 13 May 2021 16:57:34 +0100
Stefan Hajnoczi  wrote:

> This approach relies on process hierarchy of the VMM (QEMU).
> Multi-process QEMU is in development and will allow VIRTIO devices to
> run as separate processes from the main QEMU. It then becomes harder to
> correlate a VIRTIO device process with its QEMU process.

And we need to know all these mapping regardless, as we need to map each
thread / process to the vCPU in order to correlate between host thread and
vCPU thread for showing in KernelShark.

Thus this mapping to find the main thread/process needs to be done
regardless.

> 
> So I think in the end this approach ends up being as fragile as parsing
> command-lines. The kernel doesn't really have the concept of a "VM" that
> the vhost_vsock is associated with :). Maybe just parse QEMU and crosvm
> command-lines?
>

That's what we do now, and it already broke once, and even parsing the
command line wont be enough for the stated reasons above.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-07 Thread Steven Rostedt

On Fri, 7 May 2021 16:11:20 +0200
Stefano Garzarella  wrote:

> Hi Steven,
> 
> On Wed, May 05, 2021 at 04:38:55PM -0400, Steven Rostedt wrote:
> >The new trace-cmd 3.0 (which is almost ready to be released) allows for
> >tracing between host and guests with timestamp synchronization such that
> >the events on the host and the guest can be interleaved in the proper order
> >that they occur. KernelShark now has a plugin that visualizes this
> >interaction.
> >
> >The implementation requires that the guest has a vsock CID assigned, and on
> >the guest a "trace-cmd agent" is running, that will listen on a port for
> >the CID. The on the host a "trace-cmd record -A guest@cid:port -e events"
> >can be called and the host will connect to the guest agent through the
> >cid/port pair and have the agent enable tracing on behalf of the host and
> >send the trace data back down to it.
> >
> >The problem is that there is no sure fire way to find the CID for a guest.
> >Currently, the user must know the cid, or we have a hack that looks for the
> >qemu process and parses the --guest-cid parameter from it. But this is
> >prone to error and does not work on other implementation (was told that
> >crosvm does not use qemu).  
> 
> For debug I think could be useful to link the vhost-vsock kthread to the 
> CID, but for the user point of view, maybe is better to query the VM 
> management layer, for example if you're using libvirt, you can easily do:
> 
> $ virsh dumpxml fedora34 | grep cid
>  

We looked into going this route, but then that means trace-cmd host/guest
tracing needs a way to handle every layer, as some people use libvirt
(myself included), some people use straight qemu, some people us Xen, and
some people use crosvm. We need to support all of them. Which is why I'm
looking at doing this from the lowest common denominator, and since vsock
is a requirement from trace-cmd to do this tracing, getting the thread
that's related to the vsock is that lowest denominator.

> 
> >
> >As I can not find a way to discover CIDs assigned to guests via any kernel
> >interface, I decided to create this one. Note, I'm not attached to it. If
> >there's a better way to do this, I would love to have it. But since I'm not
> >an expert in the networking layer nor virtio, I decided to stick to what I
> >know and add a debugfs interface that simply lists all the registered 
> >CIDs
> >and the worker task that they are associated with. The worker task at
> >least has the PID of the task it represents.  
> 
> I honestly don't know if it's the best interface, like I said maybe for 
> debugging it's fine, but if we want to expose it to the user in some 
> way, we could support devlink/netlink to provide information about the 
> vsock devices currently in use.

Ideally, a devlink/netlink is the right approach. I just had no idea on how
to implement that ;-)  So I went with what I know, which is debugfs files!



> >Signed-off-by: Steven Rostedt (VMware) 
> >---
> >diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >index 5e78fb719602..4f03b25b23c1 100644
> >--- a/drivers/vhost/vsock.c
> >+++ b/drivers/vhost/vsock.c
> >@@ -15,6 +15,7 @@
> > #include 
> > #include 
> > #include 
> >+#include 
> >
> > #include 
> > #include "vhost.h"
> >@@ -900,6 +901,128 @@ static struct miscdevice vhost_vsock_misc = {
> > .fops = _vsock_fops,
> > };
> >
> >+static struct dentry *vsock_file;
> >+
> >+struct vsock_file_iter {
> >+struct hlist_node   *node;
> >+int index;
> >+};
> >+
> >+
> >+static void *vsock_next(struct seq_file *m, void *v, loff_t *pos)
> >+{
> >+struct vsock_file_iter *iter = v;
> >+struct vhost_vsock *vsock;
> >+
> >+if (pos)
> >+(*pos)++;
> >+
> >+if (iter->index >= (int)HASH_SIZE(vhost_vsock_hash))
> >+return NULL;
> >+
> >+if (iter->node)
> >+iter->node = rcu_dereference_raw(hlist_next_rcu(iter->node));
> >+
> >+for (;;) {
> >+if (iter->node) {
> >+vsock = 
> >hlist_entry_safe(rcu_dereference_raw(iter->node),
> >+ struct vhost_vsock, hash);
> >+if (vsock->guest_cid)
> >+break;
> >+iter->node = 
> >rcu_dereference_raw(hlist_next_rcu(iter->node));
> >+continue;
> >+}
> >

Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-07 Thread Steven Rostedt

On Fri, 7 May 2021 17:43:32 +0200
Stefano Garzarella  wrote:

> >The start/stop of a seq_file() is made for taking locks. I do this with all
> >my code in ftrace. Yeah, there's a while loop between the two, but that's
> >just to fill the buffer. It's not that long and it never goes to userspace
> >between the two. You can even use this for spin locks (but I wouldn't
> >recommend doing it for raw ones).  
> 
> Ah okay, thanks for the clarification!
> 
> I was worried because building with `make C=2` I had these warnings:
> 
> ../drivers/vhost/vsock.c:944:13: warning: context imbalance in 'vsock_start' 
> - wrong count at exit
> ../drivers/vhost/vsock.c:963:13: warning: context imbalance in 'vsock_stop' - 
> unexpected unlock
> 
> Maybe we need to annotate the functions somehow.

Yep, I it should have been.

static void *vsock_start(struct seq_file *m, loff_t *pos)
__acquires(rcu)
{
[...]

}

static void vsock_stop(struct seq_file *m, void *p)
__releases(rcu)
{
[...]
}

static int vsock_show(struct seq_file *m, void *v)
__must_hold(rcu)
{
[...]
}


And guess what? I just copied those annotations from sock_hash_seq_start(),
sock_hash_seq_show() and sock_hash_seq_stop() from net/core/sock_map.c
which is doing exactly the same thing ;-)

So there's definitely precedence for this.

> 
> >  
> >>  
> >> >+
> >> >+ iter->index = -1;
> >> >+ iter->node = NULL;
> >> >+ t = vsock_next(m, iter, NULL);
> >> >+
> >> >+ for (; iter->index < HASH_SIZE(vhost_vsock_hash) && l < *pos;
> >> >+  t = vsock_next(m, iter, ))
> >> >+ ;  
> >>
> >> A while() maybe was more readable...  
> >
> >Again, I just cut and pasted from my other code.
> >
> >If you have a good idea on how to implement this with netlink (something
> >that ss or netstat can dislpay), I think that's the best way to go.  
> 
> Okay, I'll take a look and get back to you.
> If it's too complicated, we can go ahead with this patch.

Awesome, thanks!

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-05 Thread Steven Rostedt

The new trace-cmd 3.0 (which is almost ready to be released) allows for
tracing between host and guests with timestamp synchronization such that
the events on the host and the guest can be interleaved in the proper order
that they occur. KernelShark now has a plugin that visualizes this
interaction.

The implementation requires that the guest has a vsock CID assigned, and on
the guest a "trace-cmd agent" is running, that will listen on a port for
the CID. The on the host a "trace-cmd record -A guest@cid:port -e events"
can be called and the host will connect to the guest agent through the
cid/port pair and have the agent enable tracing on behalf of the host and
send the trace data back down to it.

The problem is that there is no sure fire way to find the CID for a guest.
Currently, the user must know the cid, or we have a hack that looks for the
qemu process and parses the --guest-cid parameter from it. But this is
prone to error and does not work on other implementation (was told that
crosvm does not use qemu).

As I can not find a way to discover CIDs assigned to guests via any kernel
interface, I decided to create this one. Note, I'm not attached to it. If
there's a better way to do this, I would love to have it. But since I'm not
an expert in the networking layer nor virtio, I decided to stick to what I
know and add a debugfs interface that simply lists all the registered CIDs
and the worker task that they are associated with. The worker task at
least has the PID of the task it represents.

Now I can find the cid / host process in charge of the guest pair:

  # cat /sys/kernel/debug/vsock_list
  3 vhost-1954:2002

  # ps aux | grep 1954
  qemu1954  9.9 21.3 1629092 796148 ?  Sl   16:22   0:58  
/usr/bin/qemu-kvm -name guest=Fedora21,debug-threads=on -S -object 
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-Fedora21/master-key.aes
 -machine pc-1.2,accel=kvm,usb=off,dump-guest-core=off -cpu qemu64 -m 1000 
-overcommit mem-lock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 
1eefeeb0-3ac7-07c1-926e-236908313b4c -no-user-config -nodefaults -chardev 
socket,id=charmonitor,fd=32,server,nowait -mon 
chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot 
strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -blockdev 
{"driver":"host_device","filename":"/dev/mapper/vg_bxtest-GuestFedora","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}
 -blockdev 
{"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"}
 -device ide-hd,bus=ide.0,unit=0,drive=libvirt-1-
 format,id=ide0-0-0,bootindex=1 -netdev tap,fd=34,id=hostnet0 -device 
rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:9f:e9:d5,bus=pci.0,addr=0x3 
-netdev tap,fd=35,id=hostnet1 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:ec:dc:6e,bus=pci.0,addr=0x5 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-chardev 
pipe,id=charchannel0,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu0 
-device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=trace-pipe-cpu0
 -chardev 
pipe,id=charchannel1,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu1 
-device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=trace-pipe-cpu1
 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -sandbox 
on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -device 
vhost-vsock-pci,id=vsock0,guest-cid=3,vhostfd=16,bus=pci.0,addr=0x7 -msg 
 timestamp=on
  root2000  0.0  0.0  0 0 ?S16:22   0:00 
[kvm-pit/1954]
  root2002  0.0  0.0  0 0 ?S16:22   0:00 
[vhost-1954]


This is just an example of what I'm looking for. Just a way to find what
process is using what cid.

Signed-off-by: Steven Rostedt (VMware) 
---
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 5e78fb719602..4f03b25b23c1 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include "vhost.h"
@@ -900,6 +901,128 @@ static struct miscdevice vhost_vsock_misc = {
.fops = _vsock_fops,
 };
 
+static struct dentry *vsock_file;
+
+struct vsock_file_iter {
+   struct hlist_node   *node;
+   int index;
+};
+
+
+static void *vsock_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   struct vsock_file_iter *iter = v;
+   struct vhost_vsock *vsock;
+
+   if (pos)
+   (*pos)++;
+
+   if (iter->index >= (int)HASH_SIZE(vhost_v

Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-05 Thread Steven Rostedt

For kicks, I wrote this program that uses libtracefs to search all CIDS
(1-255), and find the kvm guests that are attached to them.

It traces the sched_wakeup and kvm_exit, looking for:

 this_task -> wakeup -> wakeup -> kvm_exit

when doing a connect to a cid.

When it finds the pid that did a kvm_exit, it knows that's the PID that
is woken by the vhost worker task. It's a little slow, and I would
really like a better way to do this, but it's at least an option that
is available now.

-- Steve
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 

#define MAX_CID		256

static int this_pid;

static int open_vsock(unsigned int cid, unsigned int port)
{
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
		.svm_cid = cid,
		.svm_port = port,
	};
	int sd;

	sd = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (sd < 0)
		return -1;

	if (connect(sd, (struct sockaddr *), sizeof(addr)))
		return -1;

	return sd;
}

struct pids {
	struct pids		*next;
	int			pid;
};

struct trace_info {
	struct tracefs_instance		*instance;
	struct tep_handle		*tep;
	struct tep_event		*wake_up;
	struct tep_event		*kvm_exit;
	struct tep_format_field		*common_pid;
	struct tep_format_field		*wake_pid;
	struct pids			*pids;
	intcid;
	intpid;
};

static void tear_down_trace(struct trace_info *info)
{
	tracefs_instance_file_write(info->instance, "events/enable", "0");
	tracefs_instance_destroy(info->instance);
	tracefs_instance_free(info->instance);
	tep_free(info->tep);
}

static int setup_trace(struct trace_info *info)
{
	const char *systems[] = { "sched", "kvm", NULL};
	char *name;
	int ret;

	info->pids = NULL;

	ret = asprintf(, "vsock_find-%d\n", getpid());
	if (ret < 0)
		return ret;

	info->instance = tracefs_instance_create(name);
	free(name);
	if (!info->instance)
		return -1;

	tracefs_trace_off(info->instance);
	info->tep = tracefs_local_events_system(NULL, systems);
	if (!info->tep)
		goto fail;

	info->wake_up = tep_find_event_by_name(info->tep, "sched", "sched_waking");
	if (!info->wake_up) {
		fprintf(stderr, "Failed to find sched_waking\n");
		goto fail;
	}

	info->kvm_exit = tep_find_event_by_name(info->tep, "kvm", "kvm_exit");
	if (!info->kvm_exit) {
		fprintf(stderr, "Failed to find kvm_exit\n");
		goto fail;
	}

	info->wake_pid = tep_find_any_field(info->wake_up, "pid");
	if (!info->wake_pid) {
		fprintf(stderr, "Failed to find wake up pid\n");
		goto fail;
	}

	info->common_pid = tep_find_common_field(info->wake_up,
		 "common_pid");
	if (!info->common_pid) {
		fprintf(stderr, "Failed to find common pid\n");
		goto fail;
	}

	ret = tracefs_instance_file_write(info->instance, "events/sched/sched_waking/enable", "1");
	if (ret < 0) {
		fprintf(stderr, "Failed to enable sched_waking\n");
		goto fail;
	}

	ret = tracefs_instance_file_write(info->instance, "events/kvm/kvm_exit/enable", "1");
	if (ret < 0) {
		fprintf(stderr, "Failed to enable kvm_exit\n");
		goto fail;
	}

	return 0;
fail:
	tear_down_trace(info);
	return -1;
}


static void free_pids(struct pids *pids)
{
	struct pids *next;

	while (pids) {
		next = pids;
		pids = pids->next;
		free(next);
	}
}

static void add_pid(struct pids **pids, int pid)
{
	struct pids *new_pid;

	new_pid = malloc(sizeof(*new_pid));
	if (!new_pid)
		return;

	new_pid->pid = pid;
	new_pid->next = *pids;
	*pids = new_pid;
}

static bool match_pid(struct pids *pids, int pid)
{
	while (pids) {
		if (pids->pid == pid)
			return true;
		pids = pids->next;
	}
	return false;
}

static int callback(struct tep_event *event, struct tep_record *record,
		int cpu, void *data)
{
	struct trace_info *info = data;
	struct tep_handle *tep = info->tep;
	unsigned long long val;
	int type;
	int pid;
	int ret;

	ret = tep_read_number_field(info->common_pid, record->data, );
	if (ret < 0)
		return 0;

	pid = val;

	if (!match_pid(info->pids, pid))
		return 0;

	type = tep_data_type(tep, record);
	if (type == info->kvm_exit->id) {
		info->pid = pid;
		return -1;
	}

	if (type != info->wake_up->id)
		return 0;

	ret = tep_read_number_field(info->wake_pid, record->data, );
	if (ret < 0)
		return 0;

	add_pid(>pids, (int)val);
	return 0;
}

static void print_cid_pid(int cid, int pid)
{
	FILE *fp;
	char *path;
	char *buf = NULL;
	char *save;
	size_t l = 0;
	int tgid = -1;

	if (asprintf(, "/proc/%d/status", pid) < 0)
		return;

	fp = fopen(path, "r");
	free(path);
	if (!fp)
		return;

	while (getline(, , fp) > 0) {
		char *tok;

		if (strncmp(buf, "Tgid:", 5) != 0)
			continue;
		tok = strtok_r(buf, ":", );
		if (!tok)
			continue;
		tok = strtok_r(NULL, ":", );
		if (!tok)
			continue;
		while (isspace(*tok))
			tok++;
		tgid = strtol(tok, NULL, 0);
		break;
	}
	free(buf);

	if (tgid >= 0)
		printf("%d\t%d\n", cid, tgid);
}

static void find_cid(struct trace_info *info, int cid)
{
	int fd;

	add_pid(>pids, this_pid);

	tracefs_instance_file_clear(info->instance, "trace");
	tracefs_trace_on(info->instance);
	fd =

Re: [RFC PATCH v5 19/19] virtio/vsock: update trace event for SEQPACKET

2021-03-02 Thread Steven Rostedt

On Thu, 18 Feb 2021 08:42:15 +0300
Arseny Krasnov  wrote:

Not sure if this was pulled in yet, but I do have a small issue with this
patch.

> @@ -69,14 +82,19 @@ TRACE_EVENT(virtio_transport_alloc_pkt,
>   __entry->type = type;
>   __entry->op = op;
>   __entry->flags = flags;
> + __entry->msg_len = msg_len;
> + __entry->msg_cnt = msg_cnt;
>   ),
> - TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x",
> + TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x "
> +   "msg_len=%u msg_cnt=%u",

It's considered poor formatting to split strings like the above. This is
one of the exceptions for the 80 character limit. Do not break strings just
to keep it within 80 characters.

-- Steve


> __entry->src_cid, __entry->src_port,
> __entry->dst_cid, __entry->dst_port,
> __entry->len,
> show_type(__entry->type),
> show_op(__entry->op),
> -   __entry->flags)
> +   __entry->flags,
> +   __entry->msg_len,
> +   __entry->msg_cnt)
>  );
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v3 1/3] MAINTAINERS: Update maintainers for paravirt ops and VMware hypervisor interface

2021-11-11 Thread Steven Rostedt

On Thu, 11 Nov 2021 07:50:39 +0100
Greg KH  wrote:

> > Signed-off-by: Srivatsa S. Bhat (VMware) 
> > Acked-by: Alexey Makhalov 
> > Acked-by: Deep Shah 
> > Acked-by: Juergen Gross 
> > Cc: sta...@vger.kernel.org  
> 
> Why are MAINTAINERS updates needed for stable?  That's not normal :(

Probably not needed, but does it hurt?  And who's normal?

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 9/9] testmmiotrace: eliminate anonymous module_init & module_exit

2022-03-16 Thread Steven Rostedt

On Wed, 16 Mar 2022 12:20:10 -0700
Randy Dunlap  wrote:

> Eliminate anonymous module_init() and module_exit(), which can lead to
> confusion or ambiguity when reading System.map, crashes/oops/bugs,
> or an initcall_debug log.
> 
> Give each of these init and exit functions unique driver-specific
> names to eliminate the anonymous names.
> 
> Example 1: (System.map)
>  832fc78c t init
>  832fc79e t init
>  832fc8f8 t init
> 
> Example 2: (initcall_debug log)
>  calling  init+0x0/0x12 @ 1
>  initcall init+0x0/0x12 returned 0 after 15 usecs
>  calling  init+0x0/0x60 @ 1
>  initcall init+0x0/0x60 returned 0 after 2 usecs
>  calling  init+0x0/0x9a @ 1
>  initcall init+0x0/0x9a returned 0 after 74 usecs
> 
> Fixes: 8b7d89d02ef3 ("x86: mmiotrace - trace memory mapped IO")
> Signed-off-by: Randy Dunlap 
> Cc: Thomas Gleixner 
> Cc: Steven Rostedt 

Acked-by: Steven Rostedt (Google) 

-- Steve

> Cc: Ingo Molnar 
> Cc: Karol Herbst 
> Cc: Pekka Paalanen 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: nouv...@lists.freedesktop.org
> Cc: x...@kernel.org
> ---
>  arch/x86/mm/testmmiotrace.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> --- lnx-517-rc8.orig/arch/x86/mm/testmmiotrace.c
> +++ lnx-517-rc8/arch/x86/mm/testmmiotrace.c
> @@ -113,7 +113,7 @@ static void do_test_bulk_ioremapping(voi
>   synchronize_rcu();
>  }
>  
> -static int __init init(void)
> +static int __init testmmiotrace_init(void)
>  {
>   unsigned long size = (read_far) ? (8 << 20) : (16 << 10);
>   int ret = security_locked_down(LOCKDOWN_MMIOTRACE);
> @@ -136,11 +136,11 @@ static int __init init(void)
>   return 0;
>  }
>  
> -static void __exit cleanup(void)
> +static void __exit testmmiotrace_cleanup(void)
>  {
>   pr_debug("unloaded.\n");
>  }
>  
> -module_init(init);
> -module_exit(cleanup);
> +module_init(testmmiotrace_init);
> +module_exit(testmmiotrace_cleanup);
>  MODULE_LICENSE("GPL");

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH] virtio: Workaround fix for hard hang on guest using fifos

2022-03-25 Thread Steven Rostedt

This is more of a workaround patch and probably not the proper fix. But
I'm doing some work that is using fifos for guests and this is causing a
hang that is quite annoying.

I currently have this patch applied to continue my work.

I was working on analyzing data transfers between host and guests via
virtio sockets (FIFOs on host, dev on guest), vsockets and TCP packets.

I wrote a program to test each by passing a 1GB file and timing it. I'm
using the splice system call to help move things along. In doing so, I
found that my pipe between splice calls originally used "page_size" for
data transfer, and that is not as efficient as finding out what the pipe
size is. So I changed the code to use pipe_size and while debugging it, the
guest locked up hard.

I'm attaching the "agent-fifo" that runs on the guest, and the
"client-fifo" that runs on the host (the names may be backwards, but
makes sense when you add how I test vsockets and network packets).

Here's what I did:

 # ./client-fifo /var/lib/virt/Guest/trace-pipe-cpu0.out /test/bigfile

Where the trace-pipe-cpu0.out is the receiving side from the guest's virtio
pipe. The /test/bigfile is created when data starts coming in from the
guest pipe.

 # dd if=/dev/urandom of=bigfile bs=1024 count=1048576
 # ./agent-fifo /dev/virtio-ports/trace-pipe-cpu0 bigfile

With the updates to change the size being passed in the splice from
page_size to pipe_size, this never finished (it would copy around a meg or
so). And stopped. When I killed the agent-fifo task on the guest, the guest
hung hard.

Debugging this, I found that the guest is stuck in the loop in
drivers/char/virt_console.c: __send_control_msg():

if (virtqueue_add_outbuf(vq, sg, 1, >cpkt, GFP_ATOMIC) == 0) {
virtqueue_kick(vq);
while (!virtqueue_get_buf(vq, )
&& !virtqueue_is_broken(vq))
cpu_relax();
}

It never exits that loop. My workaround (this patch) is to put in a
timeout, and exit out if it spins there for more than 5 seconds. This
makes the problem go away.

Below is my changes, but this is a band-aid, it is not the cure.

Workaround-fix-by: Steven Rostedt (Google) 
---
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index e3c430539a17..65f259f3f8cb 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -551,6 +551,7 @@ static ssize_t __send_control_msg(struct ports_device 
*portdev, u32 port_id,
struct scatterlist sg[1];
struct virtqueue *vq;
unsigned int len;
+   u64 end;
 
if (!use_multiport(portdev))
return 0;
@@ -567,9 +568,15 @@ static ssize_t __send_control_msg(struct ports_device 
*portdev, u32 port_id,
 
if (virtqueue_add_outbuf(vq, sg, 1, >cpkt, GFP_ATOMIC) == 0) {
virtqueue_kick(vq);
+   end = jiffies + 5 * HZ;
while (!virtqueue_get_buf(vq, )
-   && !virtqueue_is_broken(vq))
+   && !virtqueue_is_broken(vq)) {
+   if (unlikely(end < jiffies)) {
+   dev_warn(>vdev->dev, "send_control_msg 
timed out!\n");
+   break;
+   }
cpu_relax();
+   }
}
 
spin_unlock(>c_ovq_lock);
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#define _GNU_SOURCE
#include 

#ifndef F_GETPIPE_SZ
# define F_GETPIPE_SZ	1032 /* The Linux number for the option */
#endif

const char *this_name = "agent-fifo";

static void usage(void)
{
	printf("usage: %s dev bigfile\n"
	   "\n",this_name);
	exit(-1);
}

static void __vdie(const char *fmt, va_list ap, int err)
{
	int ret = errno;
	const char *p = this_name;

	if (err && errno)
		perror(p);
	else
		ret = -1;

	fprintf(stderr, "  ");
	vfprintf(stderr, fmt, ap);

	fprintf(stderr, "\n");
	exit(ret);
}

void die(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	__vdie(fmt, ap, 0);
	va_end(ap);
}

void pdie(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	__vdie(fmt, ap, 1);
	va_end(ap);
}

static unsigned long long time()
{
	struct timeval tv;

	gettimeofday(, NULL);
	return tv.tv_sec * 100 + tv.tv_usec;
}

static void print_time(unsigned long long time)
{
	unsigned long long seconds;
	unsigned long long usecs;

	seconds = time / 100;
	usecs = time - (seconds * 100);

	printf("time: %llu.%06llu\n", seconds, usecs);
}

int main(int argc, char *argv[])
{
	unsigned long long start;
	unsigned long long end;
	struct stat st;
	off_t size;
	char *file;
	char *port;
	int page_size;
	int pipe_size;
	int brass[2];
	int ret;
	int cfd;
	int fd;

	if (argc < 3)
		usage();

	port = argv[1];
	file = argv[2];

	fd = ope

Re: [PATCH] virtio: Workaround fix for hard hang on guest using fifos

2022-03-25 Thread Steven Rostedt

On Fri, 25 Mar 2022 17:36:27 -0700
Linus Torvalds  wrote:

> On Fri, Mar 25, 2022 at 2:20 PM Steven Rostedt  wrote:
> >
> > With the updates to change the size being passed in the splice from
> > page_size to pipe_size, this never finished (it would copy around a meg or
> > so). And stopped. When I killed the agent-fifo task on the guest, the guest
> > hung hard.  
> 
> Without knowing (or really caring) at all how virtqueue works, this
> sounds very much like the classic pipe deadlock where two processes
> communicate over a pair of pipes, sending each other commands, and
> replying to each other with status updates.

It does look like this. Note, this is not due to my application, since it
flows in one direction, but I think the way virtio pipes are implemented
can cause this to occur.

A virtio pipe on the guest looks like a normal bidirectional char device.
You open it, read it, write to it. No problem. On the host, there are two
FIFOs that are attached to the guest. One for each direction. That is, the
host is using two pipes that convert to a character device on the guest.

> 
> And you absolutely cannot do that if one side can possibly want to up
> fill the whole pipe.
> 
> Deadlock:
> 
>  - process A is trying to send data to process B (on 'pipe_A'), and
> blocks because the pipe is full
> 
>  - process B is reads the data and everything is fine, and A gets to continue
> 

I think you left out a step here, where A is blocked again.

>  - but then process B sends some stratus update the other way (on
> 'pipe_B' - you can't use the same pipe for bidirectional, it's why you
> use a pair of pipes or a socketpair) and waits for the result.
> 
>  - now A and B are both waiting for each other - A is waiting for B to
> empty the big bunch of data it's sending, and B is waiting for the
> result for the (small) command it sent.
> 
> and neither makes any progress.
> 
> You can find several mentions of these kinds of problems by just
> googling for "bidirectional pipe deadlock" or similar.
> 
> The solution is invariably to either
> 
>  (a) make sure that nobody writes even remotely close to enough data
> to fill a pipe before reading the other pipe (you can still fill up a
> pipe, but at least somebody is always going to succeed and make
> progress and do the read to make progress).
> 
>  (b) make sure everybody who writes to a pipe will use nonblocking IO
> (and is willing to do reads in between to satisfy the other end).
> 
> That first case is basically what one of the PIPE_BUF guarantees is
> all about (the other one is the atomicity it guarantees, ie you can
> write a "command packet" and be guaranteed that readers will see it
> without data mixed in from other writes).
> 
> I have no idea what your client/agent does and how it interacts with
> the virtio pipes, but it really _sounds_ like a very similar issue,
> where it used to work (because PIPE_BUF) and now no longer does
> (because pipe filled).

So the agent writes to the virtio char device on the guest, while the
client is reading the guest's FIFO on the host.

One directional.


> 
> And that virtio_console __send_control_msg() pattern very much sounds
> like a "send data and wait for ACK" behavior of "process B".

Something happens where data stops transferring (the virtio system waiting
for status?) This mostly happens when I run the agent under gdb. Which it
stops transferring data. Everything is still fine. I hit Ctrl^C twice to
get back to the gdb command prompt, and then I kill the process with the
gdb "kill" command. It asks me if I'm sure, I type "y", and then boom! The
system is hung. The guest locks up.

And printk()s show that the exit of the agent is in that loop.

 watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [agent-test:1165]
 Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep 
bluetooth ecdh_generic ecc cfg80211 rfkill cirrus drm_shmem_helper 
drm_kms_helper vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common 
pcspkr joydev serio_raw vsock virtio_console drm virtio_balloon 8139too 
i2c_piix4 floppy nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net 
net_failover failover virtio_pci virtio virtio_pci_legacy_dev 
virtio_pci_modern_dev virtio_ring 8139cp mii ata_generic pata_acpi
 CPU: 0 PID: 1165 Comm: agent-test Not tainted 5.17.0-test+ #41
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 RIP: 0010:virtqueue_get_buf_ctx_split+0x0/0xd0 [virtio_ring]
 Code: 75 a6 45 31 e4 39 45 2c 74 e6 0f 0b 83 c6 01 48 83 c2 10 39 c6 75 b4 45 
31 e4 39 45 2c 74 d1 0f 0b 66 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 41 54 53 
80 7f 3b 00 0f 85 9a 00 00 00 48 8b 47 68
 RSP: 0018:b85c012e7c68 EFLAGS: 0246
 RAX:  RBX: 9517051713

Re: [PATCH 2/3] x86/xen: move paravirt lazy code

2023-09-13 Thread Steven Rostedt

On Wed, 13 Sep 2023 13:38:27 +0200
Juergen Gross  wrote:

> diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
> index 44a3f565264d..0577f0cdd231 100644
> --- a/include/trace/events/xen.h
> +++ b/include/trace/events/xen.h
> @@ -6,26 +6,26 @@
>  #define _TRACE_XEN_H
>  
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  struct multicall_entry;
>  
>  /* Multicalls */
>  DECLARE_EVENT_CLASS(xen_mc__batch,
> - TP_PROTO(enum paravirt_lazy_mode mode),
> + TP_PROTO(enum xen_lazy_mode mode),
>   TP_ARGS(mode),
>   TP_STRUCT__entry(
> - __field(enum paravirt_lazy_mode, mode)
> + __field(enum xen_lazy_mode, mode)
>   ),
>   TP_fast_assign(__entry->mode = mode),
>   TP_printk("start batch LAZY_%s",
> -   (__entry->mode == PARAVIRT_LAZY_MMU) ? "MMU" :
> -   (__entry->mode == PARAVIRT_LAZY_CPU) ? "CPU" : "NONE")
> +   (__entry->mode == XEN_LAZY_MMU) ? "MMU" :
> +   (__entry->mode == XEN_LAZY_CPU) ? "CPU" : "NONE")

There's helper functions that make the above easier to implement as well as
exports the symbols so that user space can parse this better:

TRACE_DEFINE_ENUM(XEN_LAZY_NONE);
TRACE_DEFINE_ENUM(XEN_LAZY_MMU);
TRACE_DEFINE_ENUM(XEN_LAZY_CPU);

[..]

TP_printk("start batch LAZY_%s",
  __print_symbolic(mode,
   { XEN_LAZY_NONE, "NONE" },
   { XEN_LAZY_MMU,  "MMU   },
   { XEN_LAZY_CPU,  "CPU"  }))

Then user space parsers that read the raw data can convert these events
into something humans can read.

-- Steve

>   );
>  #define DEFINE_XEN_MC_BATCH(name)\
>   DEFINE_EVENT(xen_mc__batch, name,   \
> - TP_PROTO(enum paravirt_lazy_mode mode), \
> + TP_PROTO(enum xen_lazy_mode mode),  \
>TP_ARGS(mode))
>  
>  DEFINE_XEN_MC_BATCH(xen_mc_batch);
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-14 Thread Steven Rostedt

On Thu, 9 Jun 2022 15:02:20 +0200
Petr Mladek  wrote:

> > I'm somewhat curious whether we can actually remove that trace event.  
> 
> Good question.
> 
> Well, I think that it might be useful. It allows to see trace and
> printk messages together.

Yes people still use it. I was just asked about it at Kernel Recipes. That
is, someone wanted printk mixed in with the tracing, and I told them about
this event (which they didn't know about but was happy to hear that it
existed).

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v2 33/44] ftrace: WARN on rcuidle

2022-09-26 Thread Steven Rostedt



Nit, the subject should have "tracing:" an not "ftrace:" as the former
encompasses the tracing infrastructure and the latter is for the function
hook part of that.

On Mon, 19 Sep 2022 12:00:12 +0200
Peter Zijlstra  wrote:

> CONFIG_GENERIC_ENTRY disallows any and all tracing when RCU isn't
> enabled.
> 
> XXX if s390 (the only other GENERIC_ENTRY user as of this writing)
> isn't comfortable with this, we could switch to
> HAVE_NOINSTR_VALIDATION which is x86_64 only atm.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/tracepoint.h |   13 -
>  kernel/trace/trace.c   |3 +++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -178,6 +178,16 @@ static inline struct tracepoint *tracepo
>  #endif /* CONFIG_HAVE_STATIC_CALL */
>  
>  /*
> + * CONFIG_GENERIC_ENTRY archs are expected to have sanitized entry and idle
> + * code that disallow any/all tracing/instrumentation when RCU isn't 
> watching.
> + */
> +#ifdef CONFIG_GENERIC_ENTRY
> +#define RCUIDLE_COND(rcuidle)(rcuidle)
> +#else

Should probably move the below comment to here:

 /* srcu can't be used from NMI */

> +#define RCUIDLE_COND(rcuidle)(rcuidle && in_nmi())
> +#endif
> +
> +/*
>   * it_func[0] is never NULL because there is at least one element in the 
> array
>   * when the array itself is non NULL.
>   */
> @@ -189,7 +199,8 @@ static inline struct tracepoint *tracepo
>   return; \
>   \
>   /* srcu can't be used from NMI */   \

And remove the above.

-- Steve

> - WARN_ON_ONCE(rcuidle && in_nmi());  \
> + if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\
> + return; \
>   \
>   /* keep srcu and sched-rcu usage consistent */  \
>   preempt_disable_notrace();  \
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3104,6 +3104,9 @@ void __trace_stack(struct trace_array *t
>   return;
>   }
>  
> + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_GENERIC_ENTRY)))
> + return;
> +
>   /*
>* When an NMI triggers, RCU is enabled via ct_nmi_enter(),
>* but if the above rcu_is_watching() failed, then the NMI
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Steven Rostedt

On Mon, 23 Jan 2023 21:50:12 +0100
Peter Zijlstra  wrote:

> All RCU disabled code should be noinstr and hence we should never get
> here -- when we do, WARN about it and make sure to not actually do
> tracing.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/kernel/ftrace.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
>   if (unlikely(atomic_read(>tracing_graph_pause)))
>   return;
>  
> + if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
> + return;
> +

Please add this to after recursion trylock below. Although WARN_ONCE()
should not not have recursion issues, as function tracing can do weird
things, I rather be safe than sorry, and not have the system triple boot
due to some path that might get added in the future.

If rcu_is_watching() is false, it will still get by the below recursion
check and warn. That is, the below check should be done before this
function calls any other function.

>   bit = ftrace_test_recursion_trylock(ip, *parent);
>   if (bit < 0)
>   return;
> 

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Steven Rostedt

On Mon, 23 Jan 2023 16:53:04 -0500
Steven Rostedt  wrote:

> On Mon, 23 Jan 2023 21:50:12 +0100
> Peter Zijlstra  wrote:
> 
> > All RCU disabled code should be noinstr and hence we should never get
> > here -- when we do, WARN about it and make sure to not actually do
> > tracing.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> >  arch/x86/kernel/ftrace.c |3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
> > if (unlikely(atomic_read(>tracing_graph_pause)))
> > return;
> >  
> > +   if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
> > +   return;
> > +  
> 
> Please add this to after recursion trylock below. Although WARN_ONCE()
> should not not have recursion issues, as function tracing can do weird
> things, I rather be safe than sorry, and not have the system triple boot
> due to some path that might get added in the future.
> 
> If rcu_is_watching() is false, it will still get by the below recursion
> check and warn. That is, the below check should be done before this
> function calls any other function.
> 
> > bit = ftrace_test_recursion_trylock(ip, *parent);
> > if (bit < 0)
> > return;
> >   
> 

Actually, perhaps we can just add this, and all you need to do is create
and set CONFIG_NO_RCU_TRACING (or some other name).

This should cover all ftrace locations. (Uncompiled).

-- Steve

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index c303f7a114e9..10ee3fbb9113 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,22 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
+#ifdef CONFIG_NO_RCU_TRACING
+# define trace_warn_on_no_rcu(ip)  \
+   ({  \
+   bool __ret = false; \
+   if (!trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
+   trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
+   __ret = WARN_ONCE(!rcu_is_watching(),   \
+ "RCU not on for: %pS\n", (void *)ip); 
\
+   trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \
+   }   \
+   __ret;  \
+   })
+#else
+# define trace_warn_on_no_rcu(ip)  false
+#endif
+
 /*
  * Preemption is promised to be disabled when return bit >= 0.
  */
@@ -144,6 +160,9 @@ static __always_inline int 
trace_test_and_set_recursion(unsigned long ip, unsign
unsigned int val = READ_ONCE(current->trace_recursion);
int bit;
 
+   if (trace_warn_on_no_rcu(ip))
+   return -1;
+
bit = trace_get_context_bit() + start;
if (unlikely(val & (1 << bit))) {
/*
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v2 0/6] use canonical ftrace path whenever possible

2023-03-10 Thread Steven Rostedt

On Fri, 10 Mar 2023 12:06:58 -0700
Ross Zwisler  wrote:

> On Fri, Mar 10, 2023 at 03:29:49AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 15, 2023 at 03:33:44PM -0700, Ross Zwisler wrote:  
> > > Changes in v2:
> > >  * Dropped patches which were pulled into maintainer trees.
> > >  * Split BPF patches out into another series targeting bpf-next.
> > >  * trace-agent now falls back to debugfs if tracefs isn't present.
> > >  * Added Acked-by from m...@redhat.com to series.
> > >  * Added a typo fixup for the virtio-trace README.
> > > 
> > > Steven, assuming there are no objections, would you feel comfortable
> > > taking this series through your tree?  
> > 
> > Acked-by: Michael S. Tsirkin 
> > 
> > if you want the virtio patches through my tree after all, let me know.  
> 
> Yes, please, that would be great.  I'll send out v3 with the few patches that
> haven't gotten a response, but I'll drop the virtio patches assuming you've
> got them.
>

Since the last patch 6/6 is not tracing related, I would prefer it to go
through the virtio tree.

For patches 1-5, please add:

Reviewed-by: Steven Rostedt (Google) 

Thanks!

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 01/20] x86: move prepare_ftrace_return prototype to header

2023-05-16 Thread Steven Rostedt

On Tue, 16 May 2023 21:35:30 +0200
Arnd Bergmann  wrote:

> From: Arnd Bergmann 
> 
> On 32-bit builds, the prepare_ftrace_return() function only has a global
> definition, but no prototype before it, which causes a warning:
> 
> arch/x86/kernel/ftrace.c:625:6: warning: no previous prototype for 
> ‘prepare_ftrace_return’ [-Wmissing-prototypes]
>   625 | void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
> 
> Move the prototype that is already needed for some configurations into
> a header file where it can be seen unconditionally.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/x86/include/asm/ftrace.h | 3 +++
>  arch/x86/kernel/ftrace.c  | 3 ---
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index 5061ac98ffa1..b8d4a07f9595 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h

Acked-by: Steven Rostedt (Google) 

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

69 matches

Mail list logo