from:"John"

Re: kvmclock doesn't work, help?

2015-12-18 Thread John Stultz

On Fri, Dec 18, 2015 at 12:25 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> [cc: John Stultz -- maybe you have ideas on how this should best
> integrate with the core code]
>
> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
>> On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
>>> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosa...@redhat.com> 
>>> wrote:
>>> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>>> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> 
>>> >> wrote:
>>> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>>> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti 
>>> >> >> <mtosa...@redhat.com> wrote:
>>> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>>> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski 
>>> >> >> >> <l...@amacapital.net> wrote:
>>> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini 
>>> >> >> >> > <pbonz...@redhat.com> wrote:
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>>> >> >> >> >>> > RAW TSC NTP corrected TSC
>>> >> >> >> >>> > t0  10  10
>>> >> >> >> >>> > t1  20  19.99
>>> >> >> >> >>> > t2  30  29.98
>>> >> >> >> >>> > t3  40  39.97
>>> >> >> >> >>> > t4  50  49.96
>>> >> >
>>> >> > (1)
>>> >> >
>>> >> >> >> >>> >
>>> >> >> >> >>> > ...
>>> >> >> >> >>> >
>>> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>>> >> >> >> >>> > you can see what will happen.
>>> >> >> >> >>>
>>> >> >> >> >>> Sure, but why would you ever switch from one to the other?
>>> >> >> >> >>
>>> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  
>>> >> >> >> >> After
>>> >> >> >> >> resume, the TSC certainly increases at the same rate as before, 
>>> >> >> >> >> but the
>>> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased 
>>> >> >> >> >> slower
>>> >> >> >> >> than the guest kvmclock.
>>> >> >> >> >
>>> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>>> >> >> >> >
>>> >> >> >> > If it's the host's, then wouldn't systemtime be reset after 
>>> >> >> >> > resume to
>>> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>>> >> >> >> > backwards.
>>> >> >> >> >
>>> >> >> >> > If it's the guest's, then the guest's NTP correction is applied 
>>> >> >> >> > on top
>>> >> >> >> > of kvmclock, and this shouldn't matter.
>>> >> >> >> >
>>> >> >> >> > I still feel like I'm missing something very basic here.
>>> >> >> >> >
>>> >> >> >>
>>> >> >> >> OK, I think I get it.
>>> >> >> >>
>>> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the 
>>> >> >> >> host's
>>> >> >> >> correction to the guest.  If it did, indeed, propagate the 
>>> >> >> >> correction
>>> >> >> >> then, after resume, the host's new system_time would match the 
>>> >> >> >> g

Re: [PATCH 0/7] kvmtool: Cleanup kernel loading

2015-11-02 Thread Dimitri John Ledkov

On 2 November 2015 at 14:58, Will Deacon  wrote:
> On Fri, Oct 30, 2015 at 06:26:53PM +, Andre Przywara wrote:
>> Hi,
>
> Hello Andre,
>
>> this series cleans up kvmtool's kernel loading functionality a bit.
>> It has been broken out of a previous series I sent [1] and contains
>> just the cleanup and bug fix parts, which should be less controversial
>> and thus easier to merge ;-)
>> I will resend the pipe loading part later on as a separate series.
>>
>> The first patch properly abstracts kernel loading to move
>> responsibility into each architecture's code. It removes quite some
>> ugly code from the generic kvm.c file.
>> The later patches address the naive usage of read(2) to, well, read
>> data from files. Doing this without coping with the subtleties of
>> the UNIX read semantics (returning with less or none data read is not
>> an error) can provoke hard to debug failures.
>> So these patches make use of the existing and one new wrapper function
>> to make sure we read everything we actually wanted to.
>> The last patch moves the ARM kernel loading code into the proper
>> location to be in line with the other architectures.
>>
>> Please have a look and give some comments!
>
> Looks good to me, but I'd like to see some comments from some mips/ppc/x86
> people on the changes you're making over there.

Looks mostly good to me, as one of the kvmtool down streams. Over at
https://github.com/clearlinux/kvmtool we have some patches to tweak
the x86 boot flow, which will need rebasing/retweaking) specifically
this commit here -
https://github.com/clearlinux/kvmtool/commit/a8dee709f85735d16739d0eda0cc00d3c1b17477

-- 
Regards,

Dimitri.
53 sleeps till Christmas, or less

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] kvmtool: Cleanup kernel loading

2015-11-02 Thread Dimitri John Ledkov

On 2 November 2015 at 14:58, Will Deacon  wrote:
> On Fri, Oct 30, 2015 at 06:26:53PM +, Andre Przywara wrote:
>> Hi,
>
> Hello Andre,
>
>> this series cleans up kvmtool's kernel loading functionality a bit.
>> It has been broken out of a previous series I sent [1] and contains
>> just the cleanup and bug fix parts, which should be less controversial
>> and thus easier to merge ;-)
>> I will resend the pipe loading part later on as a separate series.
>>
>> The first patch properly abstracts kernel loading to move
>> responsibility into each architecture's code. It removes quite some
>> ugly code from the generic kvm.c file.
>> The later patches address the naive usage of read(2) to, well, read
>> data from files. Doing this without coping with the subtleties of
>> the UNIX read semantics (returning with less or none data read is not
>> an error) can provoke hard to debug failures.
>> So these patches make use of the existing and one new wrapper function
>> to make sure we read everything we actually wanted to.
>> The last patch moves the ARM kernel loading code into the proper
>> location to be in line with the other architectures.
>>
>> Please have a look and give some comments!
>
> Looks good to me, but I'd like to see some comments from some mips/ppc/x86
> people on the changes you're making over there.

Looks mostly good to me, as one of the kvmtool down streams. Over at
https://github.com/clearlinux/kvmtool we have some patches to tweak
the x86 boot flow, which will need rebasing/retweaking) specifically
this commit here -
https://github.com/clearlinux/kvmtool/commit/a8dee709f85735d16739d0eda0cc00d3c1b17477

-- 
Regards,

Dimitri.
53 sleeps till Christmas, or less

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH kvmtool] Skip a few messages by default: command line args; flat binary; earlyprintk.

2015-10-06 Thread Dimitri John Ledkov

Hello,

A bit of context. I'm from Clear Linux* Project for Intel Architecture
and we use lkvm as the hypervisor in Clear Containers for Docker
Engine http://blog.surgut.co.uk/2015/09/clear-containers-for-docker-engine.html

For us, we really do not want any output coming from the hypervisor,
or kernel, or init. We use systemd, and the only unit that has TTY
connected for output is the ultimate docker workload user has
requested. Thus I am avert --vey-quiet mode for lkvm. I understand
that this is outside of the usual intended use-case for kvmtool (i.e.
kernel development), but it's really lean and nice to work with.

On 30 September 2015 at 17:11, Andre Przywara <andre.przyw...@arm.com> wrote:
> Hi Dimitri,
>
> thanks for sharing your patches.
>
> On 29/09/15 17:59, Dimitri John Ledkov wrote:
>> The partial command line args & earlyprintk=serial are still enabled
>> in the debug mode. Warning that a flat binary kernel image is attemped
>> to be loaded is completely dropped. These are not that informative,
>> once one is past intial debugging, and only polute the console.
>>
>> Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
>> ---
>>  builtin-run.c | 10 ++
>>  kvm.c |  1 -
>>  x86/kvm.c |  8 ++--
>>  3 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/builtin-run.c b/builtin-run.c
>> index e0c8732..8edbf88 100644
>> --- a/builtin-run.c
>> +++ b/builtin-run.c
>> @@ -613,10 +613,12 @@ static struct kvm *kvm_cmd_run_init(int argc, const 
>> char **argv)
>>
>>   kvm->cfg.real_cmdline = real_cmdline;
>>
>> - printf("  # %s run -k %s -m %Lu -c %d --name %s\n", KVM_BINARY_NAME,
>> - kvm->cfg.kernel_filename,
>> - (unsigned long long)kvm->cfg.ram_size / 1024 / 1024,
>> - kvm->cfg.nrcpus, kvm->cfg.guest_name);
>> + if (do_debug_print) {
>> + printf("  # %s run -k %s -m %Lu -c %d --name %s\n", 
>> KVM_BINARY_NAME,
>> +kvm->cfg.kernel_filename,
>> +(unsigned long long)kvm->cfg.ram_size / 1024 / 1024,
>> +kvm->cfg.nrcpus, kvm->cfg.guest_name);
>> + }
>
> I like the general idea. In fact I have this very patch (among others)
> in my tree too. I applied similar guarding to other messages as well
> (mostly those that only show up on ARM, but also the "ended normally"
> message). Like any good UNIX tool kvmtool should keep quiet if it has
> nothing worthwhile to say ;-)
> But looking at it more closely, I see that there is pr_debug() defined
> doing that "if (do_debug_print)" already. The only issue is that is
> prints source line information, which is not really useful here. But
> then again there does not seem to be any user of it?
>

I'd be happy to change these to pr_debug() messages.


> So what about the following:
> - We avoid printing pr_info() messages in the default case. Looking at
> its current users in the tree this information is not really useful for
> normal users. We enable pr_info() output only if do_debug_print is
> enabled or introduce another command line flag (--verbose?) for that.
> - We check each user of pr_info() to see whether this information is
> actually "informational" or whether it should be converted to pr_warn.
> - We change the above line to use pr_info instead of printf.

Sounds good to me. That should work as well.

> - We fix the EOL mayhem we have atm while at it.
>

Not quite sure what you mean by `EOL mayhem' could you please elaborate?

> If you don't mind I will give this a try later this week.
>

Thumbs up!


>>
>>   if (init_list__init(kvm) < 0)
>>   die ("Initialisation failed");
>> diff --git a/kvm.c b/kvm.c
>> index 10ed230..1081072 100644
>> --- a/kvm.c
>> +++ b/kvm.c
>> @@ -378,7 +378,6 @@ bool kvm__load_kernel(struct kvm *kvm, const char 
>> *kernel_filename,
>>   if (ret)
>>   goto found_kernel;
>>
>> - pr_warning("%s is not a bzImage. Trying to load it as a flat 
>> binary...", kernel_filename);
>
> I think on x86 this message is useful to have: to point people to the
> fact that they are trying to load a kernel which most probably isn't one.
> Do you actually load a "flat binary", so not a Linux bzImage? If yes,
> what is it? Does this work for you? I didn't have the impression that
> this code was actually used at all.

We do use flat kernel loading, and we have extra patches for that. But
I haven't reconciled that with current up

[PATCH kvmtool] kvmtool: expose the TSC Deadline Timer feature to the guest

2015-09-29 Thread Dimitri John Ledkov

From: Arjan van de Ven <ar...@linux.intel.com>

with the TSC deadline timer feature, we don't need to calibrate the apic
timers anymore, which saves more than 100 milliseconds of boot time.

Signed-off-by: Arjan van de Ven <ar...@linux.intel.com>
Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
---
 x86/cpuid.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/x86/cpuid.c b/x86/cpuid.c
index c3b67d9..1d8bd23 100644
--- a/x86/cpuid.c
+++ b/x86/cpuid.c
@@ -31,6 +31,9 @@ static void filter_cpuid(struct kvm_cpuid2 *kvm_cpuid)
/* Set X86_FEATURE_HYPERVISOR */
if (entry->index == 0)
entry->ecx |= (1 << 31);
+/* Set CPUID_EXT_TSC_DEADLINE_TIMER*/
+   if (entry->index == 0)
+   entry->ecx |= (1 << 24);
break;
case 6:
/* Clear X86_FEATURE_EPB */
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH kvmtool] Skip a few messages by default: command line args; flat binary; earlyprintk.

2015-09-29 Thread Dimitri John Ledkov

The partial command line args & earlyprintk=serial are still enabled
in the debug mode. Warning that a flat binary kernel image is attemped
to be loaded is completely dropped. These are not that informative,
once one is past intial debugging, and only polute the console.

Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
---
 builtin-run.c | 10 ++
 kvm.c |  1 -
 x86/kvm.c |  8 ++--
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin-run.c b/builtin-run.c
index e0c8732..8edbf88 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -613,10 +613,12 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
 
kvm->cfg.real_cmdline = real_cmdline;
 
-   printf("  # %s run -k %s -m %Lu -c %d --name %s\n", KVM_BINARY_NAME,
-   kvm->cfg.kernel_filename,
-   (unsigned long long)kvm->cfg.ram_size / 1024 / 1024,
-   kvm->cfg.nrcpus, kvm->cfg.guest_name);
+   if (do_debug_print) {
+   printf("  # %s run -k %s -m %Lu -c %d --name %s\n", 
KVM_BINARY_NAME,
+  kvm->cfg.kernel_filename,
+  (unsigned long long)kvm->cfg.ram_size / 1024 / 1024,
+  kvm->cfg.nrcpus, kvm->cfg.guest_name);
+   }
 
if (init_list__init(kvm) < 0)
die ("Initialisation failed");
diff --git a/kvm.c b/kvm.c
index 10ed230..1081072 100644
--- a/kvm.c
+++ b/kvm.c
@@ -378,7 +378,6 @@ bool kvm__load_kernel(struct kvm *kvm, const char 
*kernel_filename,
if (ret)
goto found_kernel;
 
-   pr_warning("%s is not a bzImage. Trying to load it as a flat 
binary...", kernel_filename);
 #endif
 
ret = load_elf_binary(kvm, fd_kernel, fd_initrd, kernel_cmdline);
diff --git a/x86/kvm.c b/x86/kvm.c
index 512ad67..4a5fa41 100644
--- a/x86/kvm.c
+++ b/x86/kvm.c
@@ -124,8 +124,12 @@ void kvm__arch_set_cmdline(char *cmdline, bool video)
"i8042.dumbkbd=1 i8042.nopnp=1");
if (video)
strcat(cmdline, " video=vesafb console=tty0");
-   else
-   strcat(cmdline, " console=ttyS0 earlyprintk=serial 
i8042.noaux=1");
+   else {
+   strcat(cmdline, " console=ttyS0 i8042.noaux=1");
+   if (do_debug_print) {
+   strcat(cmdline, " earlyprintk=serial");
+   }
+   }
 }
 
 /* Architecture-specific KVM init */
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvmtool: don't rely on $HOME

2015-09-18 Thread Dimitri John Ledkov

Hello,

On 17 September 2015 at 15:03, Alban Crequy  wrote:
> kvm__set_dir() called in main() and kvm__get_dir() rely on $HOME. But in
> some environments (such as starting lkvm through systemd-run), $HOME is
> undefined. This causes bind() to use a socket path containing "(null)"
> and this fails. The current code does not check errors returned by
> realpath().
>
> Symptoms:
>
> | bind: No such file or directory
> |   Error: Failed adding socket to epoll
> |  Warning: Failed init: kvm_ipc__init
> |
> |  Fatal: Initialisation failed
>
> This bug was first reported on https://github.com/coreos/rkt/issues/1393
>
> Instead of using "$HOME/.lkvm/" (i.e. "/root/.lkvm/"), this patch uses
> "/var/lib/lkvm/". This also improve the error reporting by printing the
> socket filename.
>

kvmtool is used as the hypervisor in Clear Containers, at clearlinux.org.

I have seen similar issue, and e.g. I would welcome a --pid or
--socket options in lkvm, however I wouldn't want to remove the
current HOME handling as it is good enough.

What we have done, is essentially pass a HOME variable to the location
we want lkvm to use. Sure, a subfolder will be created. Can you simply
set appropriate HOME variable?

Failing that, lkvm could migrate to XDG spec and use XDG_CONFIG_HOME &
XDG_CONFIG_DIRS to store things, which should work for both regular
users and restricted environments (ending up with /etc/xdg/lkvm for
home-less case).

Regards,

Dimitri.

> Signed-off-by: Alban Crequy 
> ---
>  include/kvm/kvm.h |  5 ++---
>  kvm-ipc.c |  1 +
>  kvm.c | 26 --
>  main.c|  6 +-
>  4 files changed, 16 insertions(+), 22 deletions(-)
>
> diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> index 37155db..368f256 100644
> --- a/include/kvm/kvm.h
> +++ b/include/kvm/kvm.h
> @@ -16,8 +16,7 @@
>  #define SIGKVMEXIT (SIGRTMIN + 0)
>  #define SIGKVMPAUSE(SIGRTMIN + 1)
>
> -#define KVM_PID_FILE_PATH  "/.lkvm/"
> -#define HOME_DIR   getenv("HOME")
> +#define KVM_PID_FILE_PATH  "/var/lib/lkvm/"
>  #define KVM_BINARY_NAME"lkvm"
>
>  #ifndef PAGE_SIZE
> @@ -70,7 +69,7 @@ struct kvm {
> int vm_state;
>  };
>
> -void kvm__set_dir(const char *fmt, ...);
> +int kvm__set_dir(void);
>  const char *kvm__get_dir(void);
>
>  int kvm__init(struct kvm *kvm);
> diff --git a/kvm-ipc.c b/kvm-ipc.c
> index 857b0dc..d94456c 100644
> --- a/kvm-ipc.c
> +++ b/kvm-ipc.c
> @@ -60,6 +60,7 @@ static int kvm__create_socket(struct kvm *kvm)
> r = bind(s, (struct sockaddr *), len);
> if (r < 0) {
> perror("bind");
> +   pr_err("Cannot bind on %s", full_name);
> goto fail;
> }
>
> diff --git a/kvm.c b/kvm.c
> index 10ed230..482f47b 100644
> --- a/kvm.c
> +++ b/kvm.c
> @@ -63,31 +63,21 @@ extern struct kvm_ext kvm_req_ext[];
>
>  static char kvm_dir[PATH_MAX];
>
> -static int set_dir(const char *fmt, va_list args)
> +int kvm__set_dir(void)
>  {
> -   char tmp[PATH_MAX];
> +   int err;
>
> -   vsnprintf(tmp, sizeof(tmp), fmt, args);
> -
> -   mkdir(tmp, 0777);
> -
> -   if (!realpath(tmp, kvm_dir))
> -   return -errno;
> +   err = mkdir(KVM_PID_FILE_PATH, 0700);
> +   if (err != 0 && errno != EEXIST) {
> +   perror("mkdir " KVM_PID_FILE_PATH);
> +   return 1;
> +   }
>
> -   strcat(kvm_dir, "/");
> +   snprintf(kvm_dir, sizeof(kvm_dir), KVM_PID_FILE_PATH);
>
> return 0;
>  }
>
> -void kvm__set_dir(const char *fmt, ...)
> -{
> -   va_list args;
> -
> -   va_start(args, fmt);
> -   set_dir(fmt, args);
> -   va_end(args);
> -}
> -
>  const char *kvm__get_dir(void)
>  {
> return kvm_dir;
> diff --git a/main.c b/main.c
> index 05bc82c..22cc4e2 100644
> --- a/main.c
> +++ b/main.c
> @@ -13,7 +13,11 @@ static int handle_kvm_command(int argc, char **argv)
>
>  int main(int argc, char *argv[])
>  {
> -   kvm__set_dir("%s/%s", HOME_DIR, KVM_PID_FILE_PATH);
> +   int ret;
> +
> +   ret = kvm__set_dir();
> +   if (ret != 0)
> +   return ret;
>
> return handle_kvm_command(argc - 1, [1]);
>  }
> --
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards,

Dimitri.
98 sleeps till Christmas

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvmtool: don't rely on $HOME

2015-09-18 Thread Dimitri John Ledkov

On 18 September 2015 at 13:56, Will Deacon  wrote:
> On Fri, Sep 18, 2015 at 11:51:37AM +0100, Riku Voipio wrote:
>> On 17 September 2015 at 18:53, Will Deacon  wrote:
>> > On Thu, Sep 17, 2015 at 03:03:15PM +0100, Alban Crequy wrote:
>> >> kvm__set_dir() called in main() and kvm__get_dir() rely on $HOME. But in
>> >> some environments (such as starting lkvm through systemd-run), $HOME is
>> >> undefined. This causes bind() to use a socket path containing "(null)"
>> >> and this fails. The current code does not check errors returned by
>> >> realpath().
>> >>
>> >> Symptoms:
>> >>
>> >> | bind: No such file or directory
>> >> |   Error: Failed adding socket to epoll
>> >> |  Warning: Failed init: kvm_ipc__init
>> >> |
>> >> |  Fatal: Initialisation failed
>> >>
>> >> This bug was first reported on https://github.com/coreos/rkt/issues/1393
>> >>
>> >> Instead of using "$HOME/.lkvm/" (i.e. "/root/.lkvm/"), this patch uses
>> >> "/var/lib/lkvm/". This also improve the error reporting by printing the
>> >> socket filename.
>> >
>> > Hmm, but that requires lkvm to be run with sufficient privileges to
>> > write to /var/lib, which I don't think is generally the case. I think we
>> > have a few options:
>> >
>> >   (1) Try /var/lib/lkvm if $HOME is NULL
>> >   (2) Use an alternative environment variable for the pid prefix
>> >   (3) Add a --pid command line option for the pidfile
>> >   (4) ???ow
>> >
>> > Any preferences? What do other projects do?
>>
>> The right place to put a pid file would be $XDG_RUNTIME_DIR aka
>> /run/user/$UID/. System services write their pidfiles and sockets to
>> plain /run
>
> Ok, that certainly sounds like the right things to do for temporary
> structures then. Thanks.
>
>> The place where to put the rootfs structure created in builtin-setup.c
>> is more complicated. Is the  created rootfs epheremal? Then sticking
>> under /run(user/UID) is fine.
>
> Currently, the filesystem persists across multiple invocations of lkvm,
> so I can imagine somebody being surprised if what they thought was
> persistent was lost over something like a reboot.

http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html

In that case that data could go into XDG_DATA_HOME, XDG_CONFIG_HOME,
or XDG_CACHE_HOME depending on what semantic meaning it has.

-- 
Regards,

Dimitri.
98 sleeps till Christmas

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 kvmtool] Make static libc and guest-init functionality optional.

2015-09-16 Thread Dimitri John Ledkov

Hello Will,

Looks good to me =)

On 15 September 2015 at 18:20, Will Deacon <will.dea...@arm.com> wrote:
> Hi Dmitri,
>
> On Fri, Sep 11, 2015 at 03:40:00PM +0100, Dimitri John Ledkov wrote:
>> If one typically only boots full disk-images, one wouldn't necessaraly
>> want to statically link glibc, for the guest-init feature of the
>> kvmtool. As statically linked glibc triggers haevy security
>> maintainance.
>>
>> Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
>> ---
>>  Changes since v1:
>>  - rename CONFIG_HAS_LIBC to CONFIG_GUEST_INIT for clarity
>>  - use more ifdefs, instead of runtime check of _binary_guest_init_size==0
>
> The idea looks good, but I think we can tidy some of this up at the same
> time by moving all the guest_init code in builtin_setup.c.
>
> How about the patch below?
>
> Will
>
> --->8
>
> From cdce942c1a3a04635065a7972ca4e21386664756 Mon Sep 17 00:00:00 2001
> From: Dimitri John Ledkov <dimitri.j.led...@intel.com>
> Date: Fri, 11 Sep 2015 15:40:00 +0100
> Subject: [PATCH] Make static libc and guest-init functionality optional.
>
> If one typically only boots full disk-images, one wouldn't necessaraly
> want to statically link glibc, for the guest-init feature of the
> kvmtool. As statically linked glibc triggers haevy security
> maintainance.
>
> Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
> [will: moved all the guest_init handling into builtin_setup.c]
> Signed-off-by: Will Deacon <will.dea...@arm.com>
> ---
>  Makefile| 12 +++-
>  builtin-run.c   | 29 +
>  builtin-setup.c | 19 ++-
>  include/kvm/builtin-setup.h |  1 +
>  4 files changed, 23 insertions(+), 38 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 7b17d529d13b..f1701aa7b8ec 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -34,8 +34,6 @@ bindir_SQ = $(subst ','\'',$(bindir))
>  PROGRAM:= lkvm
>  PROGRAM_ALIAS := vm
>
> -GUEST_INIT := guest/init
> -
>  OBJS   += builtin-balloon.o
>  OBJS   += builtin-debug.o
>  OBJS   += builtin-help.o
> @@ -279,8 +277,13 @@ ifeq ($(LTO),1)
> endif
>  endif
>
> -ifneq ($(call try-build,$(SOURCE_STATIC),,-static),y)
> -$(error No static libc found. Please install glibc-static package.)
> +ifeq ($(call try-build,$(SOURCE_STATIC),,-static),y)
> +   CFLAGS  += -DCONFIG_GUEST_INIT
> +   GUEST_INIT  := guest/init
> +   GUEST_OBJS  = guest/guest_init.o
> +else
> +   $(warning No static libc found. Skipping guest init)
> +   NOTFOUND+= static-libc
>  endif
>
>  ifeq (y,$(ARCH_WANT_LIBFDT))
> @@ -356,7 +359,6 @@ c_flags = -Wp,-MD,$(depfile) $(CFLAGS)
>  # $(OTHEROBJS) are things that do not get substituted like this.
>  #
>  STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
> -GUEST_OBJS = guest/guest_init.o
>
>  $(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
> $(E) "  LINK" $@
> diff --git a/builtin-run.c b/builtin-run.c
> index 1ee75ad3f010..e0c87329e52b 100644
> --- a/builtin-run.c
> +++ b/builtin-run.c
> @@ -59,9 +59,6 @@ static int  kvm_run_wrapper;
>
>  bool do_debug_print = false;
>
> -extern char _binary_guest_init_start;
> -extern char _binary_guest_init_size;
> -
>  static const char * const run_usage[] = {
> "lkvm run [] []",
> NULL
> @@ -345,30 +342,6 @@ void kvm_run_help(void)
> usage_with_options(run_usage, options);
>  }
>
> -static int kvm_setup_guest_init(struct kvm *kvm)
> -{
> -   const char *rootfs = kvm->cfg.custom_rootfs_name;
> -   char tmp[PATH_MAX];
> -   size_t size;
> -   int fd, ret;
> -   char *data;
> -
> -   /* Setup /virt/init */
> -   size = (size_t)&_binary_guest_init_size;
> -   data = (char *)&_binary_guest_init_start;
> -   snprintf(tmp, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), rootfs);
> -   remove(tmp);
> -   fd = open(tmp, O_CREAT | O_WRONLY, 0755);
> -   if (fd < 0)
> -   die("Fail to setup %s", tmp);
> -   ret = xwrite(fd, data, size);
> -   if (ret < 0)
> -   die("Fail to setup %s", tmp);
> -   close(fd);
> -
> -   return 0;
> -}
> -
>  static int kvm_run_set_sandbox(struct kvm *kvm)
>  {
> const char *guestfs_name = kvm->cfg.custom_rootfs_name;
> @@ -631,7 +604,7 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
> **argv)
>
> if (!kvm

Re: [PATCH kvmtool] Make static libc and guest-init functionality optional.

2015-09-11 Thread Dimitri John Ledkov

On 11 September 2015 at 13:47, Andre Przywara <andre.przyw...@arm.com> wrote:
> Hi Dimitri,
>
> thanks for sharing this patch and sorry for the delay.

No worries, I have a few more patches to send, polishing them for release.

>
> (CC:ing Will)
>
> On 04/09/15 13:04, Dimitri John Ledkov wrote:
>> If one typically only boots full disk-images, one wouldn't necessaraly
>> want to statically link glibc, for the guest-init feature of the
>> kvmtool. As statically linked glibc triggers haevy security
>> maintainance.
>
> I like the idea of making guest-init optional, and actually was bitten
> by this annoying static libc requirement once before.
> Some comments below:
>

\o/

>>
>> Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
>> ---
>>  Makefile| 11 ++-
>>  builtin-run.c   |  7 +++
>>  builtin-setup.c |  7 +++
>>  3 files changed, 20 insertions(+), 5 deletions(-)
>>
>> diff --git a/Makefile b/Makefile
>> index 1534e6f..42a629a 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -34,8 +34,6 @@ bindir_SQ = $(subst ','\'',$(bindir))
>>  PROGRAM  := lkvm
>>  PROGRAM_ALIAS := vm
>>
>> -GUEST_INIT := guest/init
>> -
>>  OBJS += builtin-balloon.o
>>  OBJS += builtin-debug.o
>>  OBJS += builtin-help.o
>> @@ -279,8 +277,12 @@ ifeq ($(LTO),1)
>>   endif
>>  endif
>>
>> -ifneq ($(call try-build,$(SOURCE_STATIC),,-static),y)
>> -$(error No static libc found. Please install glibc-static package.)
>> +ifeq ($(call try-build,$(SOURCE_STATIC),,-static),y)
>> + CFLAGS  += -DCONFIG_HAS_LIBC
>
> The name CONFIG_HAS_LIBC seems a bit misleading to me, so at least this
> symbol should read CONFIG_HAS_STATIC_LIBC. But I'd prefer to have it
> named after it's user instead: CONFIG_GUEST_INIT (or the like), since
> this is what it protects in the code.
>

OK, sounds good. I am bad at naming things =) this looks good.


>> + GUEST_INIT := guest/init
>> + GUEST_OBJS = guest/guest_init.o
>> +else
>> + NOTFOUND+= static-libc
>>  endif
>>
>>  ifeq (y,$(ARCH_WANT_LIBFDT))
>> @@ -356,7 +358,6 @@ c_flags   = -Wp,-MD,$(depfile) $(CFLAGS)
>>  # $(OTHEROBJS) are things that do not get substituted like this.
>>  #
>>  STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
>> -GUEST_OBJS = guest/guest_init.o
>>
>>  $(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
>>   $(E) "  LINK" $@
>> diff --git a/builtin-run.c b/builtin-run.c
>> index 1ee75ad..0f67471 100644
>> --- a/builtin-run.c
>> +++ b/builtin-run.c
>> @@ -59,8 +59,13 @@ static int  kvm_run_wrapper;
>>
>>  bool do_debug_print = false;
>>
>> +#ifdef CONFIG_HAS_LIBC
>>  extern char _binary_guest_init_start;
>>  extern char _binary_guest_init_size;
>> +#else
>> +static char _binary_guest_init_start=0;
>> +static char _binary_guest_init_size=0;
>> +#endif
>>
>>  static const char * const run_usage[] = {
>>   "lkvm run [] []",
>> @@ -354,6 +359,8 @@ static int kvm_setup_guest_init(struct kvm *kvm)
>>   char *data;
>>
>>   /* Setup /virt/init */
>> + if (!_binary_guest_init_size)
>> + die("Guest init not compiled");
>
> I wonder if comparing with 0 is safe in every case. I appreciate not
> spoiling the code with #ifdefs, but putting one around here seems
> cleaner to me (especially if you look at the error message).

Ok, I can put the #ifdef here as well. Note that the non-extern
declaration will still be needed in the code above, as otherwise the
build fails to link without static-libc.

>
>>   size = (size_t)&_binary_guest_init_size;
>>   data = (char *)&_binary_guest_init_start;
>>   snprintf(tmp, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), rootfs);
>> diff --git a/builtin-setup.c b/builtin-setup.c
>> index 8b45c56..d77e5e0 100644
>> --- a/builtin-setup.c
>> +++ b/builtin-setup.c
>> @@ -16,8 +16,13 @@
>>  #include 
>>  #include 
>>
>> +#ifdef CONFIG_HAS_LIBC
>>  extern char _binary_guest_init_start;
>>  extern char _binary_guest_init_size;
>> +#else
>> +static char _binary_guest_init_start=0;
>> +static char _binary_guest_init_size=0;
>> +#endif
>>
>>  static const char *instance_name;
>>
>> @@ -131,6 +136,8 @@ static int copy_init(const char *guestfs_name)
>>   int fd, ret;
>>   char *data;
>>
>> + if (!_binary_guest_init_size)
>> + die("Guest init not compiled");
>
> Same as above.

Ack.

>
> Cheers,
> Andre.
>
>>   size = (size_t)&_binary_guest_init_size;
>>   data = (char *)&_binary_guest_init_start;
>>   snprintf(path, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), 
>> guestfs_name);
>>

-- 
Regards,

Dimitri.
Pura Vida!

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 kvmtool] Make static libc and guest-init functionality optional.

2015-09-11 Thread Dimitri John Ledkov

If one typically only boots full disk-images, one wouldn't necessaraly
want to statically link glibc, for the guest-init feature of the
kvmtool. As statically linked glibc triggers haevy security
maintainance.

Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
---
 Changes since v1:
 - rename CONFIG_HAS_LIBC to CONFIG_GUEST_INIT for clarity
 - use more ifdefs, instead of runtime check of _binary_guest_init_size==0
 
 Makefile| 11 ++-
 builtin-run.c   |  6 ++
 builtin-setup.c |  6 ++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/Makefile b/Makefile
index 1534e6f..bc6059c 100644
--- a/Makefile
+++ b/Makefile
@@ -34,8 +34,6 @@ bindir_SQ = $(subst ','\'',$(bindir))
 PROGRAM:= lkvm
 PROGRAM_ALIAS := vm
 
-GUEST_INIT := guest/init
-
 OBJS   += builtin-balloon.o
 OBJS   += builtin-debug.o
 OBJS   += builtin-help.o
@@ -279,8 +277,12 @@ ifeq ($(LTO),1)
endif
 endif
 
-ifneq ($(call try-build,$(SOURCE_STATIC),,-static),y)
-$(error No static libc found. Please install glibc-static package.)
+ifeq ($(call try-build,$(SOURCE_STATIC),,-static),y)
+   CFLAGS  += -DCONFIG_GUEST_INIT
+   GUEST_INIT := guest/init
+   GUEST_OBJS = guest/guest_init.o
+else
+   NOTFOUND+= static-libc
 endif
 
 ifeq (y,$(ARCH_WANT_LIBFDT))
@@ -356,7 +358,6 @@ c_flags = -Wp,-MD,$(depfile) $(CFLAGS)
 # $(OTHEROBJS) are things that do not get substituted like this.
 #
 STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
-GUEST_OBJS = guest/guest_init.o
 
 $(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
$(E) "  LINK" $@
diff --git a/builtin-run.c b/builtin-run.c
index 1ee75ad..e27acd6 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -59,8 +59,10 @@ static int  kvm_run_wrapper;
 
 bool do_debug_print = false;
 
+#ifdef CONFIG_GUEST_INIT
 extern char _binary_guest_init_start;
 extern char _binary_guest_init_size;
+#endif
 
 static const char * const run_usage[] = {
"lkvm run [] []",
@@ -347,6 +349,7 @@ void kvm_run_help(void)
 
 static int kvm_setup_guest_init(struct kvm *kvm)
 {
+#ifdef CONFIG_GUEST_INIT
const char *rootfs = kvm->cfg.custom_rootfs_name;
char tmp[PATH_MAX];
size_t size;
@@ -367,6 +370,9 @@ static int kvm_setup_guest_init(struct kvm *kvm)
close(fd);
 
return 0;
+#else
+   die("Guest init not compiled");
+#endif
 }
 
 static int kvm_run_set_sandbox(struct kvm *kvm)
diff --git a/builtin-setup.c b/builtin-setup.c
index 8b45c56..ff796c3 100644
--- a/builtin-setup.c
+++ b/builtin-setup.c
@@ -16,8 +16,10 @@
 #include 
 #include 
 
+#ifdef CONFIG_GUEST_INIT
 extern char _binary_guest_init_start;
 extern char _binary_guest_init_size;
+#endif
 
 static const char *instance_name;
 
@@ -126,6 +128,7 @@ static const char *guestfs_symlinks[] = {
 
 static int copy_init(const char *guestfs_name)
 {
+#ifdef CONFIG_GUEST_INIT
char path[PATH_MAX];
size_t size;
int fd, ret;
@@ -144,6 +147,9 @@ static int copy_init(const char *guestfs_name)
close(fd);
 
return 0;
+#else
+   die("Guest init not compiled");
+#endif
 }
 
 static int copy_passwd(const char *guestfs_name)
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH kvmtool] Make static libc and guest-init functionality optional.

2015-09-04 Thread Dimitri John Ledkov

If one typically only boots full disk-images, one wouldn't necessaraly
want to statically link glibc, for the guest-init feature of the
kvmtool. As statically linked glibc triggers haevy security
maintainance.

Signed-off-by: Dimitri John Ledkov <dimitri.j.led...@intel.com>
---
 Makefile| 11 ++-
 builtin-run.c   |  7 +++
 builtin-setup.c |  7 +++
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/Makefile b/Makefile
index 1534e6f..42a629a 100644
--- a/Makefile
+++ b/Makefile
@@ -34,8 +34,6 @@ bindir_SQ = $(subst ','\'',$(bindir))
 PROGRAM:= lkvm
 PROGRAM_ALIAS := vm
 
-GUEST_INIT := guest/init
-
 OBJS   += builtin-balloon.o
 OBJS   += builtin-debug.o
 OBJS   += builtin-help.o
@@ -279,8 +277,12 @@ ifeq ($(LTO),1)
endif
 endif
 
-ifneq ($(call try-build,$(SOURCE_STATIC),,-static),y)
-$(error No static libc found. Please install glibc-static package.)
+ifeq ($(call try-build,$(SOURCE_STATIC),,-static),y)
+   CFLAGS  += -DCONFIG_HAS_LIBC
+   GUEST_INIT := guest/init
+   GUEST_OBJS = guest/guest_init.o
+else
+   NOTFOUND+= static-libc
 endif
 
 ifeq (y,$(ARCH_WANT_LIBFDT))
@@ -356,7 +358,6 @@ c_flags = -Wp,-MD,$(depfile) $(CFLAGS)
 # $(OTHEROBJS) are things that do not get substituted like this.
 #
 STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
-GUEST_OBJS = guest/guest_init.o
 
 $(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
$(E) "  LINK" $@
diff --git a/builtin-run.c b/builtin-run.c
index 1ee75ad..0f67471 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -59,8 +59,13 @@ static int  kvm_run_wrapper;
 
 bool do_debug_print = false;
 
+#ifdef CONFIG_HAS_LIBC
 extern char _binary_guest_init_start;
 extern char _binary_guest_init_size;
+#else
+static char _binary_guest_init_start=0;
+static char _binary_guest_init_size=0;
+#endif
 
 static const char * const run_usage[] = {
"lkvm run [] []",
@@ -354,6 +359,8 @@ static int kvm_setup_guest_init(struct kvm *kvm)
char *data;
 
/* Setup /virt/init */
+   if (!_binary_guest_init_size)
+   die("Guest init not compiled");
size = (size_t)&_binary_guest_init_size;
data = (char *)&_binary_guest_init_start;
snprintf(tmp, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), rootfs);
diff --git a/builtin-setup.c b/builtin-setup.c
index 8b45c56..d77e5e0 100644
--- a/builtin-setup.c
+++ b/builtin-setup.c
@@ -16,8 +16,13 @@
 #include 
 #include 
 
+#ifdef CONFIG_HAS_LIBC
 extern char _binary_guest_init_start;
 extern char _binary_guest_init_size;
+#else
+static char _binary_guest_init_start=0;
+static char _binary_guest_init_size=0;
+#endif
 
 static const char *instance_name;
 
@@ -131,6 +136,8 @@ static int copy_init(const char *guestfs_name)
int fd, ret;
char *data;
 
+   if (!_binary_guest_init_size)
+   die("Guest init not compiled");
size = (size_t)&_binary_guest_init_size;
data = (char *)&_binary_guest_init_start;
snprintf(path, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), 
guestfs_name);
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2015-06-24 Thread John Nielsen

On Jun 24, 2015, at 9:50 AM, Paolo Bonzini pbonz...@redhat.com wrote:

 On 23/06/2015 00:08, John Nielsen wrote:
 I’m resurrecting an old thread since I haven’t heard anything in a
 while. Has anyone looked in to the KVM+apicv bug documented above as
 well as here:
 
 https://bugs.launchpad.net/qemu/+bug/1329956 ?
 
 If appropriate, where should I go to file a KVM bug (since this isn’t
 really Qemu’s problem)?
 
 Hi John, does this happen with the latest upstream kernel version ?
 
 I know for sure it happens with 4.0.4 and I’m not aware of any newer changes 
 that would affect it.--
 To unsubscribe from this list: send the line unsubscribe kvm in
 
 Can you reproduce it with 10.1?
 
 I did this:
 
 1) download
 http://download.pcbsd.org/iso/10.1-RELEASE/amd64/PCBSD10.1.2-x64-trueos-server.raw.xz
 and unpack it
 
 2) run it with qemu-kvm -drive
 if=virtio,PCBSD10.1.2-x64-trueos-server.raw -smp 2
 
 3) login as root/pcbsd, type reboot
 
 I would like to know if I'm doing anything wrong.  My machine is a Xeon
 E5 v3 (Haswell).  My SeaBIOS build doesn't have the atkbd0 bug, but just
 to rule that out, can you send me your Seabios binary
 (/usr/share/qemu/bios*.bin) as well?

Interesting. Using the same PC-BSD image I am able to reproduce on a server 
running slightly older software but I can not reproduce running bleeding edge. 
I verified enable_apicv=Y on both. In both cases I ran
qemu-kvm -drive if=virtio,file=PCBSD10.1.2-x64-trueos-server.raw -smp 2 -vnc 
0.0.0.0:0

Specifically:

Breaks (VM hangs during boot after pressing ctrl-alt-del):
kernel 3.12.22
qemu-kvm-1.7.0-3.el6.x86_64
seabios-1.7.3.1-1.el6.noarch
Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz

Works (VM reboots normally):
kernel 4.0.4
qemu-kvm-2.3.0-6.el7.centos.x86_64
seabios-bin-1.8.1-1.el7.centos.noarch
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz


Unfortunately I no longer have the test environment I used a few days ago to 
reproduce this issue so I can’t verify the software versions that were in use. 
It’s possible I was mistaken about the kernel version (I thought it was 4.0.4). 
Perhaps it really is fixed in the newer kernel? In any case, this is great 
news! I would be interested in identifying the patch(es) that fixed the issue 
to make back-porting them easier, but I won’t have time to do a binary search 
anytime soon.

Thanks for looking in to this again. If anyone else is interested in 
identifying what specifically fixed the issue please let me know if I can do 
anything to help.

JN

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2015-06-22 Thread John Nielsen

On Jun 22, 2015, at 3:48 PM, Bandan Das b...@redhat.com wrote:

 John Nielsen li...@jnielsen.net writes:
 
 On Jun 17, 2014, at 10:48 AM, John Nielsen li...@jnielsen.net wrote:
 
 On Jun 17, 2014, at 12:05 AM, Gleb Natapov g...@kernel.org wrote:
 
 On Tue, Jun 17, 2014 at 06:21:23AM +0200, Paolo Bonzini wrote:
 Il 16/06/2014 18:47, John Nielsen ha scritto:
 On Jun 16, 2014, at 10:39 AM, Paolo Bonzini pbonz...@redhat.com
 wrote:
 
 Il 16/06/2014 18:09, John Nielsen ha scritto:
 The only substantial difference on the hardware side is the
 CPU.  The hosts where the problem occurs use Intel(R)
 Xeon(R) CPU E5-2650 v2 @ 2.60GHz, while the hosts that don't
 show the problem use the prior revision, Intel(R) Xeon(R)
 CPU E5-2650 0 @ 2.00GHz.
 Can you do grep . /sys/module/kvm_intel/parameters/* on both
 hosts please?
 No differences that I can see. Output below.
 Not really:
 
 Working host: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz # grep
 . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:N
 
 Problem host: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz # grep
 . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:Y
 So we have a clue.  Let me study the code more, I'll try to get
 back with a suggestion.
 Wow, can't believe I missed that. Good catch!
 
 Does disabling apicv on E5-2650 v2 make reboot problem go away?
 Yes it does!
 
 # modprobe kvm_intel /sys/module/kvm_intel/parameters/enable_apicv:Y
 # /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m
 512 -smp 2,sockets=1,cores=1,threads=2 -drive
 file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2
 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net
 none
 
 [problem occurs]
 
 # rmmod kvm_intel # modprobe kvm_intel enable_apicv=N
 /sys/module/kvm_intel/parameters/enable_apicv:N #
 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512
 -smp 2,sockets=1,cores=1,threads=2 -drive
 file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2
 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net
 none
 
 [problem does not occur]
 
 Thank you. This both narrows the problem considerably and provides
 an acceptable workaround. It would still be nice to see it fixed, of
 course. Keep me CC'ed as I'm not on the KVM list.
 
 I’m resurrecting an old thread since I haven’t heard anything in a
 while. Has anyone looked in to the KVM+apicv bug documented above as
 well as here:
 
 https://bugs.launchpad.net/qemu/+bug/1329956 ?
 
 If appropriate, where should I go to file a KVM bug (since this isn’t
 really Qemu’s problem)?
 
 Hi John, does this happen with the latest upstream kernel version ?

I know for sure it happens with 4.0.4 and I’m not aware of any newer changes 
that would affect it.--
To unsubscribe from this list: send the line unsubscribe kvm in

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2015-06-22 Thread John Nielsen

On Jun 17, 2014, at 10:48 AM, John Nielsen li...@jnielsen.net wrote:

 On Jun 17, 2014, at 12:05 AM, Gleb Natapov g...@kernel.org wrote:
 
 On Tue, Jun 17, 2014 at 06:21:23AM +0200, Paolo Bonzini wrote:
 Il 16/06/2014 18:47, John Nielsen ha scritto:
 On Jun 16, 2014, at 10:39 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 
 Il 16/06/2014 18:09, John Nielsen ha scritto:
 The only substantial difference on the hardware side is the CPU.
 The hosts where the problem occurs use Intel(R) Xeon(R) CPU
 E5-2650 v2 @ 2.60GHz, while the hosts that don't show the
 problem use the prior revision, Intel(R) Xeon(R) CPU E5-2650 0 @
 2.00GHz.
 
 Can you do grep . /sys/module/kvm_intel/parameters/* on both hosts 
 please?
 
 No differences that I can see. Output below.
 
 Not really:
 
 Working host:
 Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
 # grep . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:N
 
 Problem host:
 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
 # grep . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:Y
 
 So we have a clue.  Let me study the code more, I'll try to get back with a
 suggestion.
 
 Wow, can't believe I missed that. Good catch!
 
 Does disabling apicv on E5-2650 v2 make reboot problem go away?
 
 Yes it does!
 
 # modprobe kvm_intel
 /sys/module/kvm_intel/parameters/enable_apicv:Y
 # /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
 2,sockets=1,cores=1,threads=2 -drive 
 file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none
 
 [problem occurs]
 
 # rmmod kvm_intel
 # modprobe kvm_intel enable_apicv=N
 /sys/module/kvm_intel/parameters/enable_apicv:N
 # /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
 2,sockets=1,cores=1,threads=2 -drive 
 file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none
 
 [problem does not occur]
 
 Thank you. This both narrows the problem considerably and provides an 
 acceptable workaround. It would still be nice to see it fixed, of course. 
 Keep me CC'ed as I'm not on the KVM list.

I’m resurrecting an old thread since I haven’t heard anything in a while. Has 
anyone looked in to the KVM+apicv bug documented above as well as here:

https://bugs.launchpad.net/qemu/+bug/1329956 ?

If appropriate, where should I go to file a KVM bug (since this isn’t really 
Qemu’s problem)?

Thanks,

JN--
To unsubscribe from this list: send the line unsubscribe kvm in

[no subject]

2015-01-22 Thread Mr John Wong




i need your assistance in transferring some funds

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Infinite IRQ injection loop in QEMU

2014-10-22 Thread John Snow


Hello all;

I've been working on improving the AHCI device emulation for QEMU but 
have recently run into an issue where Windows 8 guests -- upon trying to 
resume from hibernation -- manage to trigger an infinite IRQ injection 
loop where it seems that the IRQ never quite properly gets cleared.


I am still working on troubleshooting it further, but I wanted to see if 
anyone had advice or experience with this type of issue.


In a nutshell:
- Windows 8 boots up inside of QEMU/KVM
- Windows 8 is suspended to disk either via shut down or explicit 
hibernate. QEMU exits.

- Windows 8 is resumed
- Windows 8 resets the AHCI device and begins re-initializing it
- Once the active AHCI port is reset, it issues an interrupt to indicate 
it has a pending message (set of register values) ready for the host to 
synchronize state with the HBA. This interrupt appears to be legacy PCI 
and not MSI.

- This triggers an infinite injection loop.

Here are some characteristic traces from perf record, grabbing 
kvm-related entries with user space traces.


Here's where the interrupt first appears to become stuck, showing when 
it is set: http://pastebin.com/KPevxCw2


It looks like pin #16, vec=177. All activity in the guest and QEMU now 
apparently ceases, and then the perf script shows many, many loops which 
look like the following: http://pastebin.com/qYh9035y


which repeats over-and-over. It does not appear that QEMU is re-setting 
the IRQ, and there are no further calls from the guest into ICH9 or AHCI 
related code to set/unset any device registers.


In talking with Stefan, we think that the irr bit is possibly not 
getting cleared (or getting set again?) after the EOI (see the first 
paste) -- does anyone have experience with debugging this type of issue, 
or have some hints about what may be happening?


Thanks in advance for any advice.
--John S.

(As a post-script: the kernel I am using is the version provided by 
David Airlie for MST [Multi-Stream Transport] support in Linux, which is 
still experimental. Sorry for the non-stock kernel! 
http://airlied.livejournal.com/79657.html)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge

2014-09-04 Thread John Stultz

On Thu, Sep 4, 2014 at 9:00 AM, Chris J Arges
chris.j.ar...@canonical.com wrote:


 On 09/04/2014 07:58 AM, Paolo Bonzini wrote:
 Commit cbcf2dd3b3d4 (x86: kvm: Make kvm_get_time_and_clockread() nanoseconds
 based, 2014-07-16) forgot to add tk-xtime_sec, thus breaking kvmclock on
 hosts that have a reliable TSC.  Add it back; and since the field boot_ns
 is not anymore related to the host boot-based clock, rename 
 boot_ns-nsec_base
 and the existing nsec_base-snsec_base.

 Cc: Thomas Gleixner t...@linutronix.de
 Cc: John Stultz john.stu...@linaro.org
 Reported-by: Chris J Arges chris.j.ar...@canonical.com
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  arch/x86/kvm/x86.c | 11 ++-
  1 file changed, 6 insertions(+), 5 deletions(-)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 8f1e22d3b286..92493e10937c 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1020,8 +1020,8 @@ struct pvclock_gtod_data {
   u32 shift;
   } clock;

 - u64 boot_ns;
   u64 nsec_base;
 + u64 snsec_base;
  };

  static struct pvclock_gtod_data pvclock_gtod_data;
 @@ -1042,8 +1042,9 @@ static void update_pvclock_gtod(struct timekeeper *tk)
   vdata-clock.mult   = tk-tkr.mult;
   vdata-clock.shift  = tk-tkr.shift;

 - vdata-boot_ns  = boot_ns;
 - vdata-nsec_base= tk-tkr.xtime_nsec;
 + vdata-nsec_base= tk-xtime_sec * (u64)NSEC_PER_SEC
 + + boot_ns;
 + vdata-snsec_base   = tk-tkr.xtime_nsec;

   write_seqcount_end(vdata-seq);
  }
 @@ -1413,10 +1414,10 @@ static int do_monotonic_boot(s64 *t, cycle_t 
 *cycle_now)
   do {
   seq = read_seqcount_begin(gtod-seq);
   mode = gtod-clock.vclock_mode;
 - ns = gtod-nsec_base;
 + ns = gtod-snsec_base;
   ns += vgettsc(cycle_now);
   ns = gtod-clock.shift;
 - ns += gtod-boot_ns;
 + ns += gtod-nsec_base;
   } while (unlikely(read_seqcount_retry(gtod-seq, seq)));
   *t = ns;



 Paulo,
 I've tested with the above patch and I still have issues with the
 kvmclock test offset; however the cycle tests pass now.

 Here is trace data:
 http://people.canonical.com/~arges/kvm/trace-4.dat.xz

 Uptime:
  15:58:02 up  1:00,  1 user,  load average: 0.59, 0.60, 0.31

 Here is the output:

 ./x86-run x86/kvmclock_test.flat -smp 2 --append 1000 `date +%s`
 qemu-system-x86_64 -enable-kvm -device pc-testdev -device
 isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
 -device pci-testdev -kernel x86/kvmclock_test.flat -smp 2 --append
 1000 1409846210
 enabling apic
 enabling apic
 kvm-clock: cpu 0, msr 0x:44d4c0
 kvm-clock: cpu 0, msr 0x:44d4c0
 Wallclock test, threshold 5
 Seconds get from host: 1409846210
 Seconds get from kvmclock: 2819688866
 Offset:1409842656
 offset too large!


Hey, thanks for reporting the issue and sending an initial patch (even
if its not quite all sorted yet).

Is the test you're using here available somewhere? Are there any
special requirements to run it?

thanks
-john
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v6 2/7] random, timekeeping: Collect timekeeping entropy in the timekeeping code

2014-08-20 Thread John Stultz

On Thu, Aug 14, 2014 at 12:43 AM, Andy Lutomirski l...@amacapital.net wrote:
 Currently, init_std_data calls ktime_get_real().  This imposes
 awkward constraints on when init_std_data can be called, and
 init_std_data is unlikely to collect the full unpredictable data
 available to the timekeeping code, especially after resume.

 Remove this code from random.c and add the appropriate
 add_device_randomness calls to timekeeping.c instead.

 Cc: John Stultz john.stu...@linaro.org
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  drivers/char/random.c |  2 --
  kernel/time/timekeeping.c | 11 +++
  2 files changed, 11 insertions(+), 2 deletions(-)

 diff --git a/drivers/char/random.c b/drivers/char/random.c
 index 7673e60..8dc3e3a 100644
 --- a/drivers/char/random.c
 +++ b/drivers/char/random.c
 @@ -1263,12 +1263,10 @@ static void seed_entropy_store(void *ctx, u32 data)
  static void init_std_data(struct entropy_store *r)
  {
 int i;
 -   ktime_t now = ktime_get_real();
 unsigned long rv;
 char log_prefix[128];

 r-last_pulled = jiffies;
 -   mix_pool_bytes(r, now, sizeof(now), NULL);
 for (i = r-poolinfo-poolbytes; i  0; i -= sizeof(rv)) {
 rv = random_get_entropy();
 mix_pool_bytes(r, rv, sizeof(rv), NULL);
 diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
 index 32d8d6a..9609db9 100644
 --- a/kernel/time/timekeeping.c
 +++ b/kernel/time/timekeeping.c
 @@ -23,6 +23,7 @@
  #include linux/stop_machine.h
  #include linux/pvclock_gtod.h
  #include linux/compiler.h
 +#include linux/random.h

  #include tick-internal.h
  #include ntp_internal.h
 @@ -835,6 +836,9 @@ void __init timekeeping_init(void)
 memcpy(shadow_timekeeper, timekeeper, sizeof(timekeeper));

 write_seqcount_end(timekeeper_seq);
 +
 +   add_device_randomness(tk, sizeof(tk));
 +


So I can't (and really don't want to) vouch for the correctness side
of this. The initial idea of using the structure instead of reading
the time worried me a bit, but we have already read the clocksource
and stored it in cycle_last so there's a wee bit more then just the
RTC time and a bunch of zeros in the timekeeper structure.

Though on some systems the read_persistent_clock call can't access the
RTC at timekeeping_init, so I'm not sure we're really getting that
much more then the cycle_last clocksource value here. Probably should
add something like this to the RTC hctosys logic.

thanks
-john
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2014-06-20 Thread John Nielsen

On Jun 16, 2014, at 10:21 PM, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 16/06/2014 18:47, John Nielsen ha scritto:
 On Jun 16, 2014, at 10:39 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 
 Il 16/06/2014 18:09, John Nielsen ha scritto:
 The only substantial difference on the hardware side is the CPU.
 The hosts where the problem occurs use Intel(R) Xeon(R) CPU
 E5-2650 v2 @ 2.60GHz, while the hosts that don't show the
 problem use the prior revision, Intel(R) Xeon(R) CPU E5-2650 0 @
 2.00GHz.
 
 Can you do grep . /sys/module/kvm_intel/parameters/* on both hosts please?
 
 No differences that I can see. Output below.
 
 Not really:
 
 Working host:
 Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
 # grep . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:N
 
 Problem host:
 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
 # grep . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:Y
 
 So we have a clue.  Let me study the code more, I'll try to get back with a 
 suggestion.

Paolo, have you had an opportunity to look in to this some more?

Thanks,

JN

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2014-06-20 Thread John Nielsen

On Jun 20, 2014, at 1:53 PM, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 20/06/2014 17:41, John Nielsen ha scritto:
 
  So we have a clue.  Let me study the code more, I'll try to get back with 
  a suggestion.
 Paolo, have you had an opportunity to look in to this some more?
 
 Not yet, sorry.
 
 One possibility is this though.  Can you try migrating (or saving/restoring) 
 the guest when it's hung, and see if it resuscitates?

The guest is still hung after a save/restore.

# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
2,sockets=1,cores=1,threads=2 -drive 
file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
-device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none -monitor 
stdio
QEMU 2.0.50 monitor - type 'help' for more information
(qemu) stop
(qemu) savevm smphang
(qemu) q
# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
2,sockets=1,cores=1,threads=2 -drive 
file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
-device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none -monitor 
stdio -loadvm smphang
QEMU 2.0.50 monitor - type 'help' for more information
(qemu) 

[The VNC console shows the same hung kernel screen as when I ran savevm]

JN

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2014-06-17 Thread John Nielsen

On Jun 17, 2014, at 12:05 AM, Gleb Natapov g...@kernel.org wrote:

 On Tue, Jun 17, 2014 at 06:21:23AM +0200, Paolo Bonzini wrote:
 Il 16/06/2014 18:47, John Nielsen ha scritto:
 On Jun 16, 2014, at 10:39 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 
 Il 16/06/2014 18:09, John Nielsen ha scritto:
 The only substantial difference on the hardware side is the CPU.
 The hosts where the problem occurs use Intel(R) Xeon(R) CPU
 E5-2650 v2 @ 2.60GHz, while the hosts that don't show the
 problem use the prior revision, Intel(R) Xeon(R) CPU E5-2650 0 @
 2.00GHz.
 
 Can you do grep . /sys/module/kvm_intel/parameters/* on both hosts 
 please?
 
 No differences that I can see. Output below.
 
 Not really:
 
 Working host:
 Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
 # grep . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:N
 
 Problem host:
 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
 # grep . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:Y
 
 So we have a clue.  Let me study the code more, I'll try to get back with a
 suggestion.

Wow, can't believe I missed that. Good catch!

 Does disabling apicv on E5-2650 v2 make reboot problem go away?

Yes it does!

# modprobe kvm_intel
/sys/module/kvm_intel/parameters/enable_apicv:Y
# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
2,sockets=1,cores=1,threads=2 -drive 
file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
-device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none

[problem occurs]

# rmmod kvm_intel
# modprobe kvm_intel enable_apicv=N
/sys/module/kvm_intel/parameters/enable_apicv:N
# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
2,sockets=1,cores=1,threads=2 -drive 
file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
-device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none

[problem does not occur]

Thank you. This both narrows the problem considerably and provides an 
acceptable workaround. It would still be nice to see it fixed, of course. Keep 
me CC'ed as I'm not on the KVM list.

JN

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

VM outperforming host

2013-12-29 Thread John Paul Walters

HI,

I’ve been benchmarking of several GPU-enabled applications on both physical 
hardware and within KVM.  To my surprise, I’ve found a small subset of 
benchmarks that are able to outperform the host system by as much as 15% in 
some cases, and I’m hoping that someone may be able to offer some insight into 
what might be the cause, or where to start looking. The system in question:

Host:
* Arch Linux with 3.12 kernel 
* qemu 1.7 
* 2 Xeon E5-2670 (total of 16 cores), 48 GB RAM (split evenly over 2 NUMA 
nodes) 
* 1 NVIDIA K20m GPU, with gigabit ethernet networking  
* 10Gbe and Infiniband adapters, but neither are in use

Guest:
* CentOS 6.4 with 2.6.32-358.23.2 kernel
* 20 GB RAM and 8 physical cores from NUMA node 0
* default networking
* K20m GPU using PCIe passthrough

The qemu command line:
qemu-system-x86_64 -enable-kvm -M q35 -m 20576 -cpu host -smp 
8,sockets=1,cores=8,threads=1 -device ahci,bus=pcie.0,id=ahci -bios 
/usr/share/qemu/bios.bin -drive 
file=/root/centos_6.4/centos_flat.img,id=disk,format=raw -device 
ide-hd,bus=ahci.0,drive=disk -vnc 0.0.0.0:1 -redir tcp:52109::22 -device 
pci-assign,host=08:00.0

I’ve ensured that the VM runs entirely within a single NUMA node by creating a 
cpuset with the appropriate physical cores and memory nodes.  I’ve done the 
same for the host system tests.  I’ve also loaded the host system with CentOS 
6.4 and rerun the same experiments, hoping that this issue was related to the 
host system kernel or Arch Linux.  It wasn’t.  

So far, I’ve tried disabling unused PCIe devices on the host, hoping that doing 
so would speed up the host side experiments, but it didn’t.  I’ve disabled 
transparent huge pages after I noticed that the VM memory appears to be backed 
by them.  This reduced the performance of the guest slightly, but did not come 
close to canceling out the performance gains of the VM.  I’ve experimented with 
several combinations of NUMA-related scheduler options, with virtually no 
effect.  Drivers and libraries are identical between host and guest. Does 
anyone have any suggestions for tracking down either where I’m losing 
performance on the host, or gaining performance in the VM?

thanks for any help or suggestions,
JP--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/2] make mac programming for virtio net more robust

2013-01-11 Thread John Fastabend


On 1/10/2013 11:46 PM, Michael S. Tsirkin wrote:

On Fri, Jan 11, 2013 at 12:53:07PM +1030, Rusty Russell wrote:

Michael S. Tsirkin m...@redhat.com writes:


On Thu, Jan 10, 2013 at 10:45:39PM +0800, ak...@redhat.com wrote:

From: Amos Kong ak...@redhat.com

Currenly mac is programmed byte by byte. This means that we
have an intermediate step where mac is wrong.

Second patch introduced a new vq control command to set mac
address in one time.


As you mention we could alternatively do it without
new commands, simply add a feature bit that says that MACs are
in the mac table.
This would be a much bigger patch, and I'm fine with either way.
Rusty what do you think?


Hmm, mac filtering and my mac address are not quite the same thing.  I
don't know if it matters for anyone: does it?
The mac address is abused
for things like identifying machines, etc.


I don't know either. I think net core differentiates between mac and
uc_list because linux has to know which mac to use when building
up packets, so at some level, I agree it might be useful to identify the
machine.

BTW netdev/davem should have been copied on this, Amos I think it's a
good idea to remember to do it next time you post.



If we keep it as a separate concept, Amos' patch seems to make sense.


Yes. It also keeps the patch small, I just thought I'd mention the
option.



Cheers,
Rusty.




Don't have the entire context here but if you implement the
ndo_fdb_dump() probably hooking it up to ndo_dflt_fdb_dump() you could
use the 'bridge' tool dump the uc_list.

Then use ndo_fdb_add() and ndo_fdb_del() to add and remove entries
from the uc_list. We do this today in macvlan and the ixgbe driver when
it is in SR-IOV mode and the embedded switch needs to be programmed.

fdb is forwarding database its a bit different then mac filtering
in that its telling the switch how to forward mac addresses, in
ixgbe and macvlan at least we have been overloading it a bit to also
stop filtering the mac address. I think this makes sense if you setup
forwarding to a port it doesn't make much sense to then drop them.

Maybe its not entirely applicable here just thought I would mention it.

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 13/18] time: export time information for KVM pvclock

2012-11-14 Thread John Stultz


On 11/14/2012 04:08 PM, Marcelo Tosatti wrote:

As suggested by John, export time data similarly to how its
done by vsyscall support. This allows KVM to retrieve necessary
information to implement vsyscall support in KVM guests.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Thanks for the updates here.  The notifier method is interesting, and if 
it works well, we may want to extend it later to cover the vsyscall code 
too, but that can be done in a later iteration.


Acked-by: John Stultz johns...@us.ibm.com

thanks
-john

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 14/18] time: export time information for KVM pvclock

2012-11-09 Thread John Stultz


On 10/24/2012 06:13 AM, Marcelo Tosatti wrote:

As suggested by John, export time data similarly to how its
done by vsyscall support. This allows KVM to retrieve necessary
information to implement vsyscall support in KVM guests.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Thanks Marcelo, I like this much better then what you were proposing 
privately earlier!


Fairly minor nit below.


Index: vsyscall/kernel/time/timekeeping.c
===
--- vsyscall.orig/kernel/time/timekeeping.c
+++ vsyscall/kernel/time/timekeeping.c
@@ -21,6 +21,7 @@
  #include linux/time.h
  #include linux/tick.h
  #include linux/stop_machine.h
+#include linux/pvclock_gtod.h


  static struct timekeeper timekeeper;
@@ -180,6 +181,79 @@ static inline s64 timekeeping_get_ns_raw
return nsec + arch_gettimeoffset();
  }

+static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
+
+/**
+ * pvclock_gtod_register_notifier - register a pvclock timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_register_notifier(struct notifier_block *nb)
+{
+   struct timekeeper *tk = timekeeper;
+   unsigned long flags;
+   int ret;
+
+   write_seqlock_irqsave(tk-lock, flags);
+   ret = raw_notifier_chain_register(pvclock_gtod_chain, nb);
+   write_sequnlock_irqrestore(tk-lock, flags);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
+
+/**
+ * pvclock_gtod_unregister_notifier - unregister a pvclock
+ * timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
+{
+   struct timekeeper *tk = timekeeper;
+   unsigned long flags;
+   int ret;
+
+   write_seqlock_irqsave(tk-lock, flags);
+   ret = raw_notifier_chain_unregister(pvclock_gtod_chain, nb);
+   write_sequnlock_irqrestore(tk-lock, flags);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
+
+struct pvclock_gtod_data pvclock_gtod_data;
+EXPORT_SYMBOL_GPL(pvclock_gtod_data);
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+   struct pvclock_gtod_data *vdata = pvclock_gtod_data;
+
+   write_seqcount_begin(vdata-seq);
+
+   /* copy pvclock gtod data */
+   vdata-clock.vclock_mode = tk-clock-archdata.vclock_mode;
+   vdata-clock.cycle_last  = tk-clock-cycle_last;
+   vdata-clock.mask= tk-clock-mask;
+   vdata-clock.mult= tk-mult;
+   vdata-clock.shift   = tk-shift;
+
+   vdata-monotonic_time_sec= tk-xtime_sec
+   + tk-wall_to_monotonic.tv_sec;
+   vdata-monotonic_time_snsec  = tk-xtime_nsec
+   + (tk-wall_to_monotonic.tv_nsec
+tk-shift);
+   while (vdata-monotonic_time_snsec =
+   (((u64)NSEC_PER_SEC)  tk-shift)) {
+   vdata-monotonic_time_snsec -=
+   ((u64)NSEC_PER_SEC)  tk-shift;
+   vdata-monotonic_time_sec++;
+   }
+
+   write_seqcount_end(vdata-seq);
+   raw_notifier_call_chain(pvclock_gtod_chain, 0, NULL);
+}
+


My only request is could the update_pvclock_gtod() be implemented 
similarly to the update_vsyscall, where the update function lives in the 
pvclock code (maybe using a weak symbol or something) so we don't have 
to have all these pvclock details in the timekeeping core?


thanks
-john


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=10002 jiffies)

2012-10-08 Thread John Stultz


On 09/30/2012 04:59 AM, Fengguang Wu wrote:

On Sun, Sep 30, 2012 at 01:32:46PM +0200, Avi Kivity wrote:

On 09/30/2012 01:23 PM, Fengguang Wu wrote:

On Sun, Sep 30, 2012 at 01:10:55PM +0200, Avi Kivity wrote:

On 09/28/2012 05:35 AM, Paul E. McKenney wrote:

On Thu, Sep 27, 2012 at 12:40:44PM +0800, Fengguang Wu wrote:

On Wed, Sep 26, 2012 at 09:28:50PM -0700, Paul E. McKenney wrote:

On Thu, Sep 27, 2012 at 10:54:00AM +0800, Fengguang Wu wrote:

On Wed, Sep 26, 2012 at 09:45:43AM -0700, Paul E. McKenney wrote:

On Wed, Sep 26, 2012 at 04:15:01PM +0800, Fengguang Wu wrote:

[ . . . ]


But could you also please send your .config file and a description of

.config attached.


the workload you are running?

It's basically the below commands. The exact initrd is not relevant in
this case because it's a boot time warning before user space is
started. The stalls roughly happen 1 time on every 10 boots.

Yow!!!

You have severe cross-CPU time-synchronization problems.  See for
example the first dmesg, with the relevant part extracted right here.
One CPU believes that it is about 37 seconds past boot, and the other
CPU beleives that it is about 137 seconds past boot.  Given that large
of a time difference, an RCU CPU stall warning is expected behavior.

Good spot! Yeah I noticed that huge timestamp gap, however didn't take
it seriously enough..


Get your two CPUs in agreement about what time it is, and I bet that
the CPU stall warnings will go away.

Possibly KVM related? Because the warnings show up in many test boxes
running KVM and so is not likely some hardware specific issue.

I vaguely recall seeing something recently.  But let's ask the KVM and
timekeeping guys.

From the logs it looks like hpet (why not kvmclock?) is used for the
clock, it should not generate such drifts since it is a global clock.
Can you verify current_clocksource on a boot that actually failed (in
case the clocksource is switched during runtime)?

I've checked out the dmesg that's cited by Paul, attached. Yes it
contains lines

[4.970051] Switching to clocksource hpet

and then

[7.250353] Switching to clocksource tsc

And there is no kvm-clock lines. Oh well for this particular kernel:


Ah, tsc will certainly break on kvm if the hardware doesn't provide a
constant tsc source.  I'm surprised the guest kernel didn't detect it
and switch back to hpet though.

Thanks, it's good to know the root cause. All the dmesgs show the same hpet+tsc
switching pattern (and never switch back):

$ grep Switching dmesg-kvm_bisect2-inn-*21
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-37-51-3.6.0-rc7-bisect2-00078-g593d100-21:[
4.111415] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-37-51-3.6.0-rc7-bisect2-00078-g593d100-21:[
6.550098] Switching to clocksource tsc


Is this still an open issue? Fengguang's mail sounds like its resolved, 
but I'm not sure it is.


The switching from HPET - TSC  I believe is expected, as the refined 
calibration will delay the TSC from being registered for a few seconds.  
However, its unclear why the TSC, if it is faulty, isn't being caught 
and demoted by the clocksource watchdog.


I'm also curious why this originally bisected down to 
06ae115a1d551cd952d8  (when using the kvm clock) if it was more of a 
hardware issue. And in those logs, I don't see the printk time-stamp 
inconsistencies that were alluded to in this thread.


Fengguang: Is this still reproducible? Do you have any details (dmesg) 
about host system as well?


thanks
-john

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net

2012-06-25 Thread John Fastabend


On 6/25/2012 3:07 AM, Michael S. Tsirkin wrote:

On Mon, Jun 25, 2012 at 05:16:48PM +0800, Jason Wang wrote:

Hello All:

This series is an update version of multiqueue virtio-net driver based on
Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
packets reception and transmission. Please review and comments.

Test Environment:
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
- Two directed connected 82599

Test Summary:

- Highlights: huge improvements on TCP_RR test
- Lowlights: regression on small packet transmission, higher cpu utilization
  than single queue, need further optimization


Didn't review yet, reacting this this paragraph:

To avoid regressions, it seems reasonable to make
the device use a single queue by default for now.
Add a way to switch multiqueue on/off using ethtool.

This way guest admin can tune the device for the
workload manually until we manage to imlement some
self-tuning heuristics.



Ethtool already has this switch 'ethtool -L' can be
used to set the number tx/rx channels. So you would
likely just need to add a set_channels hook.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next PATCH v0 3/5] net: add fdb generic dump routine

2012-03-26 Thread John Fastabend

On 3/25/2012 6:09 AM, Roopa Prabhu wrote:
 
 
 
 On 3/18/12 11:52 PM, John Fastabend john.r.fastab...@intel.com wrote:
 
 This adds a generic dump routine drivers can call. It
 should be sufficient to handle any bridging model that
 uses the unicast address list. This should be most SR-IOV
 enabled NICs.

 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 ---


[...]

 +/**
 + * ndo_dflt_fdb_dump: default netdevice operation to dump an FDB table.
 + * @nlh: netlink message header
 + * @dev: netdevice
 + *
 + * Default netdevice operation to dump the existing unicast address list.
 + * Returns zero on success.
 + */
 +int ndo_dflt_fdb_dump(struct sk_buff *skb,
 +struct netlink_callback *cb,
 +struct net_device *dev,
 +int idx)
 +{
 + struct netdev_hw_addr *ha;
 + struct nlmsghdr *nlh;
 + struct ndmsg *ndm;
 + u32 pid, seq;
 +
 + pid = NETLINK_CB(cb-skb).pid;
 + seq = cb-nlh-nlmsg_seq;
 +
 + netif_addr_lock_bh(dev);
 + list_for_each_entry(ha, dev-uc.list, list) {
 +  if (idx  cb-args[0])
 +   goto skip;
 
 Any reason why its only uc ?. What about mc ?

Sure this might be useful to know for embedded devices
and likely more useful for the macvlan driver. I'll add
it in the next version.

Thanks,
John


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next PATCH v0 0/5] Series short description

2012-03-19 Thread John Fastabend

This series is a follow up to this thread:

http://www.spinics.net/lists/netdev/msg191360.html

This adds two NTF_XXX bits to signal if the PF_BRIDGE
netlink command should be parsed by the embedded bridge
or the SW bridge. The insight here is the SW bridge is
always the master device (NTF_MASTER) and the embedded
bridge is the lower device (NTF_LOWERDEV). Without either
flag set the command is parsed by the SW bridge to support
existing tooling.

To make this work correctly I added three new ndo ops

ndo_fdb_add
ndo_fdb_del
ndo_fdb_dump

to add, delete, and dump FDB entries. These operations
can be used by drivers to program embedded nics or by
software bridges. We have at least three SW bridge now
net/bridge, openvswitch, and macvlan. And three variants
of embedded bridges SR-IOV devices, multi-function devices
and Distributed Switch Architecture (DSA).

I think at least in this case adding netdevice ops is
the cleanest way to implement this. I thought about
notifier hooks and other methods but this seems to be
the simplest.

I've tested these three scenarios, embedded bridge only,
sw bridge only, and embedded bridge and SW bridge. These
are working on the Intel 82599 devices with this patch
series. I am also working on a patch for the macvlan
drivers. I'll submit that as an RFC shortly so far I
only have the passthru mode wired up.

Thanks to Stephen, Ben, and Jamal for bearing with me
and the feedback on the last round of patches.

As always any comments/feedback appreciated!

---

John Fastabend (5):
  ixgbe: allow RAR table to be updated in promisc mode
  ixgbe: enable FDB netdevice ops
  net: add fdb generic dump routine
  net: addr_list: add exclusive dev_uc_add
  net: add generic PF_BRIDGE:RTM_XXX FDB hooks


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   80 +-
 include/linux/neighbour.h |3 
 include/linux/netdevice.h |   27 +++
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 
 net/bridge/br_fdb.c   |  128 
 net/bridge/br_netlink.c   |   12 --
 net/bridge/br_private.h   |   15 ++
 net/core/dev_addr_lists.c |   19 ++
 net/core/rtnetlink.c  |  194 +
 10 files changed, 363 insertions(+), 122 deletions(-)

-- 
Signature
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next PATCH v0 2/5] net: addr_list: add exclusive dev_uc_add

2012-03-19 Thread John Fastabend

This adds a dev_uc_add_excl() call similar to the original
dev_uc_add() except it sets the global bit. With this
change the reference count will not be bumped and -EEXIST
will be returned if a duplicate address exists.

This is useful for drivers that support SR-IOV and want
to manage the unicast lists.

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 include/linux/netdevice.h |1 +
 net/core/dev_addr_lists.c |   19 +++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4208901..5e43cec 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2571,6 +2571,7 @@ extern int dev_addr_init(struct net_device *dev);
 
 /* Functions used for unicast addresses handling */
 extern int dev_uc_add(struct net_device *dev, unsigned char *addr);
+extern int dev_uc_add_excl(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_del(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_sync(struct net_device *to, struct net_device *from);
 extern void dev_uc_unsync(struct net_device *to, struct net_device *from);
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 29c07fe..c7d27ad 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -377,6 +377,25 @@ EXPORT_SYMBOL(dev_addr_del_multiple);
  */
 
 /**
+ * dev_uc_add_excl - Add a global secondary unicast address
+ * @dev: device
+ * @addr: address to add
+ */
+int dev_uc_add_excl(struct net_device *dev, unsigned char *addr)
+{
+   int err;
+
+   netif_addr_lock_bh(dev);
+   err = __hw_addr_add_ex(dev-uc, addr, dev-addr_len,
+  NETDEV_HW_ADDR_T_UNICAST, true);
+   if (!err)
+   __dev_set_rx_mode(dev);
+   netif_addr_unlock_bh(dev);
+   return err;
+}
+EXPORT_SYMBOL(dev_uc_add_excl);
+
+/**
  * dev_uc_add - Add a secondary unicast address
  * @dev: device
  * @addr: address to add

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next PATCH v0 3/5] net: add fdb generic dump routine

2012-03-19 Thread John Fastabend

This adds a generic dump routine drivers can call. It
should be sufficient to handle any bridging model that
uses the unicast address list. This should be most SR-IOV
enabled NICs.

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 net/core/rtnetlink.c |   56 ++
 1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 8c3278a..35ee2d6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2082,6 +2082,62 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg)
return err;
 }
 
+/**
+ * ndo_dflt_fdb_dump: default netdevice operation to dump an FDB table.
+ * @nlh: netlink message header
+ * @dev: netdevice
+ *
+ * Default netdevice operation to dump the existing unicast address list.
+ * Returns zero on success.
+ */
+int ndo_dflt_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct netdev_hw_addr *ha;
+   struct nlmsghdr *nlh;
+   struct ndmsg *ndm;
+   u32 pid, seq;
+
+   pid = NETLINK_CB(cb-skb).pid;
+   seq = cb-nlh-nlmsg_seq;
+
+   netif_addr_lock_bh(dev);
+   list_for_each_entry(ha, dev-uc.list, list) {
+   if (idx  cb-args[0])
+   goto skip;
+
+   nlh = nlmsg_put(skb, pid, seq,
+   RTM_NEWNEIGH, sizeof(*ndm), NLM_F_MULTI);
+   if (!nlh)
+   break;
+
+   ndm = nlmsg_data(nlh);
+   ndm-ndm_family  = AF_BRIDGE;
+   ndm-ndm_pad1= 0;
+   ndm-ndm_pad2= 0;
+   ndm-ndm_flags   = NTF_LOWERDEV;
+   ndm-ndm_type= 0;
+   ndm-ndm_ifindex = dev-ifindex;
+   ndm-ndm_state   = NUD_PERMANENT;
+
+   NLA_PUT(skb, NDA_LLADDR, ETH_ALEN, ha-addr);
+
+   nlmsg_end(skb, nlh);
+skip:
+   ++idx;
+   }
+   netif_addr_unlock_bh(dev);
+
+   return idx;
+nla_put_failure:
+   netif_addr_unlock_bh(dev);
+   nlmsg_cancel(skb, nlh);
+   return idx;
+}
+EXPORT_SYMBOL(ndo_dflt_fdb_dump);
+
 static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
int idx = 0;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next PATCH v0 4/5] ixgbe: enable FDB netdevice ops

2012-03-19 Thread John Fastabend

Enable FDB ops on ixgbe when in SR-IOV mode.

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   59 +
 1 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1d8f9f8..32adb4f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7586,6 +7586,62 @@ static int ixgbe_set_features(struct net_device *netdev,
 
 }
 
+static int ixgbe_ndo_fdb_add(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr,
+u16 flags)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm-ndm_state  NUD_PERMANENT) {
+   pr_info(%s: FDB only supports static addresses\n,
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_add_excl(dev, addr);
+
+   /* Only return duplicate errors if NLM_F_EXCL is set */
+   if (err == -EEXIST  !(flags  NLM_F_EXCL))
+   err = 0;
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_del(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm-ndm_state  NUD_PERMANENT) {
+   pr_info(%s: FDB only supports static addresses\n,
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_del(dev, addr);
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED)
+   idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
+
+   return idx;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
@@ -7620,6 +7676,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 #endif /* IXGBE_FCOE */
.ndo_set_features = ixgbe_set_features,
.ndo_fix_features = ixgbe_fix_features,
+   .ndo_fdb_add= ixgbe_ndo_fdb_add,
+   .ndo_fdb_del= ixgbe_ndo_fdb_del,
+   .ndo_fdb_dump   = ixgbe_ndo_fdb_dump,
 };
 
 static void __devinit ixgbe_probe_vf(struct ixgbe_adapter *adapter,

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next PATCH v0 5/5] ixgbe: allow RAR table to be updated in promisc mode

2012-03-19 Thread John Fastabend

This allows RAR table updates while in promiscuous. With
SR-IOV enabled it is valuable to allow the RAR table to
be updated even when in promisc mode to configure forwarding

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++--
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 32adb4f..d1925b5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3400,16 +3400,17 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
}
ixgbe_vlan_filter_enable(adapter);
hw-addr_ctrl.user_set_promisc = false;
-   /*
-* Write addresses to available RAR registers, if there is not
-* sufficient space to store all the addresses then enable
-* unicast promiscuous mode
-*/
-   count = ixgbe_write_uc_addr_list(netdev);
-   if (count  0) {
-   fctrl |= IXGBE_FCTRL_UPE;
-   vmolr |= IXGBE_VMOLR_ROPE;
-   }
+   }
+
+   /*
+* Write addresses to available RAR registers, if there is not
+* sufficient space to store all the addresses then enable
+* unicast promiscuous mode
+*/
+   count = ixgbe_write_uc_addr_list(netdev);
+   if (count  0) {
+   fctrl |= IXGBE_FCTRL_UPE;
+   vmolr |= IXGBE_VMOLR_ROPE;
}
 
if (adapter-num_vfs) {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next PATCH v0 1/5] net: add generic PF_BRIDGE:RTM_ FDB hooks

2012-03-19 Thread John Fastabend

Forgot to change the title resending with a title that won't
be dropped by netdev and kvm mailing lists. And updated my
local repo so it won't happen again.

---

This adds two new flags NTF_MASTER and NTF_LOWERDEV that can
now be used to specify where PF_BRIDGE netlink commands should
be sent. NTF_MASTER sends the commands to the 'dev-master'
device for parsing. Typically this will be the linux net/bridge,
macvlan, or open-vswitch devices. Also without any flags set
the command will be handled by the master device as well so
that current user space tools continue to work as expected.

The NTF_LOWERDEV flag will push the PF_BRIDGE commands to the
device. In the basic example below the commands are then parsed
and programmed in the embedded bridge.

Note if both NTF_LOWERDEV and NTF_MASTER bits are set then the
command will be sent both to 'dev-master' and 'dev' this allows
user space to easily keep the embedded bridge and software bridge
in sync.

To support this new net device ops were added to call into
the device and the existing bridging code was refactored
to use these. There should be no change from user space.

A basic setup with a SR-IOV enabled NIC looks like this,

  veth0  veth2
|  |
  
  |  bridge0 |    software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \   propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge | hardware offloaded switching
  

In this case the embedded bridge must be managed to allow 'veth0'
to communicate with 'ethx.y' correctly. At present drivers managing
the embedded bridge either send frames onto the network which
then get dropped by the switch OR the embedded bridge will flood
these frames. With this patch we have a mechanism to manage the
embedded bridge correctly from user space. This example is specific
to SR-IOV but replacing the VF with another PF or dropping this
into the DSA framework generates similar management issues.

Examples session using the 'br'[1] tool to add, dump and then
delete a mac address with a new embedded option and enabled
ixgbe driver:

# br fdb add 22:35:19:ac:60:59 dev eth3
# br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
#br fdb add 22:35:19:ac:60:59 embedded dev eth3
#br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
eth322:35:19:ac:60:59   local embedded
#br fdb del 22:35:19:ac:60:59 embedded dev eth3

I added a couple lines to 'br' to set the flags correctly is all. It
is my opinion that the merit of this patch is now embedded and SW
bridges can both be modeled correctly in user space using very nearly
the same message passing.

[1] 'br' tool was published as an RFC here and will be renamed 'bridge'
http://patchwork.ozlabs.org/patch/117664/

Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for
valuable feedback, suggestions, and review.

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 include/linux/neighbour.h |3 +
 include/linux/netdevice.h |   26 
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 +
 net/bridge/br_fdb.c   |  128 ++
 net/bridge/br_netlink.c   |   12 
 net/bridge/br_private.h   |   15 -
 net/core/rtnetlink.c  |  138 +
 8 files changed, 217 insertions(+), 112 deletions(-)

diff --git a/include/linux/neighbour.h b/include/linux/neighbour.h
index b188f68..3a94409 100644
--- a/include/linux/neighbour.h
+++ b/include/linux/neighbour.h
@@ -33,6 +33,9 @@ enum {
 #define NTF_PROXY  0x08/* == ATF_PUBL */
 #define NTF_ROUTER 0x80
 
+#define NTF_LOWERDEV   0x02
+#define NTF_MASTER 0x04
+
 /*
  * Neighbor Cache Entry States.
  */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4535a4e..4208901 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -54,6 +54,7 @@
 #include net/netprio_cgroup.h
 
 #include linux/netdev_features.h
+#include linux/neighbour.h
 
 struct netpoll_info;
 struct phy_device;
@@ -904,6 +905,19 @@ struct netdev_fcoe_hbainfo {
  * feature set might be less than what was returned by ndo_fix_features()).
  * Must return 0 or -errno if it changed dev-features itself.
  *
+ * int (*ndo_fdb_add)(struct ndmsg *ndm, struct net_device *dev,
+ *   unsigned char *addr, u16 flags)
+ * Adds an FDB entry to dev for addr. The ndmsg contains flags to indicate
+ * if the dev-master FDB

Re: [net-next PATCH v0 0/5] Series short description

2012-03-19 Thread John Fastabend

On 3/19/2012 3:55 PM, Stephen Hemminger wrote:
 On Mon, 19 Mar 2012 18:38:08 -0400 (EDT)
 David Miller da...@davemloft.net wrote:

 From: John Fastabend john.r.fastab...@intel.com
 Date: Sun, 18 Mar 2012 23:51:45 -0700

 This series is a follow up to this thread:

 http://www.spinics.net/lists/netdev/msg191360.html

 Can the interested parties please review this series?

 I'm willing to apply this right now if it looks OK, but if
 it needs more revisions we'll have to defer.

 Please don't rush this into this merge window. It needs more than
 1 full day of review.

Dave, its probably fine to push this to 3.5 then. I can
resubmit after you close the merge window if you want? This
has been somewhat broken for SR-IOV cards for multiple
kernel releases now anyways one more wont hurt too much.

I'll work with Roopa to get the macvlan driver plugged into
the fdb ops in the meantime and maybe get DSA as well.

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next PATCH v0 0/5] Series short description

2012-03-19 Thread John Fastabend

On 3/19/2012 5:35 PM, David Miller wrote:
 From: John Fastabend john.r.fastab...@intel.com
 Date: Mon, 19 Mar 2012 17:27:00 -0700

 Dave, its probably fine to push this to 3.5 then.

 Fair enough.

Stephen, please let me know if you see any issues though
because without these we have no way to forward packets
correctly in the embedded switch. So we can't really
use SR-IOV and virtual interfaces together correctly. And
the macvlan device in passthru mode is putting the device
in promiscuous mode which isn't great either.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH v1 0/4] net: bridge: FDB management

2012-03-09 Thread John Fastabend

This series is a follow up to the previous thread here:

http://lists.openwall.net/netdev/2012/02/29/31

There are some significant changes in this series. First
I add two NTF_XXX bits to signal if the PF_BRIDGE netlink
command should be parsed by the embedded bridge or the
SW bridge. The insight here is the SW bridge is always the
master device (NTF_MASTER) and the embedded bridge is
the lower device (NTF_LOWERDEV). Without either flag set
the command is parsed by the SW bridge to support existing
tooling.

To make this work correctly I added three new ndo ops

ndo_fdb_add
ndo_fdb_del
ndo_fdb_dump

to add, delete, and dump FDB entries. These operations
can be used by drivers to program embedded nics or by
software bridges. We have at least three SW bridge now
net/bridge, openvswitch, and macvlan. And three variants
of embedded bridges SR-IOV devices, multi-function devices
and Distributed Switch Architecture (DSA).

I think at least in this case adding netdevice ops is
the cleanest way to implement this. I thought about
notifier hooks and other methods but for now at least
this seems to be the simplest.

I'm going to drop this into my testbed and let it run
for a few days. But I think (hope?) this series is close
to being ready for a non-RFC submission. I'll probably
audit the patches once more as well.

Thanks to Stephen, Ben, and Jamal for bearing with me
and the feedback on the last round of patches.

As always any comments/feedback is appreciated!

---

John Fastabend (4):
  ixgbe: enable FDB netdevice ops
  net: add fdb generic dump routine
  net: addr_list: add exclusive dev_uc_add
  net: add generic PF_BRIDGE:RTM_XXX FDB hooks


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   59 
 include/linux/neighbour.h |3 
 include/linux/netdevice.h |   27 +++
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 
 net/bridge/br_fdb.c   |  128 
 net/bridge/br_netlink.c   |   12 --
 net/bridge/br_private.h   |   15 ++
 net/core/dev_addr_lists.c |   19 ++
 net/core/rtnetlink.c  |  194 +
 10 files changed, 352 insertions(+), 112 deletions(-)

-- 
Signature
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH v1 2/4] net: addr_list: add exclusive dev_uc_add

2012-03-09 Thread John Fastabend

This adds a dev_uc_add_excl() call similar to the original
dev_uc_add() except it sets the global bit. With this
change the reference count will not be bumped and -EEXIST
will be returned if a duplicate address exists.

This is useful for drivers that support SR-IOV and want
to manage the unicast lists.

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 include/linux/netdevice.h |1 +
 net/core/dev_addr_lists.c |   19 +++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3963992..7e4a86f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2553,6 +2553,7 @@ extern int dev_addr_init(struct net_device *dev);
 
 /* Functions used for unicast addresses handling */
 extern int dev_uc_add(struct net_device *dev, unsigned char *addr);
+extern int dev_uc_add_excl(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_del(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_sync(struct net_device *to, struct net_device *from);
 extern void dev_uc_unsync(struct net_device *to, struct net_device *from);
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 29c07fe..c7d27ad 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -377,6 +377,25 @@ EXPORT_SYMBOL(dev_addr_del_multiple);
  */
 
 /**
+ * dev_uc_add_excl - Add a global secondary unicast address
+ * @dev: device
+ * @addr: address to add
+ */
+int dev_uc_add_excl(struct net_device *dev, unsigned char *addr)
+{
+   int err;
+
+   netif_addr_lock_bh(dev);
+   err = __hw_addr_add_ex(dev-uc, addr, dev-addr_len,
+  NETDEV_HW_ADDR_T_UNICAST, true);
+   if (!err)
+   __dev_set_rx_mode(dev);
+   netif_addr_unlock_bh(dev);
+   return err;
+}
+EXPORT_SYMBOL(dev_uc_add_excl);
+
+/**
  * dev_uc_add - Add a secondary unicast address
  * @dev: device
  * @addr: address to add

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH v1 3/4] net: add fdb generic dump routine

2012-03-09 Thread John Fastabend

This adds a generic dump routine drivers can call. It
should be sufficient to handle any bridging model that
uses the unicast address list. This should be most SR-IOV
enabled NICs.

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 net/core/rtnetlink.c |   56 ++
 1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 8c3278a..35ee2d6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2082,6 +2082,62 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg)
return err;
 }
 
+/**
+ * ndo_dflt_fdb_dump: default netdevice operation to dump an FDB table.
+ * @nlh: netlink message header
+ * @dev: netdevice
+ *
+ * Default netdevice operation to dump the existing unicast address list.
+ * Returns zero on success.
+ */
+int ndo_dflt_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct netdev_hw_addr *ha;
+   struct nlmsghdr *nlh;
+   struct ndmsg *ndm;
+   u32 pid, seq;
+
+   pid = NETLINK_CB(cb-skb).pid;
+   seq = cb-nlh-nlmsg_seq;
+
+   netif_addr_lock_bh(dev);
+   list_for_each_entry(ha, dev-uc.list, list) {
+   if (idx  cb-args[0])
+   goto skip;
+
+   nlh = nlmsg_put(skb, pid, seq,
+   RTM_NEWNEIGH, sizeof(*ndm), NLM_F_MULTI);
+   if (!nlh)
+   break;
+
+   ndm = nlmsg_data(nlh);
+   ndm-ndm_family  = AF_BRIDGE;
+   ndm-ndm_pad1= 0;
+   ndm-ndm_pad2= 0;
+   ndm-ndm_flags   = NTF_LOWERDEV;
+   ndm-ndm_type= 0;
+   ndm-ndm_ifindex = dev-ifindex;
+   ndm-ndm_state   = NUD_PERMANENT;
+
+   NLA_PUT(skb, NDA_LLADDR, ETH_ALEN, ha-addr);
+
+   nlmsg_end(skb, nlh);
+skip:
+   ++idx;
+   }
+   netif_addr_unlock_bh(dev);
+
+   return idx;
+nla_put_failure:
+   netif_addr_unlock_bh(dev);
+   nlmsg_cancel(skb, nlh);
+   return idx;
+}
+EXPORT_SYMBOL(ndo_dflt_fdb_dump);
+
 static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
int idx = 0;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH v1 4/4] ixgbe: enable FDB netdevice ops

2012-03-09 Thread John Fastabend

Enable FDB ops on ixgbe when in SR-IOV mode.

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   59 +
 1 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 23a4665..c41439c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7484,6 +7484,62 @@ static int ixgbe_set_features(struct net_device *netdev,
 
 }
 
+static int ixgbe_ndo_fdb_add(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr,
+u16 flags)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm-ndm_state  NUD_PERMANENT) {
+   pr_info(%s: FDB only supports static addresses\n,
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_add_excl(dev, addr);
+
+   /* Only return duplicate errors if NLM_F_EXCL is set */
+   if (err == -EEXIST  !(flags  NLM_F_EXCL))
+   err = 0;
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_del(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm-ndm_state  NUD_PERMANENT) {
+   pr_info(%s: FDB only supports static addresses\n,
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_del(dev, addr);
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED)
+   idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
+
+   return idx;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
@@ -7518,6 +7574,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 #endif /* IXGBE_FCOE */
.ndo_set_features = ixgbe_set_features,
.ndo_fix_features = ixgbe_fix_features,
+   .ndo_fdb_add= ixgbe_ndo_fdb_add,
+   .ndo_fdb_del= ixgbe_ndo_fdb_del,
+   .ndo_fdb_dump   = ixgbe_ndo_fdb_dump,
 };
 
 static void __devinit ixgbe_probe_vf(struct ixgbe_adapter *adapter,

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH v1 1/4] net: add generic PF_BRIDGE:RTM_ FDB hooks

2012-03-09 Thread John Fastabend

Resending this patch with a new title apparently PF_BRIDGE:RTM_XXX
FDB hooks looks like spam. Sorry for the noise if you get a dup.

---
net: add generic PF_BRIDGE:RTM_XXX FDB hooks

This adds two new flags NTF_MASTER and NTF_LOWERDEV that can
now be used to specify where PF_BRIDGE netlink commands should
be sent. NTF_MASTER sends the commands to the 'dev-master'
device for parsing. Typically this will be the linux net/bridge,
macvlan, or open-vswitch devices. Also without any flags set
the command will be handled by the master device as well so
that current user space tools continue to work as expected.

The NTF_LOWERDEV flag will push the PF_BRIDGE commands to the
device. In the basic example below the commands are then parsed
and programmed in the embedded bridge.

Note if both NTF_LOWERDEV and NTF_MASTER bits are set then the
command will be sent both to 'dev-master' and 'dev' this allows
user space to easily keep the embedded bridge and software bridge
in sync.

To support this new net device ops were added to call into
the device and the existing bridging code was refactored
to use these. There should be no change from user space.

A basic setup with a SR-IOV enabled NIC looks like this,

  veth0  veth2
|  |
  
  |  bridge0 |    software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \   propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge | hardware offloaded switching
  

In this case the embedded bridge must be managed to allow 'veth0'
to communicate with 'ethx.y' correctly. At present drivers managing
the embedded bridge either send frames onto the network which
then get dropped by the switch OR the embedded bridge will flood
these frames. With this patch we have a mechanism to manage the
embedded bridge correctly from user space. This example is specific
to SR-IOV but replacing the VF with another PF or dropping this
into the DSA framework generates similar management issues.

Examples session using the 'br'[1] tool to add, dump and then
delete a mac address with a new embedded option and enabled
ixgbe driver:

# br fdb add 22:35:19:ac:60:59 dev eth3
# br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
#br fdb add 22:35:19:ac:60:59 embedded dev eth3
#br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
eth322:35:19:ac:60:59   local embedded
#br fdb del 22:35:19:ac:60:59 embedded dev eth3

I added a couple lines to 'br' to set the flags correctly is all. It
is my opinion that the merit of this patch is now embedded and SW
bridges can both be modeled correctly in user space using very nearly
the same message passing.

[1] 'br' tool was published as an RFC here and will be renamed 'bridge'
http://patchwork.ozlabs.org/patch/117664/

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 include/linux/neighbour.h |3 +
 include/linux/netdevice.h |   26 
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 +
 net/bridge/br_fdb.c   |  128 ++
 net/bridge/br_netlink.c   |   12 
 net/bridge/br_private.h   |   15 -
 net/core/rtnetlink.c  |  138 +
 8 files changed, 217 insertions(+), 112 deletions(-)

diff --git a/include/linux/neighbour.h b/include/linux/neighbour.h
index b188f68..3a94409 100644
--- a/include/linux/neighbour.h
+++ b/include/linux/neighbour.h
@@ -33,6 +33,9 @@ enum {
 #define NTF_PROXY  0x08/* == ATF_PUBL */
 #define NTF_ROUTER 0x80
 
+#define NTF_LOWERDEV   0x02
+#define NTF_MASTER 0x04
+
 /*
  * Neighbor Cache Entry States.
  */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a89933b..3963992 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -54,6 +54,7 @@
 #include net/netprio_cgroup.h
 
 #include linux/netdev_features.h
+#include linux/neighbour.h
 
 struct netpoll_info;
 struct phy_device;
@@ -904,6 +905,19 @@ struct netdev_fcoe_hbainfo {
  * feature set might be less than what was returned by ndo_fix_features()).
  * Must return 0 or -errno if it changed dev-features itself.
  *
+ * int (*ndo_fdb_add)(struct ndmsg *ndm, struct net_device *dev,
+ *   unsigned char *addr, u16 flags)
+ * Adds an FDB entry to dev for addr. The ndmsg contains flags to indicate
+ * if the dev-master FDB should be updated or the devices internal FDB.
+ * int (*ndo_fdb_del)(struct ndmsg *ndm, struct

Re: [RFC PATCH v1 4/4] ixgbe: enable FDB netdevice ops

2012-03-09 Thread John Fastabend

On 3/9/2012 7:48 PM, Stephen Hemminger wrote:
 
 Enable FDB ops on ixgbe when in SR-IOV mode.

 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 
 Will all this break anything on the vf client? What if the vf is running
 a bridge.

No shouldn't break anything.

Actually, implementing these ops in the VF driver (ixgbevf in this
case) will allow bridging to work on the VF. Because at least on
the intel 82599 devices the VF can't be put in promiscuous mode
so bridging doesn't work as expected. With the ops implemented
we could at least get traffic forwarded correctly for addresses
that were known above the VF.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-05 Thread John Fastabend

On 3/5/2012 8:53 AM, Lennert Buytenhek wrote:
 On Tue, Feb 28, 2012 at 08:40:06PM -0800, John Fastabend wrote:
 
 Also if there are embedded switches with learning capabilities they
 might want to trigger events to user space. In this case having
 a protocol type makes user space a bit easier to manage. I've
 added Lennert so maybe he can comment I think the Marvell chipsets
 might support something along these lines. The SR-IOV chipsets I'm
 aware of _today_ don't do learning. Learning makes the event model
 more plausible.
 
 net/dsa currently configures any switch chips in the system to do
 auto-learning.  However, I would much prefer to disable that, and have
 the switch chip just pass up packets for new source addresses, have
 Linux do the learning, and then mirror the Linux software FDB into
 the hardware instead -- that avoids having to manually flush the
 hardware FDB on certain STP state transitions or having to configure
 the hardware to use a shorter address learning timeout when we're in
 the middle of an STP topology change, which are problems we are
 running into in practice.
 

Great. And the plan is we should be able to use the same daemon with
minimal changes (currently a flag) to control both sw and hw bridges.

 Just curious -- while your patches allow propagating FDB entries
 into the hardware, do you also have hooks to tell the hardware which
 ports are to share address databases?
 

Not in the current patches. I don't have hardware right now
that can instantiate multiple bridges. When I get some I was hoping
to do something similar to this patch and use netlink commands
to create/delete bridges and add/remove ports to them. This would
be modifying the existing commands to work for both software and
hardware bridges.

By a bridge instantiation I mean a shared address database in this case.

 For net/dsa, we currently have:
 
   http://patchwork.ozlabs.org/patch/16578/
 
 While I think this is conceptually sound, the implementation is hacky,
 and I wonder how you've solved it for your setup, and if DSA can
 piggy-back off that.

Yep anything we come up with should work in both cases.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread John Fastabend

On 3/1/2012 6:14 AM, Michael S. Tsirkin wrote:
 On Wed, Feb 29, 2012 at 09:25:56AM -0800, John Fastabend wrote:
 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,

 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/


 .John
 
 One place where this might not work well would be
 macvtap which is not a network device so it doesn't have
 its own address, instead it inherits one from macvlan.
 

But is macvtap really doing any forwarding or implementing any
RX filters? Took a quick scan and it looks like the forwarding
logic is all in the macvlan code paths. In this case I suspect
if we enable macvlan then any device built on top of it would
work.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread John Fastabend

On 3/1/2012 5:36 AM, Jamal Hadi Salim wrote:
 On Wed, 2012-02-29 at 10:19 -0800, John Fastabend wrote:
 

 I want to see a unified API so that user space control applications (RSTP, 
 TRILL?)
 can use one set of netlink calls for both software bridge and hardware 
 offloaded
 bridges.  Does this proposal meet that requirement?

 
 I dont see any issues with those requirements being met.
 
 Jamal, so why do They have to be different calls? I'm not so sure 
 anymore...
 moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but 
 that
 is just cosmetic.
 
 I may not want to use the s/ware bridge i.e I may want to use h/ware
 bridge. I may want to use both. So there are 3 variations there. You
 need at least 1.5 bits to represent them if you are going to use the
 same interface. There may be features in either h/ware but not in
 s/ware and vice-versa. 
 A single interface with flags which say this applies to hware:sware:both
 would be good, but it may be harder to achieve - thats why i suggested
 they be different.
 
 cheers,
 jamal
 

Hmm so I think what I'll do is this...

 both: ndm_flags = 0 
 sw  : ndm_flags = NTF_SW_FDB
 hw  : ndm_flags = NTF_HW_FDB

Then current tools will work with embedded bridges and software bridges
with the interesting case being when a port supporting an offloaded FDB
is attached to a SW bridge. Doing both in this case seems to be a reasonable
default to me.

The tricky bit will be pulling the message handlers out of the ./net/bridge
code so that we don't have to always load the bridge module to add entries
to a macvlan for example. I need to look at a few other things today but
I'll code up a patch for this tomorrow.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend

On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:
 
 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH

  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table

 
 Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
 to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
 may play a role in the user space app populating the FDB, i dont think
 they are necessary players.
 Learning could be via a table entry miss and packet redirect to user
 space.
 So my suggestion is to use FDB_*ENTRY for names
  

Well I think NETLINK_ROUTE is the most correct type to use in this
case. Per netlink.h its for routing and device hooks.

#define NETLINK_ROUTE   0   /* Routing/device hook  
*/

And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
were merely a copy from the SW BRIDGE code paths. How about,

PF_BRIDGE:RTM_FDB_NEWENTRY
PF_BRIDGE:RTM_FDB_DELENTRY
PF_BRIDGE:RTM_FDB_GETENTRY

And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
rtnl locking semantics for free.

 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.
 
 They have to be different calls from the calls that talk to the s/ware
 bridge. In my opinion, as controversial as this may sound, you need to
 be flexible enough that some vendor can replace these calls with
 proprietary calls which are more efficient for their hardware. So a
 plugin to replace these calls in the user space code would be a 
 good idea. Alternatively, you could make that something they do at
 the driver level i.e from user space to kernel it is hardware, please
 addthistotheFDBtable() call and the implementation of that could be
 proprietary to the specific hardware.
 

Agreed. I think adding some ndo_ops for bridging offloads here would
work. For example the DSA infrastructure and/or macvlan devices might
need this. Along the lines of extending this RFC,

[RFC] hardware bridging support for DSA switches
http://patchwork.ozlabs.org/patch/16578/


.John

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend

On 2/29/2012 9:52 AM, Stephen Hemminger wrote:
 On Wed, 29 Feb 2012 09:25:56 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:

 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH

  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table


 Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
 to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
 may play a role in the user space app populating the FDB, i dont think
 they are necessary players.
 Learning could be via a table entry miss and packet redirect to user
 space.
 So my suggestion is to use FDB_*ENTRY for names
  

 Well I think NETLINK_ROUTE is the most correct type to use in this
 case. Per netlink.h its for routing and device hooks.

 #define NETLINK_ROUTE   0   /* Routing/device hook   
*/

 And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
 were merely a copy from the SW BRIDGE code paths. How about,

 PF_BRIDGE:RTM_FDB_NEWENTRY
 PF_BRIDGE:RTM_FDB_DELENTRY
 PF_BRIDGE:RTM_FDB_GETENTRY

 And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
 rtnl locking semantics for free.

 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.

 They have to be different calls from the calls that talk to the s/ware
 bridge. In my opinion, as controversial as this may sound, you need to
 be flexible enough that some vendor can replace these calls with
 proprietary calls which are more efficient for their hardware. So a
 plugin to replace these calls in the user space code would be a 
 good idea. Alternatively, you could make that something they do at
 the driver level i.e from user space to kernel it is hardware, please
 addthistotheFDBtable() call and the implementation of that could be
 proprietary to the specific hardware.


 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,

 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/
 
 I want to see a unified API so that user space control applications (RSTP, 
 TRILL?)
 can use one set of netlink calls for both software bridge and hardware 
 offloaded
 bridges.  Does this proposal meet that requirement?
 

With the patches I sent out last night the same netlink calls are used
for both SW and HW with a flag set in ndm_flags to indicate it is a hardware
entry. The flag is needed when a port has offload support and is also
a slave of a SW bridge. Another option would be to apply the command to both
hardware and software tables. This might be good enough and user space would
not have to make distinctions between HW and SW bridges. Also helps with my
original use case where I want the SW and HW bridge FDBs to be in sync.

In response to Jamal's comment I proposed changing the type to RTM_FDB_XXXENTRY
but the message contents are the same in both cases.

Jamal, so why do They have to be different calls? I'm not so sure anymore...
moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but that
is just cosmetic.

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend

On 2/18/2012 4:41 AM, jamal wrote:
 On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:
 
 Yes I agree that is the goal.

 One last comment:
 With synchronization there are other challenges when the entry in the
 hardware conflicts with the entry in software when you intend the
 behavior to be the same. This is not such a big deal with bridging but
 becomes more apparent when you start offloading ACLs etc.


 OK and these sorts of conflicts certainly don't need to be resolved
 by kernel code. So I think this is a reasonable reason to drive the
 synchronization into a user space daemon.
 
 
 Yep. 
 Thanks for listening John. Waiting to see them patches.
 
 cheers,
 jamal
 
 
 

+Lennert

OK back to this. The last piece is where to put these messages...
we could take PF_ROUTE:RTM_*NEIGH

 PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
 switch.
 PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
 switch.
 PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table

The neighbor code is using the PF_UNSPEC protocol type so we won't
collide with these unless someone was using PF_ROUTE and relying on
falling back to PF_UNSPEC however I couldn't find any programs that
did this iproute2 certainly doesn't. And the bridge pieces are using
PF_BRIDGE so no collision there.

I briefly thought about trying to pull the PF_BRIDGE protocol out
and use this for both types but I think its better to leave the
bridge code alone and there is also the issue of disambiguating a msg
at a port which has both an embedded switch and has SW bridge for a
master.

Also if there are embedded switches with learning capabilities they
might want to trigger events to user space. In this case having
a protocol type makes user space a bit easier to manage. I've
added Lennert so maybe he can comment I think the Marvell chipsets
might support something along these lines. The SR-IOV chipsets I'm
aware of _today_ don't do learning. Learning makes the event model
more plausible.

The other mechanism would be to embed some more attributes into the
PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
support learning and triggering events then we likely also don't
want to send these events to every app with RTNLGRP_LINK set.

Plus there is already a proliferation of LINK attributes and dumping
the FDB out of this seems a bit much but could be done with some
bitmasks. Although the current ext_filter_mask u32 doesn't seem to
be sufficient for events to trigger this.

so much for a short note...

Thanks
.John




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend

On 2/28/2012 8:40 PM, John Fastabend wrote:
 On 2/18/2012 4:41 AM, jamal wrote:
 On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:

 Yes I agree that is the goal.

 One last comment:
 With synchronization there are other challenges when the entry in the
 hardware conflicts with the entry in software when you intend the
 behavior to be the same. This is not such a big deal with bridging but
 becomes more apparent when you start offloading ACLs etc.


 OK and these sorts of conflicts certainly don't need to be resolved
 by kernel code. So I think this is a reasonable reason to drive the
 synchronization into a user space daemon.


 Yep. 
 Thanks for listening John. Waiting to see them patches.

 cheers,
 jamal



 
 +Lennert
 
 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH
 
  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
 
 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.
 
 I briefly thought about trying to pull the PF_BRIDGE protocol out
 and use this for both types but I think its better to leave the
 bridge code alone and there is also the issue of disambiguating a msg
 at a port which has both an embedded switch and has SW bridge for a
 master.

Maybe I gave up too quickly here I could use a bit in the ndm_flags to
specify embedded or sw bridge. But would require having the bridge
module loaded.

 
 Also if there are embedded switches with learning capabilities they
 might want to trigger events to user space. In this case having
 a protocol type makes user space a bit easier to manage. I've
 added Lennert so maybe he can comment I think the Marvell chipsets
 might support something along these lines. The SR-IOV chipsets I'm
 aware of _today_ don't do learning. Learning makes the event model
 more plausible.
 

Just checked looks like the DSA infrastructure has commands to enable
STP so guess it is doing learning.

 The other mechanism would be to embed some more attributes into the
 PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
 support learning and triggering events then we likely also don't
 want to send these events to every app with RTNLGRP_LINK set.
 
 Plus there is already a proliferation of LINK attributes and dumping
 the FDB out of this seems a bit much but could be done with some
 bitmasks. Although the current ext_filter_mask u32 doesn't seem to
 be sufficient for events to trigger this.
 
 so much for a short note...
 
 Thanks
 .John
 
 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 1/2] kvm tools, seabios: Add --bios option to vm run

2012-02-24 Thread John Floren

Yeah, I'll try to work something up... got approval from Keith.

On Fri, Feb 24, 2012 at 8:27 AM, ron minnich rminn...@gmail.com wrote:
 I think you need to look at the change floren and I worked up,
 assuming I can get him to release it to you.

 Our change is a bit less complex and adds fewer lines of code. You've
 added an extra option to load a bios, and it will be somewhat limited
 as those ever-demanding users continue to demand more and more :-)

 What we did instead is let users specify the load address of the
 kernel, and the IP from which to start. So our command line ends up
 with --kernel being a seabios image, and using switches to set the
 load address at e and the initial IP at 0. Of course, now that
 I know the memory is reflected to high memory too, the initial IP
 could just as easily be fff0.

 Anyway, John, can you give us a look at a patch please?

 ron
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 1/2] kvm tools, seabios: Add --bios option to vm run

2012-02-24 Thread John Floren

On Fri, Feb 24, 2012 at 11:18 AM, Pekka Enberg penb...@kernel.org wrote:
 On Fri, Feb 24, 2012 at 9:01 PM, Pekka Enberg penb...@kernel.org wrote:
 Aah, I guess we need to implement proper support for QEMU BIOS config
 port (0x510) because the dummy port is accidentally asking for a
 boot menu.

 If I disable boot menu support from SeaBIOS, I'm now seeing this PCI
 out of address space error which I suppose is what Ron and John were
 talking about earlier:

 [snip, snip]

 So looking at SeaBIOS code, it seems to me we could simply make LKVM
 lie to it by claiming to be coreboot and get away with it, no? We'd
 basically avoid all the PCI allocation passes and such.

 How difficult is it to initialize a struct cb_header data structure?
 src/coreboot.c::find_cb_header() seems to be rather strict but
 certainly nothing that we fundamentally can't do.

 Another option is to extend the Xen codepaths to support LKVM too.

                                Pekka

Just tell SeaBIOS to compile for Coreboot. That seemed to get me going.

John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 1/2] kvm tools, seabios: Add --bios option to vm run

2012-02-24 Thread John Floren

Here are some small changes I made to get SeaBIOS to work on lkvm.
Some may not be entirely needed.

Oh, and I've also attached the .config I used.

John

On Fri, Feb 24, 2012 at 11:29 AM, John Floren j...@jfloren.net wrote:
 On Fri, Feb 24, 2012 at 11:18 AM, Pekka Enberg penb...@kernel.org wrote:
 On Fri, Feb 24, 2012 at 9:01 PM, Pekka Enberg penb...@kernel.org wrote:
 Aah, I guess we need to implement proper support for QEMU BIOS config
 port (0x510) because the dummy port is accidentally asking for a
 boot menu.

 If I disable boot menu support from SeaBIOS, I'm now seeing this PCI
 out of address space error which I suppose is what Ron and John were
 talking about earlier:

 [snip, snip]

 So looking at SeaBIOS code, it seems to me we could simply make LKVM
 lie to it by claiming to be coreboot and get away with it, no? We'd
 basically avoid all the PCI allocation passes and such.

 How difficult is it to initialize a struct cb_header data structure?
 src/coreboot.c::find_cb_header() seems to be rather strict but
 certainly nothing that we fundamentally can't do.

 Another option is to extend the Xen codepaths to support LKVM too.

                                Pekka

 Just tell SeaBIOS to compile for Coreboot. That seemed to get me going.

 John
From 3f304bb583627d000be0c8ea8039c0c7c8a4c781 Mon Sep 17 00:00:00 2001
From: John Floren j...@jfloren.net
Date: Fri, 24 Feb 2012 17:11:44 -0800
Subject: [PATCH] Tweaks to allow SeaBIOS to boot on lkvm. Remember to build
 for Coreboot when you do make menuconfig.

---
 src/Kconfig  |8 
 src/clock.c  |3 +++
 src/coreboot.c   |4 ++--
 src/paravirt.c   |4 ++--
 src/pirtable.c   |2 +-
 src/post.c   |8 
 src/virtio-blk.c |   25 +++--
 7 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/src/Kconfig b/src/Kconfig
index 250663a..21e13aa 100644
--- a/src/Kconfig
+++ b/src/Kconfig
@@ -108,7 +108,7 @@ menu Hardware support
 help
 Support for AHCI disk code.
 config VIRTIO_BLK
-depends on DRIVES  !COREBOOT
+depends on DRIVES
 bool virtio-blk controllers
 default y
 help
@@ -298,13 +298,13 @@ endmenu
 
 menu BIOS Tables
 config PIRTABLE
-depends on !COREBOOT
+#depends on !COREBOOT
 bool PIR table
 default y
 help
 Support generation of a PIR table in 0xf000 segment.
 config MPTABLE
-depends on !COREBOOT
+#depends on !COREBOOT
 bool MPTable
 default y
 help
@@ -316,7 +316,7 @@ menu BIOS Tables
 Support generation of SM BIOS tables.  This is also
 sometimes called DMI.
 config ACPI
-depends on !COREBOOT
+#depends on !COREBOOT
 bool ACPI
 default y
 help
diff --git a/src/clock.c b/src/clock.c
index e8a48a1..ee6c15b 100644
--- a/src/clock.c
+++ b/src/clock.c
@@ -68,6 +68,7 @@ u8 no_tsc VAR16VISIBLE;
 static void
 calibrate_tsc(void)
 {
+if (0) {
 u32 eax, ebx, ecx, edx, cpuid_features = 0;
 cpuid(0, eax, ebx, ecx, edx);
 if (eax  0)
@@ -107,6 +108,8 @@ calibrate_tsc(void)
 
 dprintf(1, CPU Mhz=%u\n, hz / 100);
 }
+SET_GLOBAL(cpu_khz, 2128000);
+}
 
 static u64
 emulate_tsc(void)
diff --git a/src/coreboot.c b/src/coreboot.c
index e328c15..976f20d 100644
--- a/src/coreboot.c
+++ b/src/coreboot.c
@@ -179,9 +179,9 @@ coreboot_fill_map(void)
 fail:
 // No table found..  Use 16Megs as a dummy value.
 dprintf(1, Unable to find coreboot table!\n);
-RamSize = 16*1024*1024;
+RamSize = 512*1024*1024;
 RamSizeOver4G = 0;
-add_e820(0, 16*1024*1024, E820_RAM);
+add_e820(0, RamSize, E820_RAM);
 return;
 }
 
diff --git a/src/paravirt.c b/src/paravirt.c
index 9cf77de..c38a84d 100644
--- a/src/paravirt.c
+++ b/src/paravirt.c
@@ -46,8 +46,8 @@ void qemu_cfg_port_probe(void)
 char *sig = QEMU;
 int i;
 
-if (CONFIG_COREBOOT)
-return;
+//if (CONFIG_COREBOOT)
+//return;
 
 qemu_cfg_present = 1;
 
diff --git a/src/pirtable.c b/src/pirtable.c
index 4c3f1ff..7cb358e 100644
--- a/src/pirtable.c
+++ b/src/pirtable.c
@@ -17,7 +17,7 @@ struct pir_table {
 } PACKED;
 
 extern struct pir_table PIR_TABLE;
-#if CONFIG_PIRTABLE  !CONFIG_COREBOOT
+#if CONFIG_PIRTABLE // !CONFIG_COREBOOT
 struct pir_table PIR_TABLE __aligned(16) VAR16EXPORT = {
 .pir = {
 .version = 0x0100,
diff --git a/src/post.c b/src/post.c
index b4ad1fa..880a238 100644
--- a/src/post.c
+++ b/src/post.c
@@ -157,10 +157,10 @@ ram_probe(void)
 static void
 init_bios_tables(void)
 {
-if (CONFIG_COREBOOT) {
-coreboot_copy_biostable();
-return;
-}
+//if (CONFIG_COREBOOT) {
+//coreboot_copy_biostable();
+//return;
+//}
 if (usingXen()) {
 	xen_copy_biostables();
 	return;
diff --git a/src/virtio-blk.c b/src/virtio-blk.c
index b869189..9022601 100644
--- a/src/virtio-blk.c
+++ b/src

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-17 Thread John Fastabend

On 2/17/2012 6:28 AM, jamal wrote:
 On Wed, 2012-02-15 at 17:26 -0800, John Fastabend wrote:
 On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:

 Roopa was likely on the right track here,

 http://patchwork.ozlabs.org/patch/123064/

 Doesnt seem related to the bridging stuff - the modeling looks
 reasonable however.


 The operations are really the same ADD/DEL/GET additional MAC
 addresses to a port, in this case a macvlan type port. The
 difference is the  macvlan port type drops any packet with an
 address not in the FDB where the bridge type floods these.
 
 Ok.
 [the vlan piece really should have been an integrated part of bridging;
 in the early days this was the case]
 
 
 [root@jf-dev1-dcblab src]# br fdb help
 Usage: br fdb { add | del | replace } ADDR dev DEV
br fdb {show} [ dev DEV ]

 In my example I just dumped all bridge devices,

 
 Ok, makes sense.
 
 
 Seems we need both a synchronize and a { add | del | replace } option.
 
 I am conflicted on this.
 Not sure if that is a command line thing or something built into a user
 space daemon. It may be useful to have the command line variant but i
 feel having a daemon take care of things helps in faster
 synchronization.
 I think user space is a good spot to add such functionality (as opposed
 to the kernel). That way user space can work with h/ware switching such
 as yours as well as a standalone switching chips (from sillicon vendors
 like Marvel etc).
 IMO, the average user doesnt need to be aware of such low level stuff;
 so the default should be for the user not to be responsible for
 configuration of synchronization. IOW, I want to just run well
 understood user interface tools things like ifconfig, ip link etc, the
 new br tool and not even need to be aware that we are offloading.
 So as long as s/w br0 is mapping to the bridge on ixgb-0 i dont need
 to know ixgb0 h/w bridge exists.
 

Yes I agree that is the goal.

 One last comment:
 With synchronization there are other challenges when the entry in the
 hardware conflicts with the entry in software when you intend the
 behavior to be the same. This is not such a big deal with bridging but
 becomes more apparent when you start offloading ACLs etc.
 

OK and these sorts of conflicts certainly don't need to be resolved
by kernel code. So I think this is a reasonable reason to drive the
synchronization into a user space daemon.

 
 So I think what your saying is a per port bit to disable learning...
 hmm but if you start tweaking it too much it looks less and less like a
 802.1D bridge and more like something you would want to build with tc or
 openvswitch or tc+bridge or tc+macvlan.
 
 These are pretty commodity features in most silicon switching chips ive
 come across. You have a knob to control learning and another to control
 flooding.
 

All right this looks like a follow up patch to me. First build the interface
to configure the HW FDB. Then a second series to add a flooding knob which
works for both embedded switches and software switches.

 cheers,
 jamal
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread John Fastabend

On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:
 
 Roopa was likely on the right track here,

 http://patchwork.ozlabs.org/patch/123064/
 
 Doesnt seem related to the bridging stuff - the modeling looks
 reasonable however.
 

The operations are really the same ADD/DEL/GET additional MAC
addresses to a port, in this case a macvlan type port. The
difference is the  macvlan port type drops any packet with an
address not in the FDB where the bridge type floods these.

 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.

 An ideal user space interaction IMHO would look like,

 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 portmac addrflags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local
 
 Looks nice, where is the targeted bridge(eg br0) in that syntax?

[root@jf-dev1-dcblab src]# br fdb help
Usage: br fdb { add | del | replace } ADDR dev DEV
   br fdb {show} [ dev DEV ]

In my example I just dumped all bridge devices,

#br fdb show dev bridge0

 
 Using Stephen's br tool. First command adds FDB entry to SW bridge and
 if the same tool could be used to add entries to embedded bridge I think
 that would be the best case. 
 
 That would be nice (although adds dependency on the presence of the
 s/ware bridge). Would be nicer to have either a knob in the kernel to
 say synchronize with h/w bridge foo which can be turned off.  
 

Seems we need both a synchronize and a { add | del | replace } option.

 So no RTNETLINK error on the second cmd. Then
 embedded FDB entries could be dumped this way also so I get a complete view
 of my FDB setup across multiple sw bridges and embedded bridges.
 
 So if you had multiple h/ware bridges - which one is tied to br0? 
 

Not sure I follow but does the additional dev parameter above answer this?

 
 Yes. The hardware has a bit to support this which is currently not exposed
 to user space. That's a case where we have 'yet another knob' that needs
 a clean solution. This causes real bugs today when users try to use the
 macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
 all part of the 802.1Qbg spec which people actually want to use with Linux
 so a good clean solution is probably needed.
 
 
 I think the knobs to flood and learn are important. The hardware
 seems to have the flood but not the learn/discover. I think the
 s/ware bridge needs to have both. At the moment - as pointed out in that
 *NEIGH* notification, s/w bridge assumes a policy that could be
 considered a security flaw in some circles - just because you are my
 neighbor does not mean i trust you to come into my house; i may trust
 you partially and allow you only to come through the front door. Even in
 Canada with a default policy of not locking your door we sometimes lock
 our doors ;-
 
 
 I have no problem with drawing the line here and trying to implement 
 something
 over PF_BRIDGE:RTM_xxx nlmsgs. 
 
 
 My comment/concern was in regard to the bridge built-in policy of
 reading from the neighbor updates (refer to above comments)
 

So I think what your saying is a per port bit to disable learning...

hmm but if you start tweaking it too much it looks less and less like a
802.1D bridge and more like something you would want to build with tc or
openvswitch or tc+bridge or tc+macvlan.

.John

 cheers,
 jamal
 
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread John Fastabend

On 2/14/2012 5:18 AM, jamal wrote:
 On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:
 
 The use case here is multiple VFs but the same solution should work with
 multiple PFs as well. FDB controls should be independent of how the ports
 are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.
 
 Makes sense.
 
 With events and ADD/DEL/GET FDB controls we can solve both cases. This also
 solves Roopa's case with macvlan where she wants to add additional addresses
 to macvlan ports.
 
 Not familiar with that issue - I'll prowl the list.

Roopa was likely on the right track here,

http://patchwork.ozlabs.org/patch/123064/

But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
netlink messages. And if possible drive this without extending ndo_ops.

An ideal user space interaction IMHO would look like,

[root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
[root@jf-dev1-dcblab iproute2]# ./br/br fdb
portmac addrflags
veth2   36:a6:35:9b:96:c4   local
veth4   aa:54:b0:7b:42:ef   local
veth0   2a:e8:5c:95:6c:1b   local
veth6   6e:26:d5:43:a3:36   local
veth0   f2:c1:39:76:6a:fb
veth8   4e:35:16:af:87:13   local
veth10  52:e5:62:7b:57:88   static
veth10  aa:a9:35:21:15:c4   local
[root@jf-dev1-dcblab iproute2]# ./br/br fdb add dev eth3 to 52:e5:62:7b:57:88
RTNETLINK answers: Invalid argument

Using Stephen's br tool. First command adds FDB entry to SW bridge and
if the same tool could be used to add entries to embedded bridge I think
that would be the best case. So no RTNETLINK error on the second cmd. Then
embedded FDB entries could be dumped this way also so I get a complete view
of my FDB setup across multiple sw bridges and embedded bridges.

I don't think br is part of iproute2 yet I just pulled it out of some RFC
but it works reasonably well and is intuitive enough.

 
 Yes it should flood here, unless its acting as a 802.1Qbg VEB or VEPA.
 
 Ok. So there is a toggle somewhere which controls how flooding should
 happen.
 

Yes. The hardware has a bit to support this which is currently not exposed
to user space. That's a case where we have 'yet another knob' that needs
a clean solution. This causes real bugs today when users try to use the
macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
all part of the 802.1Qbg spec which people actually want to use with Linux
so a good clean solution is probably needed.


 Maybe not. But the kernel already has the needed signals with one extra
 hook we can save running a daemon in user space. Maybe that's not a great
 argument to add kernel code though.
 
 You make a reasonable arguement to have it in the kernel but i think we
 win more if we separate the control. So while i empathize, I am hoping
 that youd go with the path that is hard to travel ;-
 
 The PF_BRIDGE:RTM_GETNEIGH,RTM_NEWNEIGH,RTM_DELNEIGH are registered in the
 br_netlink_init() path. 
 
 Hrm - hadnt paid attention to that before. Nasty.
 The bridge seems to be hard-coding policy on station movement, no? 
 This is a good example of the qualms i have on adding things to the
 kernel;-
 I may not want to auto update a MAC address moving ports as part of
 some policy i have. I can go and add YAK (Yet Another Knob) - but where
 is the line drawn?
 

I have no problem with drawing the line here and trying to implement something
over PF_BRIDGE:RTM_xxx nlmsgs. I'll work with Roopa and see if we can come
up with something in the next couple days.

w.r.t. VEPA/VEB and flooding behavior we could probably have a bit to indicate
if the port is a flooding port or not. Then users could build any sort of 
forwarding
table they wanted OR we could just drive it through a notifier (ndo_ops?) in the
macvlan path which does VEPA today.

OK I'll try to write some actual code now that can be critiqued.

 cheers,
 jamal
 
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread John Fastabend

On 2/14/2012 11:05 AM, Stephen Hemminger wrote:
 On Tue, 14 Feb 2012 10:57:04 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 On 2/14/2012 5:18 AM, jamal wrote:
 On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:

 The use case here is multiple VFs but the same solution should work with
 multiple PFs as well. FDB controls should be independent of how the ports
 are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.

 Makes sense.

 With events and ADD/DEL/GET FDB controls we can solve both cases. This also
 solves Roopa's case with macvlan where she wants to add additional 
 addresses
 to macvlan ports.

 Not familiar with that issue - I'll prowl the list.

 Roopa was likely on the right track here,

 http://patchwork.ozlabs.org/patch/123064/

 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.

 An ideal user space interaction IMHO would look like,

 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 portmac addrflags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add dev eth3 to 52:e5:62:7b:57:88
 RTNETLINK answers: Invalid argument
 
 I am going to put bridge (nameclash with br) tool into iproute2 (soon).

I've been using it on my dev box for awhile now and it works well for
me.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-13 Thread John Fastabend

On 2/10/2012 7:18 AM, jamal wrote:
 Hi John,
 
 I went backwards to summarize at the top after going through your email.
 
 TL;DR version 0.1: 
 you provide a good use case where it makes sense to do things in the
 kernel. IMO, you could make the same arguement if your embedded switch
 could do ACLs, IPv4 forwarding etc. And the kernel bloats.
 I am always bigoted to move all policy control to user space instead of
 bloating in the kernel.
 
  
 On Thu, 2012-02-09 at 20:14 -0800, John Fastabend wrote:
 

 Hi Jamal,

 The user space app in this case would listen for FDB updates to the SW
 bridge and then mirror them at the embedded NIC. In this case it seems
 easier to just add a notifier chain and let the kernel keep these in
 sync. Otherwise we need a daemon in user space to replicate these.

 
 A user space daemon if you need to ensure synchronization. Thats what i
 meant when i said there was a disadvantage over the simple case when
 the goal is always to synchronize.
 
 On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
 and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
 would have one common interface to drive these. But the bridge already
 has this protocol/msgtype so that would require either some demux or
 new protocol/msgtype pairs to be created. 

 
 The bridge is very netlink friendly these days. Given the rest of the
 network stack (*NEIGH* you mention above) talks netlink to user space
 it should be workable. 
 
 Let me think on it. I'm tempted by the simplicity of adding notifier
 hooks though.
 
 If something is missing bridge-side it may need to be added (as Per
 Stephen's comment) - i just took it one further indicating those
 notifiers need to also netlink-speak
 

Sure.

 
 Actually because the bridge is adding/removing fdb entries dynamically
 maybe its best this gets done in kernel. Here's the example case,
 
 [..]
 

 With the flow by letters above hope this is not too difficult to follow.
 
 (A) veth0 a virtual device transmits packet destined for ethx.y
 (B) SW bridge receives frames and updates FDB flooding to C
 (C) eth0 the PF in this case sends the frame to the HW backed by the
 embedded bridge
 
 Following so far.
 Can you have more than one PF per embedded switch? Or is the intent here
 purely to do VMs/VF separation?
 

The use case here is multiple VFs but the same solution should work with
multiple PFs as well. FDB controls should be independent of how the ports
are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.

 (D) The HW embedded switch has a static entry for ethx.y and forwards
 the frame to the VF or if its a broadcast frame also floods it to
 the wire and ethx.y
 
 nod.
 
 (E) ethx.y receives the frame and generates a response to the dest mac of
 veth0
 
 nod.
 Since you said in #D the entries in the switch are static, I am assuming
 at this point neither ethx.y nor veth0 exist in the embedded FDB.
 
 Now here is the potential issue,

 (G) The frame transmitted from ethx.y with the destination address of
 veth0 but the embedded switch is not a learning switch. If the FDB
 update is done in user space its possible (likely?) that the FDB
 entry for veth0 has not been added to the embedded switch yet. 
 
 Ok, got it - so the catch here is the switch is not capable of learning.
 I think this depends on where learning is done. Your intent is to
 use the S/W bridge as something that does the learning for you i.e in
 the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
 And that maybe the case for your use case.
 

This is _my_ use case today.

 What if I dont wanna run the S/W bridge at all?
 Ive been making a point that with a simple knob(Stephen doesn like to
 add such a knob), the SW bridge could defer learning to user space. 
 [This way you can add a lot of richness e.g on ACLs such as restricting
 what MAC addresses etc are allowed to talk to which ones etc.].
 But if bypass the s/w bridge all together and learn in user space
 or have a static config in which i populate the embedded switch, i dont
 see the issue.

With events and ADD/DEL/GET FDB controls we can solve both cases. This also
solves Roopa's case with macvlan where he wants to add additional addresses
to macvlan ports.

 
 Now
 we either have to flood the frame which is not horrible but not
 ideal or worse if the embedded switch does not support flooding send
 it to the wire and veth0 never receives it. 
 
 If it is a switch it has to flood, no? Otherwise it sounds broken.
 

Yes it should flood here, unless its acting as a 802.1Qbg VEB or VEPA.

 If the SW bridge pushes
 the FDB update down into the embedded switch the address is for
 sure in the embedded switches forwarding tables and the switching
 works as expected.
 
 Yes, there is a small gap between the s/w bridge learning and the
 synchronization happening to the embedded nic switch. That gap gets
 larger if you defer learning

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/8/2012 8:36 PM, Stephen Hemminger wrote:
 On Wed, 08 Feb 2012 19:22:06 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.

 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.


   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   

 This is only an RFC couple more changes are needed.

 (1) Optimize HW FDB set/del to only walk list if an FDB offloaded
 device is attached. Or decide it doesn't matter from unlikely()
 path.

 (2) Is it good enough to just call dev_uc_{add|del} or
 dev_mc_{add|del}? Or do some devices really need a new netdev
 callback to do this operation correctly. I think it should be
 good enough as is.

 (3) wrapped list walk in rcu_read_lock() just in case maybe every
 case is already inside rcu_read_lock()/unlock().

 Also this is in response to this thread regarding the macvlan and
 exposing rx filters posting now to see if folks think this is the
 right idea and if it will resolve at least the bridge case.

 http://lists.openwall.net/netdev/2011/11/08/135

 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 ---

  include/linux/netdev_features.h |2 ++
  net/bridge/br_fdb.c |   34 ++
  2 files changed, 36 insertions(+), 0 deletions(-)

 diff --git a/include/linux/netdev_features.h 
 b/include/linux/netdev_features.h
 index 77f5202..5936fae 100644
 
 Rather than yet another device feature, I would rather use netlink_notifier
 callback. The notifier is more general and generic without messing with 
 internals
 of bridge.
 

But the device features makes it easy for user space to learn that the device
supports this sort of offload. Now if all SR-IOV devices support this then it
doesn't matter but I thought there were SR-IOV devices that didn't do any
switching? I'll dig through the SR-IOV drivers to check there are not too
many of them.

By netlink_notifier do you mean adding a notifier_block and using 
atomic_notifier_call_chain()
probably in rtnl_notify()? Then drivers could register with the notifier chain 
with
atomic_notifier_chain_register() and receive the events correctly. Or did I miss
some notifier chain that already exists?

Thanks,
John

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 9:40 AM, Stephen Hemminger wrote:
 On Thu, 09 Feb 2012 09:36:47 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 But the device features makes it easy for user space to learn that the device
 supports this sort of offload. Now if all SR-IOV devices support this then it
 doesn't matter but I thought there were SR-IOV devices that didn't do any
 switching? I'll dig through the SR-IOV drivers to check there are not too
 many of them.
 
 If user space needs to know then the OS is not designed properly.
 The purpose of the network device is to abstract all those details, and more 
 and more
 of them are bleeding through. This makes writing management applications 
 harder and makes
 things dependent on features that may or may not be present. The best design 
 is when
 the change is invisible.
 

Agreed.

 By netlink_notifier do you mean adding a notifier_block and using 
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier 
 chain with
 atomic_notifier_chain_register() and receive the events correctly. Or did I 
 miss
 some notifier chain that already exists?
 
 Yes. that is what I mean. The callbacks you need may or may not already be 
 present.


OK thanks I'll put together an update here shortly.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
 On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.

 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.


   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   

 
 This scenario works now as adding an interface to a bridge puts it in
 promiscuous mode. So adding a PF to a software bridge should not be
 a problem as it supports promiscuous mode. But adding a VF will not
 work.

It shouldn't work because the embedded bridge will lookup the address
in its FDB and when it doesn't match any unicast filters it will forward
the packet onto the wire. Because the veth0 and veth2 above never get
inserted into the embedded brdige's FDB the packets will _never_ get
routed there.

That said the current 'ixgbe' driver is doing something broken in that
it is always setting the unicast hash table and mirroring bits to 1. So
if you think this is working your seeing a bug where packets are being
sent onto the wire AND upto the PF. Packets with destination addresses
matching veth1 should not end up on the wire and vice versa. This is
specific to ixgbe and is not the case for other SR-IOV devices.

This causes some issues (a) has some very real performance implications,
(b) at this point you have some strange behavior from my point of view.
The embedded bridge is not a learning bridge nor is it acting like an
802.1Q VEB or VEPA.

 
 Are you trying to avoid the requirement of having to put the interface 
 in promiscuous mode when adding to a bridge?
 

I think the bridge being in promiscuous mode is correct.

Hope that helps sorry its a bit long winded.
John



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 4:39 PM, Sridhar Samudrala wrote:
 On Thu, 2012-02-09 at 12:30 -0800, John Fastabend wrote:
 On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
 On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.

 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.


   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   


 This scenario works now as adding an interface to a bridge puts it in
 promiscuous mode. So adding a PF to a software bridge should not be
 a problem as it supports promiscuous mode. But adding a VF will not
 work.

 It shouldn't work because the embedded bridge will lookup the address
 in its FDB and when it doesn't match any unicast filters it will forward
 the packet onto the wire. Because the veth0 and veth2 above never get
 inserted into the embedded brdige's FDB the packets will _never_ get
 routed there.

 That said the current 'ixgbe' driver is doing something broken in that
 it is always setting the unicast hash table and mirroring bits to 1. So
 if you think this is working your seeing a bug where packets are being
 sent onto the wire AND upto the PF. Packets with destination addresses
 matching veth1 should not end up on the wire and vice versa. This is
 specific to ixgbe and is not the case for other SR-IOV devices.
 
 OK. Is this behavior going to be fixed.
 

Only after we have a mechanism to either configure the NIC FDB directly
or have it stay in sync with the SW switch. Flooding traffic seems better
than being unable to send traffic to the virtual device altogether. This
behavior is driver specific some devices just fail outright.

I'm thinking over Jamal's comment now.


 This causes some issues (a) has some very real performance implications,
 (b) at this point you have some strange behavior from my point of view.
 The embedded bridge is not a learning bridge nor is it acting like an
 802.1Q VEB or VEPA.


 Are you trying to avoid the requirement of having to put the interface 
 in promiscuous mode when adding to a bridge?


 I think the bridge being in promiscuous mode is correct.
 
 The interface that is added to the bridge is put in promiscuous mode,
 not the bridge itself.  In this example, i assumed that setting
 promiscuous on PF is putting the embedded bridge in learning mode.


Yes I misspoke I mean the PF. The embedded bridge in this case does not
support learning. Also I'm not aware of any SR-IOV NICs that do support
learning.

 
 Thanks
 Sridhar
 

 Hope that helps sorry its a bit long winded.
 John



 
 
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 1:11 PM, jamal wrote:
 On Thu, 2012-02-09 at 09:52 -0800, John Fastabend wrote:
 
 By netlink_notifier do you mean adding a notifier_block and using 
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier 
 chain with
 atomic_notifier_chain_register() and receive the events correctly. Or did 
 I miss
 some notifier chain that already exists?

 Yes. that is what I mean. The callbacks you need may or may not already be 
 present.
 
 I'll go one step further.
 This stuff shouldnt be in the kernel at all. 
 The disadvantage is you need a user space app to update the hardware.
 i.e, the same mechanism should be usable for either a switch embedded
 in a NIC or a standalone hardware switch (with/out the s/ware bridge 
 presence)
 
 cheers,
 jamal
 

Hi Jamal,

The user space app in this case would listen for FDB updates to the SW
bridge and then mirror them at the embedded NIC. In this case it seems
easier to just add a notifier chain and let the kernel keep these in
sync. Otherwise we need a daemon in user space to replicate these.

On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
would have one common interface to drive these. But the bridge already
has this protocol/msgtype so that would require either some demux or
new protocol/msgtype pairs to be created. 

Let me think on it. I'm tempted by the simplicity of adding notifier
hooks though.

.John


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 6:14 PM, John Fastabend wrote:
 On 2/9/2012 1:11 PM, jamal wrote:
 On Thu, 2012-02-09 at 09:52 -0800, John Fastabend wrote:

 By netlink_notifier do you mean adding a notifier_block and using 
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier 
 chain with
 atomic_notifier_chain_register() and receive the events correctly. Or did 
 I miss
 some notifier chain that already exists?

 Yes. that is what I mean. The callbacks you need may or may not already be 
 present.

 I'll go one step further.
 This stuff shouldnt be in the kernel at all. 
 The disadvantage is you need a user space app to update the hardware.
 i.e, the same mechanism should be usable for either a switch embedded
 in a NIC or a standalone hardware switch (with/out the s/ware bridge 
 presence)

 cheers,
 jamal

 
 Hi Jamal,
 
 The user space app in this case would listen for FDB updates to the SW
 bridge and then mirror them at the embedded NIC. In this case it seems
 easier to just add a notifier chain and let the kernel keep these in
 sync. Otherwise we need a daemon in user space to replicate these.
 
 On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
 and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
 would have one common interface to drive these. But the bridge already
 has this protocol/msgtype so that would require either some demux or
 new protocol/msgtype pairs to be created. 
 
 Let me think on it. I'm tempted by the simplicity of adding notifier
 hooks though.
 
 .John
 

Actually because the bridge is adding/removing fdb entries dynamically
maybe its best this gets done in kernel. Here's the example case,


  -- -
  | ethx.y |   E| veth0 |  --- A
  -- -
  |   |
  |   |
  |   |
  | --
  | |  SW Bridge | --- B
  | --
  |   |
  |   |
  |  -
  |  | eth0  | --- C
  |  -
  |   |
   ---
   |embedded switch  | --- D
   ---
   |
   |
   G

With the flow by letters above hope this is not too difficult to follow.

(A) veth0 a virtual device transmits packet destined for ethx.y
(B) SW bridge receives frames and updates FDB flooding to C
(C) eth0 the PF in this case sends the frame to the HW backed by the
embedded bridge
(D) The HW embedded switch has a static entry for ethx.y and forwards
the frame to the VF or if its a broadcast frame also floods it to
the wire and ethx.y
(E) ethx.y receives the frame and generates a response to the dest mac of
veth0

Now here is the potential issue,

(G) The frame transmitted from ethx.y with the destination address of
veth0 but the embedded switch is not a learning switch. If the FDB
update is done in user space its possible (likely?) that the FDB
entry for veth0 has not been added to the embedded switch yet. Now
we either have to flood the frame which is not horrible but not
ideal or worse if the embedded switch does not support flooding send
it to the wire and veth0 never receives it. If the SW bridge pushes
the FDB update down into the embedded switch the address is for
sure in the embedded switches forwarding tables and the switching
works as expected.

So to handle this case correctly its probably best IMHO to use a notifier
hook. Having a RTM_GETNEIGH for the embedded switch implemented though
would be nice for dumping the FDB of the embedded switch and SET/DEL
could be used to configure the FDB when its not being driven by the SW
switch. Of course we should try to be minimalists here.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-08 Thread John Fastabend

On 2/5/2012 8:54 AM, Roopa Prabhu wrote:
 
 
 
 On 2/3/12 7:32 AM, Roopa Prabhu ropra...@cisco.com wrote:
 



 On 2/2/12 10:58 AM, John Fastabend john.r.fastab...@intel.com wrote:
 snip..
 
 Are you sure they will be good to have? I'm  not so sure you want to be
 able to manipulate the uc and mc tables from user space. MACVLAN seems to
 be one type of device where it is useful but doing this to a PF or VF seems
 hard to use for any real use case. Fun to test the embedded bridge though.


 I wont say I am sure. Would be nice have to have netlink interfaces to
 ADD/DEL additional uc, mc addrs, filter flags and vlans. I have looked at
 the existing interfaces and nothing seemed straightforward then. But I
 forget and need to take a look again.
 I think vlans and filter flags is somehow possible today. And maybe mc too.
 But if I am right we don't have a way to add additional unicast addresses
 from userspace. 

 I will dig my notes and try and list down the problems with using the
 existing netlink interfaces for this.
 
 There are kernel api's/ops to add/del hw uc/mc/vlan/filter filter flags:
 Ndo_set_rx_mode, add/del_vid, dev_uc_add, dev_mc_add and dev_filter_flags.
 
 But there are no straight forward mechanisms to add these from userspace. L2
 mc addresses can be added by SIOCADDMULTI. And filter flags maybe via
 netlink. Nothing for uc and vlan as far as I know (correct me if I am
 wrong). Setting of hw filters is usually done indirectly by the kernel
 during creation of vlan devices for example.
 
 There is a netlink msg to create a vlan device. But there is no way to add a
 vlan filter directly to the hw. Nothing for secondary uc addrs.
 This is ok for all cases except for the virtualization case I am trying to
 solve. 

Well there is the somewhat asymmetric case where we allow port VLANs to be set
on a VF but not on the PF. I can think of cases (802.1Qbg) where firmware
might be doing EVB or CDCP and enforce a specific filtering. With current
asymmetric interfaces I'm not sure how to expose this on the PF.

to see how this works look for 'ifla_vf_vlan'.

 
 To summarize,
 
 The requirement is to have a mechanism from userspace to populate hw filters
 on a device. And this is required to program guest nic filters into the host
 device backing the guest nic. In the direct attach case, its the macvtap
 device and in turn the macvtap lower device.

Yes I agree we would like a mechanism to do this.

 
 Today I cant think of any other use case that would require this (except
 that there is a brief chance that this could be used in the hybrid
 acceleration stuff that ben and intel have been discussing).

I'll post a RFC I hacked out today to do this with the ./net/bridge code
in a minute. (still needs testing and some fixups).

 
 I see the below ways this can be done:
 1) TUNSETTXFILTER: My v1 patch, that targets only the above specific macvtap
 problem. This works for only uc/mc and flags filter. Possibly requires a new
 cmd TUNSETVLANFILTER for vlan filters.
 
 
 2) rtnetlink ops for setting hw filters: My v2 patch targeting virtual
 devices that implement rtnl_link_ops. Example macvtap/macvlan
 
 This netlink interface to set filters follows TUNSETTXFILTER giving the
 ability to set filters on these devices. The netlink payload must contain
 all the uc, mc, vid's and filter flags that go on the device.
 
 
 3) netdev_ops for setting hw filters: my v3 and v4 patches. This is same as
 2 but moves the ops to netdev, so that it can be used by all devices if
 required.
 
 
 4) v5 (New approach. Not submitted yet):
 In 2 and 3 above, the netlink msg could be broken down to have separate msgs
 to support add/del of uc/mc/vlan. This should be close to what we have today
 for vf vlan and vf mac. (Something similar to what John Fastabend was
 suggesting too). Advantage, use existing hw ops. (This slightly varies from
 the original goal which was not targeted at getting in to uc,mc lists alone.
 The goal was to have macvlan maintain its own filters and use it in fwding
 lookups if needed in the future. But I guess if we need this in the future
 we could possibly use the macvlan uc, mc lists.)
 

hmm I'm on the fence here. I like that (4) is generic but I'm not sure I
would want user space to come in and whack a uc addr added from a stacked
device. Looks like there is a free bit in netdev_hw_addr maybe we could add
another bool here to let user space only modify/delete entries that haven't
been locked by the kernel.

That said (1) seems like the straight forward approach and if we don't have
a compelling reason and use case to add the ability to modify the uc and mc
lists generically then adding features for features sake might not be a good.
I'm leaning towards (1) unless someone has a use case for the others and is
really going to implement something with it.

.John

 
 Netlink msgs to set hw filters (basically for dev_uc_add/del,
 dev_mc_add/del, and vlans). The below is not a final cut. Just attempting

[RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-08 Thread John Fastabend

Propagate software FDB table into hardware uc, mc lists when
the NETIF_F_HW_FDB is set.

This resolves the case below where an embedded switch is used
in hardware to do inter-VF or VF-PF switching. This patch
pushes the FDB entry (specifically the MAC address) into the
embedded switch with dev_add_uc and dev_add_mc so the switch
learns about the software bridge.


  veth0  veth2
|  |
  
  |  bridge0 |    software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \   propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge | hardware offloaded switching
  

This is only an RFC couple more changes are needed.

(1) Optimize HW FDB set/del to only walk list if an FDB offloaded
device is attached. Or decide it doesn't matter from unlikely()
path.

(2) Is it good enough to just call dev_uc_{add|del} or
dev_mc_{add|del}? Or do some devices really need a new netdev
callback to do this operation correctly. I think it should be
good enough as is.

(3) wrapped list walk in rcu_read_lock() just in case maybe every
case is already inside rcu_read_lock()/unlock().

Also this is in response to this thread regarding the macvlan and
exposing rx filters posting now to see if folks think this is the
right idea and if it will resolve at least the bridge case.

http://lists.openwall.net/netdev/2011/11/08/135

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 include/linux/netdev_features.h |2 ++
 net/bridge/br_fdb.c |   34 ++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 77f5202..5936fae 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -55,6 +55,8 @@ enum {
NETIF_F_NOCACHE_COPY_BIT,   /* Use no-cache copyfromuser */
NETIF_F_LOOPBACK_BIT,   /* Enable loopback */
 
+   NETIF_F_HW_FDB, /* Hardware supports switching */
+
/*
 * Add your fresh new feature above and remember to update
 * netdev_features_strings[] in net/core/ethtool.c and maybe
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 5ba0c84..4cc545b 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -81,9 +81,26 @@ static void fdb_rcu_free(struct rcu_head *head)
kmem_cache_free(br_fdb_cache, ent);
 }
 
+static void fdb_hw_delete(struct net_bridge *br,
+ struct net_bridge_fdb_entry *fdb)
+{
+   struct net_bridge_port *op;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(op, br-port_list, list) {
+   struct net_device *dev = op-dev;
+
+   if ((dev-features  NETIF_F_HW_FDB) 
+   dev != fdb-dst-dev)
+   dev_uc_del(dev, fdb-addr.addr);
+   }
+   rcu_read_unlock();
+}
+
 static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
 {
hlist_del_rcu(f-hlist);
+   fdb_hw_delete(br, f);
fdb_notify(br, f, RTM_DELNEIGH);
call_rcu(f-rcu, fdb_rcu_free);
 }
@@ -350,6 +367,22 @@ static struct net_bridge_fdb_entry *fdb_find_rcu(struct 
hlist_head *head,
return NULL;
 }
 
+static void fdb_hw_create(struct net_bridge *br,
+ struct net_bridge_fdb_entry *fdb)
+{
+   struct net_bridge_port *op;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(op, br-port_list, list) {
+   struct net_device *dev = op-dev;
+
+   if ((dev-features  NETIF_F_HW_FDB) 
+   dev != fdb-dst-dev)
+   dev_uc_add(dev, fdb-addr.addr);
+   }
+   rcu_read_unlock();
+}
+
 static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
   struct net_bridge_port *source,
   const unsigned char *addr)
@@ -363,6 +396,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct 
hlist_head *head,
fdb-is_local = 0;
fdb-is_static = 0;
fdb-updated = fdb-used = jiffies;
+   fdb_hw_create(source-br, fdb);
hlist_add_head_rcu(fdb-hlist, head);
}
return fdb;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH v0 2/2] ixgbe: add NETIF_F_HW_FDB to supported flags

2012-02-08 Thread John Fastabend

Add support for NETIF_F_HW_FDB flag when SR-IOV is enabled. This
allows the bridge to push fdb entries into the hardware so the
VF can communicate with virtual devices attached to the bridge.


  veth0  veth2
|  |
  
  |  bridge0 |    software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \   propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge | hardware offloaded switching
  


Signed-off-by: John Fastabend john.r.fastab...@intel.com
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   35 +
 1 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ecc46ce..66261fa 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3207,6 +3207,11 @@ static int ixgbe_write_uc_addr_list(struct net_device 
*netdev)
netdev_for_each_uc_addr(ha, netdev) {
if (!rar_entries)
break;
+
+   netif_printk(adapter, hw, KERN_DEBUG, adapter-netdev,
+%s %s: write vfn %i %pM\n,
+__func__, netdev-name, vfn, ha-addr);
+
hw-mac.ops.set_rar(hw, rar_entries--, ha-addr,
vfn, IXGBE_RAH_AV);
count++;
@@ -3268,16 +3273,17 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
}
ixgbe_vlan_filter_enable(adapter);
hw-addr_ctrl.user_set_promisc = false;
-   /*
-* Write addresses to available RAR registers, if there is not
-* sufficient space to store all the addresses then enable
-* unicast promiscuous mode
-*/
-   count = ixgbe_write_uc_addr_list(netdev);
-   if (count  0) {
-   fctrl |= IXGBE_FCTRL_UPE;
-   vmolr |= IXGBE_VMOLR_ROPE;
-   }
+   }
+
+   /*
+* Write addresses to available RAR registers, if there is not
+* sufficient space to store all the addresses then enable
+* unicast promiscuous mode
+*/
+   count = ixgbe_write_uc_addr_list(netdev);
+   if (count  0) {
+   fctrl |= IXGBE_FCTRL_UPE;
+   vmolr |= IXGBE_VMOLR_ROPE;
}
 
if (adapter-num_vfs) {
@@ -7214,6 +7220,10 @@ static netdev_features_t ixgbe_fix_features(struct 
net_device *netdev,
e_info(probe, rx-usecs set too low, not enabling RSC\n);
}
 
+   /* Only use offloaded FDB if SR-IOV is enabled */
+   if (!(adapter-flags  IXGBE_FLAG_SRIOV_ENABLED))
+   data = ~NETIF_F_HW_FDB;
+
return data;
 }
 
@@ -7549,9 +7559,12 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 
netdev-priv_flags |= IFF_UNICAST_FLT;
 
-   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED)
+   if (adapter-flags  IXGBE_FLAG_SRIOV_ENABLED) {
adapter-flags = ~(IXGBE_FLAG_RSS_ENABLED |
IXGBE_FLAG_DCB_ENABLED);
+   netdev-hw_features |= NETIF_F_HW_FDB;
+   netdev-features |= NETIF_F_HW_FDB;
+   }
 
 #ifdef CONFIG_IXGBE_DCB
netdev-dcbnl_ops = dcbnl_ops;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

virtio_ioport_write unexpected address

2012-02-06 Thread John Goerzen


Hi,

We're setting up some KVM systems with their disk image stored in a 
sparse raw file on an NFS4 server.  The NFS4 filesystem is mounted with 
the hard option, which means that I/O to it is blocked indefinitely when 
the server goes down, and will be properly handled when the server returns.


We observed a server failure over the weekend, and when the server 
returned -- not when it went down -- our Windows 2008 R2 VMs hard 
rebooted, with this in the libvirt log:


virtio_ioport_write: unexpected address 0x13 value 0x1

virtio_ioport_write: unexpected address 0x13 value 0x1

The downtime had been about 30 minutes, and after the downtime, the NFS 
filesystem was responsive.


We were presenting drive C to Windows as IDE, and a data drive with virtio.

The host system is Debian squeeze x64, which is running the stock Debian 
2.6.32 kernel and qemu-kvm 0.12.5.  The Windows VMs have the latest 
virtio drivers from Fedora.


Thanks,

-- John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend

On 2/1/2012 11:24 PM, Michael S. Tsirkin wrote:
 On Sun, Nov 20, 2011 at 08:30:24AM -0800, Roopa Prabhu wrote:



 On 11/17/11 4:15 PM, Ben Hutchings bhutchi...@solarflare.com wrote:

 Sorry to come to this rather late.

 On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
 [...]
 v2 - v3
 - Moved set and get filter ops from rtnl_link_ops to netdev_ops
 - Support for SRIOV VFs.
 [Note: The get filters msg (in the way current get rtnetlink 
 handles
 it) might get too big for SRIOV vfs. This patch follows existing
 sriov 
 vf get code and tries to accomodate filters for all VF's in a PF.
 And for the SRIOV case I have only tested the fact that the VF
 arguments are getting delivered to rtnetlink correctly. The code
 follows existing sriov vf handling code so rest of it should work
 fine]
 [...]

 This is already broken for large numbers of VFs, and increasing the
 amount of information per VF is going to make the situation worse.  I am
 no netlink expert but I think that the current approach of bundling all
 information about an interface in a single message may not be
 sustainable.

 Yes agreed. I have the same concern.
 
 So it seems that we need to extend the existing interface to allow
 tweaking filters per VF. Does it need to block this
 patchset though? After all, we'll need to support the existing

hmm not sure I follow what patchset is this blocking?

 interface indefinitely, too.
 

OK finally got to read through this. And its not clear to me why we need
these per VF/PF filter netdevice ops and netlink extensions if we can
get the stacking correct. (Adding filters to the macvlan seems reasonable
to me)

In the cases I saw listed above I see a few enumerations:

PF -- MACVLAN  --- Guest --- [...]

VF -- MACVLAN  --- Guest --- [...]   

VF|Guest --- [...]   direct assigned VF

PF|Guest --- [...]   direct assigned PF


I used '[...]' to represent whatever additional stacking is done in the
guest unknown to the host. In the direct assign VF case (Greg Rose
correct me if I am wrong) the normal uc and mc addr lists should suffice
along with the netdev op ndo_set_rx_mode(). Here the guest adds MAC
addresses and/or VLANS as normal and then the VF-PF back channel
should handle this if needed. This should work for Linux guests and other
OS's should do something similar.

In the direct assign PF case the hardware is owned by the guest so
no problems here.

This leaves the two MACVLAN cases which can be handled the same. If
the MACVLAN driver and netlink interface is extended to add filters
to the MACVLAN then the addresses can be pushed to the lower device
using the normal dev_uc_{add|del}() and dev_mc_{add|del}() routines.

I think this has some real advantages to the above scheme. First
we get rid of _all_ the drivers having to add a bunch of new
net_device ops and do it once in the layer above. This is nice
for driver implementers but also because your feature becomes usable
immediately and we don't have to wait for driver developers to implement
it.

Also it prunes down the number of netlink extensions being added
here. 

Additionally the existing semantics seem a bit strange to me on the
netlink message side. Taking a quick look at the macvlan implementation
it looks like every set has to have a complete list of address. But
the dev_uc_add and dev_uc_del seem to be using a refcnt scheme so
if I want to add a second address and then latter a third address
how does that work?

Is the expected flow from user space 'read uc_list - write uc_list'?
This seems risky because with two adders in user space you might
lose addresses unless they are somehow kept in sync. IMHO it is likely
easier to implement an ADD and DEL attribute rather than a table
approach. Took a quick stab at something like this below but there
might be a better way to do this and allow direct modification of the
uc and mc lists I think means you could remove a uc address added
by some stacked device maybe a VLAN. (just guessing.)

Sorry if I missed something in the above thread I read most of it. And
maybe I missed something or oversimplified the problem.

Thanks,
John



+/* MACVLAN ADDRLIST management section 
+ *
+ * Contains attributes to expose multicast and unicast hardware
+ * RX address filters to user space.
+ *
+ * FIELDS:
+ * - IFLA_ADDRLIST_{UC|MC}
+ *
+ *   Read only attributes, returns currently set mc or uc addr list.
+ *
+ * - IFLA_ADDRLIST_{UC|MC}_ADD
+ *
+ *   Write only attributes, adds listed addresses to dev uc or mc
+ *   RX filter address lists.
+ *
+ * - IFLA_ADDRLIST_{UC|MC}_DEL
+ *
+ *   Write only attributes, deletes listed addresses in dev uc or
+ *   mc RX filter address lists.
+ *
+ * PRECEDENCE:
+ *
+ * Add operations are parsed before delete operations. Passing a
+ * single netlink message with a single address in both the add
+ * and del lists will result in an addresses being added and then
+ * removed

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend

On 2/2/2012 12:50 AM, Michael S. Tsirkin wrote:
 On Thu, Feb 02, 2012 at 12:46:57AM -0800, John Fastabend wrote:
 On 2/1/2012 11:24 PM, Michael S. Tsirkin wrote:
 On Sun, Nov 20, 2011 at 08:30:24AM -0800, Roopa Prabhu wrote:



 On 11/17/11 4:15 PM, Ben Hutchings bhutchi...@solarflare.com wrote:

 Sorry to come to this rather late.

 On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
 [...]
 v2 - v3
 - Moved set and get filter ops from rtnl_link_ops to netdev_ops
 - Support for SRIOV VFs.
 [Note: The get filters msg (in the way current get rtnetlink 
 handles
 it) might get too big for SRIOV vfs. This patch follows existing
 sriov 
 vf get code and tries to accomodate filters for all VF's in a PF.
 And for the SRIOV case I have only tested the fact that the VF
 arguments are getting delivered to rtnetlink correctly. The code
 follows existing sriov vf handling code so rest of it should work
 fine]
 [...]

 This is already broken for large numbers of VFs, and increasing the
 amount of information per VF is going to make the situation worse.  I am
 no netlink expert but I think that the current approach of bundling all
 information about an interface in a single message may not be
 sustainable.

 Yes agreed. I have the same concern.

 So it seems that we need to extend the existing interface to allow
 tweaking filters per VF. Does it need to block this
 patchset though? After all, we'll need to support the existing

 hmm not sure I follow what patchset is this blocking?
 
 The one you are replying to.

Gotcha that would seem OK to me although I think you can avoid it altogether.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend

On 2/2/2012 10:07 AM, Roopa Prabhu wrote:
 
 
 
 On 2/2/12 12:46 AM, John Fastabend john.r.fastab...@intel.com wrote:
 
 On 2/1/2012 11:24 PM, Michael S. Tsirkin wrote:
 On Sun, Nov 20, 2011 at 08:30:24AM -0800, Roopa Prabhu wrote:



 On 11/17/11 4:15 PM, Ben Hutchings bhutchi...@solarflare.com wrote:

 Sorry to come to this rather late.

 On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
 [...]
 v2 - v3
 - Moved set and get filter ops from rtnl_link_ops to netdev_ops
 - Support for SRIOV VFs.
 [Note: The get filters msg (in the way current get rtnetlink
 handles
 it) might get too big for SRIOV vfs. This patch follows existing
 sriov 
 vf get code and tries to accomodate filters for all VF's in a PF.
 And for the SRIOV case I have only tested the fact that the VF
 arguments are getting delivered to rtnetlink correctly. The code
 follows existing sriov vf handling code so rest of it should work
 fine]
 [...]

 This is already broken for large numbers of VFs, and increasing the
 amount of information per VF is going to make the situation worse.  I am
 no netlink expert but I think that the current approach of bundling all
 information about an interface in a single message may not be
 sustainable.

 Yes agreed. I have the same concern.

 So it seems that we need to extend the existing interface to allow
 tweaking filters per VF. Does it need to block this
 patchset though? After all, we'll need to support the existing

 hmm not sure I follow what patchset is this blocking?

 interface indefinitely, too.


 OK finally got to read through this. And its not clear to me why we need
 these per VF/PF filter netdevice ops and netlink extensions if we can
 get the stacking correct. (Adding filters to the macvlan seems reasonable
 to me)

 In the cases I saw listed above I see a few enumerations:

 PF -- MACVLAN  --- Guest --- [...]

 VF -- MACVLAN  --- Guest --- [...]

 VF|Guest --- [...]   direct assigned VF

 PF|Guest --- [...]   direct assigned PF


 I used '[...]' to represent whatever additional stacking is done in the
 guest unknown to the host. In the direct assign VF case (Greg Rose
 correct me if I am wrong) the normal uc and mc addr lists should suffice
 along with the netdev op ndo_set_rx_mode(). Here the guest adds MAC
 addresses and/or VLANS as normal and then the VF-PF back channel
 should handle this if needed. This should work for Linux guests and other
 OS's should do something similar.

 In the direct assign PF case the hardware is owned by the guest so
 no problems here.

 This leaves the two MACVLAN cases which can be handled the same. If
 the MACVLAN driver and netlink interface is extended to add filters
 to the MACVLAN then the addresses can be pushed to the lower device
 using the normal dev_uc_{add|del}() and dev_mc_{add|del}() routines.
 
 My patches were trying to do just this (unless I am missing something).
 

Right I was trying enumerate the cases. Your patches 5,6 seem to use
dev_{uc|mc}_{add|del} like this.


 I think this has some real advantages to the above scheme. First
 we get rid of _all_ the drivers having to add a bunch of new
 net_device ops and do it once in the layer above. This is nice
 for driver implementers but also because your feature becomes usable
 immediately and we don't have to wait for driver developers to implement
 it.
 
 Yes my patches were targeting towards this too. I had macvlan implement the
 netlink ops and macvlan internally was using the dev_uc_add and del routines
 to pass the addr lists to lower device.

Yes. But I am missing why the VF ops and netlink extensions are useful. Or
even the op/netlink extension into the PF for that matter.

 

 Also it prunes down the number of netlink extensions being added
 here. 

 Additionally the existing semantics seem a bit strange to me on the
 netlink message side. Taking a quick look at the macvlan implementation
 it looks like every set has to have a complete list of address. But
 the dev_uc_add and dev_uc_del seem to be using a refcnt scheme so
 if I want to add a second address and then latter a third address
 how does that work?
 
 Every set has a complete list of addresses because, for macvlan non-passthru
 modes, in future we might want to have macvlan driver do the filtering (This
 is for the case when we have a single lower device and multiple macvlans)
 

hmm but lists seem problematic when hooked up to the netdev uc and mc addr
lists. Consider this case

read uc_list  --- thread1: dumps unicast table
add vlan  --- thread2: adds a vlan maybe inserting a uc addr
write uc_list --- thread1: writes the table back + 1 addr

Does the uc addr of the vlan get deleted? And this case

read uc_list   --- dump table
write uc_list  --- add a new filter A to the uc list
read uc_list   --- dump table
write uc_list  --- add a second filter B to the uc list

Now based on your patch 4,5 it looks like the refcnt

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend

On 2/2/2012 12:38 PM, Ben Hutchings wrote:
 On Thu, 2012-02-02 at 00:46 -0800, John Fastabend wrote:
 [...]
 OK finally got to read through this. And its not clear to me why we need
 these per VF/PF filter netdevice ops and netlink extensions if we can
 get the stacking correct. (Adding filters to the macvlan seems reasonable
 to me)

 In the cases I saw listed above I see a few enumerations:

 PF -- MACVLAN  --- Guest --- [...]

 VF -- MACVLAN  --- Guest --- [...]   

 VF|Guest --- [...]   direct assigned VF

 PF|Guest --- [...]   direct assigned PF


 I used '[...]' to represent whatever additional stacking is done in the
 guest unknown to the host. In the direct assign VF case (Greg Rose
 correct me if I am wrong) the normal uc and mc addr lists should suffice
 along with the netdev op ndo_set_rx_mode(). Here the guest adds MAC
 addresses and/or VLANS as normal and then the VF-PF back channel
 should handle this if needed. This should work for Linux guests and other
 OS's should do something similar.

 In the direct assign PF case the hardware is owned by the guest so
 no problems here.

 This leaves the two MACVLAN cases which can be handled the same. If
 the MACVLAN driver and netlink interface is extended to add filters
 to the MACVLAN then the addresses can be pushed to the lower device
 using the normal dev_uc_{add|del}() and dev_mc_{add|del}() routines.
 [...]
 
 There is another case: hybrid acceleration.  Without a bridge in the
 NIC, you need a software bridge for multicast/broadcast replication,
 traffic between guests, and traffic between guest and host.  A guest
 driver can then send and receive to remote addresses through a VF while
 retaining fallback to the software bridge.
 
 In order to do this, the guest driver needs to know which addresses are
 local.  The net driver for the PF can tell it about the addresses
 assigned to each function, but if there are other devices included in
 the bridge then it will not know about them.

Off the top of my head another approach would be to add a flag to the
PF maybe NETIF_F_FDB_OFFLOADED and have the bridge push the fdb updates
down to the PF which could propagate them into the VFs. Then you don't
need any new netdevice ops or netlink extensions.

This also would allow any 'learned' addresses to be pushed into the
VF and your daemon wouldn't have to monitor the bridges FDB and send
updates down.

 
 In Solarflare's current out-of-tree implementation this is dealt with in
 an extension to libvirt that writes the additional 'local' MAC addresses
 to a driver-specific sysfs file, but that is obviously not likely to be
 acceptable in-tree!  So I was interested in this proposal to extend MAC
 filtering, but wanted to get the semantics clear.

Agreed. Intel devices needs something similar to handle the bridge cases
where the MAC addresses of the virtual devices are not getting programmed
into the hardware forwarding table.

So we need something its just finding the right semantics and mechanism.

 
 Ben.
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: where should I report kvm kernel bug?

2012-01-13 Thread John 'Warthog9' Hawley

I think I tracked down that problem, it should clear itself up in the
next hour or so, and the missing e-mails should start arriving.

- John 'Warthog9' Hawley
Chief Kernel.org Administrator

On 01/13/2012 12:05 AM, Ren, Yongjie wrote:
 When someone comments in a bug in kernel.org's bugzilla, the bug reporter or 
 the CC list can't receive an update email for that.
 For example, Gleb and Avi updated the following bug, but I (the reporter) 
 didn't receive an email to notify about that.
 https://bugzilla.kernel.org/show_bug.cgi?id=42563
 
 
 Best Regards,
  Yongjie Ren  (Jay)
 
 
 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Ren, Yongjie
 Sent: Friday, January 06, 2012 11:27 AM
 To: John 'Warthog9' Hawley; Avi Kivity
 Cc: Konstantin Ryabitsev; kvm@vger.kernel.org; ftpadmin
 Subject: RE: where should I report kvm kernel bug?

 -Original Message-
 From: John 'Warthog9' Hawley [mailto:warth...@kernel.org]
 Sent: Friday, January 06, 2012 10:34 AM
 To: Avi Kivity
 Cc: Konstantin Ryabitsev; Ren, Yongjie; kvm@vger.kernel.org; ftpadmin
 Subject: Re: where should I report kvm kernel bug?

 On 01/03/2012 10:08 AM, Avi Kivity wrote:
 On 01/03/2012 06:28 PM, Konstantin Ryabitsev wrote:
 On Tue, 2012-01-03 at 12:10 +0200, Avi Kivity wrote:
 I'm surprised bugzilla.kernel.org isn't responding.  Copying
 ftpad...@kernel.org for ETA (though that address isn't responding,
 either).

 Hi, Avi and Ren:

 Unfortunately, we don't currently have an ETA for bringing up
 bugzilla.kernel.org. We are still awaiting some hardware deliveries
 before we can recover that service.

 The KVM developers suggest a virtual machine in the meantime?

 I suggest you ask KVM developers
 where they would prefer to have the bug reported.

 Ren, please use the qemu bug tracker temporarily, just make sure to mark
 it as a kernel bug.

 Sorry for lack of responses -- the holidays always make it more
 difficult to respond in time.


 Thanks.


 Ok, I've quietly turned bugzilla back on don't beat it too hard or
 you'll start affecting other services but it's alive.

 Thanks very much.
 And I filed a kvm bug on kernel bugzilla.
 rhel5u5 guest panic when booting up
 https://bugzilla.kernel.org/show_bug.cgi?id=42563
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: where should I report kvm kernel bug?

2012-01-05 Thread John 'Warthog9' Hawley

On 01/03/2012 10:08 AM, Avi Kivity wrote:
 On 01/03/2012 06:28 PM, Konstantin Ryabitsev wrote:
 On Tue, 2012-01-03 at 12:10 +0200, Avi Kivity wrote:
 I'm surprised bugzilla.kernel.org isn't responding.  Copying
 ftpad...@kernel.org for ETA (though that address isn't responding, either).

 Hi, Avi and Ren:

 Unfortunately, we don't currently have an ETA for bringing up
 bugzilla.kernel.org. We are still awaiting some hardware deliveries
 before we can recover that service. 
 
 The KVM developers suggest a virtual machine in the meantime?
 
 I suggest you ask KVM developers
 where they would prefer to have the bug reported.
 
 Ren, please use the qemu bug tracker temporarily, just make sure to mark
 it as a kernel bug.
 
 Sorry for lack of responses -- the holidays always make it more
 difficult to respond in time.

 
 Thanks.
 

Ok, I've quietly turned bugzilla back on don't beat it too hard or
you'll start affecting other services but it's alive.

- John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/

2011-11-08 Thread John Kacur



On Tue, 8 Nov 2011, Ted Ts'o wrote:

 On Tue, Nov 08, 2011 at 01:55:09PM +0100, Ingo Molnar wrote:
  I guess you can do well with a split project as well - my main claim 
  is that good compatibility comes *naturally* with integration.
 
 Here I have to disagree; my main worry is that integration makes it
 *naturally* easy for people to skip the hard work needed to keep a
 stable kernel/userspace interface.
 
 The other worry which I've mentioned, but which I haven't seen
 addressed, is that the even if you can use a perf from a newer kernel
 with an older kernel, this causes distributions a huge amount of pain,
 since they have to package two different kernel source packages, and
 only compile perf from the newer kernel source package.  This leads to
 all sorts of confusion from a distribution packaging point of view.
 
 For example, assume that RHEL 5, which is using 2.6.32 or something
 like that, wants to use a newer e2fsck that does a better job fixing
 file system corruptions.  If it were bundled with the kernel, then
 they would have to package up the v3.1 kernel sources, and have a
 source RPM that isn't used for building kernel sources, but just to
 build a newer version of e2fsck.  Fortunately, they don't have to do
 that.  They just pull down a newer version of e2fsprogs, and package,
 build, test, and ship that.
 
 In addition, suppose Red Hat ships a security bug fix which means a
 new kernel-image RPM has to be shipped.  Does that mean that Red Hat
 has to ship new binary RPM's for any and all tools/* programs that
 they have packaged as separate RPM's?  Or should installing a new
 kernel RPM also imply dropping new binaries in /usr/bin/perf, et. al?
 There are all sorts of packaging questions that are raised
 integration, and from where I sit I don't think they've been
 adequately solved yet.

 
This in practice is not a big deal.

There are many approaches for how the RPM can be built, but basically
getting the perf source is just a matter of
make perf-tar-src-pkg or friends such as
make perf-tarbz2-src-pkg
which will create perf-3.2.0-rc1.tar, and perf-3.2.0-rc1.tar.bz2
respectively which can be used for the src rpms. This tar ball can be used
as a separate package or subpackage.

Thanks

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Booting up an old Windows 3.1 harddisk image

2011-10-19 Thread John Stoffel


Avi  Guys,

Thanks for your patience, I just got the sucker booting by switching
the -M option to be an isapc system.  And I was wrong about it being a
Gateway, it's a packard Bell system running Windows 3.1, talk about a
blast from the past.  I think I still have the keyboard around
somehwere.

Anyway, here's my commanto run it, with npt=0 in the kvm_amd modules.

   kvm -no-acpi -no-hpet -cpu 486 -hda hda-caroline486.image -M isapc \
-m 4 -vga std

and I'll see about changing the defaults back on.  

Thanks,
John

-- 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Booting up an old Windows 3.1 harddisk image

2011-10-18 Thread John Stoffel

 Avi == Avi Kivity a...@redhat.com writes:

Avi On 10/18/2011 02:59 AM, John Stoffel wrote:
 Hi Guys,
 
 I'm not a subscriber to the kvm mailing list, so please copy me in
 your replies.
 
 I've got an old image of an i486 disk running (I think!) Windows 3.1
 which I want to bring up and play with, just to make sure I can use
 it.  
 
  file hda-caroline486.image 
 hda-caroline486.image: x86 boot sector, Microsoft Windows XP mbr;
 partition 1: ID=0x6, active, starthead 1, startsector 17, 254983
 sectors, code offset 0x33
 
 I basically just dd'd the entire disk into a file.  I can mount it
 using a loop device and then examine the partitions, so I think it's
 fine from that viewpoint.
 
 My server is a Debian 5.0 box, running on AMD Quad Core CPU, 8gb of
 RAM.  

Avi What's the host kernel version?

I thought I was running the stock Debian version, but I'm actually
running my own 3.1.0-rc4 kernel.  A bit out of date, but I don't
remember seeing any major KVM breakage in the later part of the
3.1-rc# series.

  Linux version 3.1.0-rc4-custom (john@quad) (gcc version 4.4.5 (Debian
  4.4.5-8) ) #1 SMP Wed Aug 31 12:12:32 EDT 2011

 I've got other KVM guests running just fine.  
 
 So I tried to use the following to boot the image:
 
  sudo qemu -no-acpi -no-hpet -cpu 486 -hda hda-caroline486.image -m \
 128 -vga std -vnc quad:44

Avi No need to sudo.

 And I get the following in a VNC screen:
 
 Starting SeaBIOS (version 0.5.1-20100616_222654-volta)
 
 Booting from Hard Disk...
 
 HIMEM: DOS XMS Driver, Version 3.07 - 02/14/92
 Extended Memory Specification (XMS) Version 3.0
 Copyright 1988-1992 Microsoft Corp.
 
 Installed A20 handler number 2.
 64K High Memory Area is available.
 
 
 MICROSOFT Expanded Memory Manager 386  Version 4.44
 Copyright Microsoft Corporation 1986, 1991
 
 _
 
 
 And that's it.  So it looks like I'm missing a driver or something
 here.  Do I need to define keyboard and mouse for this sucker?  The
 hardware is long gone, but I think it was a Gateway.  Total guess.
 

Avi It's probably a bug in kvm.  Try collecting a complete trace as in
Avi http://www.linux-kvm.org/page/Tracing and posting it here, maybe
Avi something will jump out.

Avi Also try out -no-kvm.

Thanks, I'll give this a try tonight when I'm home again.

John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Booting up an old Windows 3.1 harddisk image

2011-10-18 Thread John Stoffel

 John == John Stoffel j...@stoffel.org writes:

 Avi == Avi Kivity a...@redhat.com writes:
Avi On 10/18/2011 02:59 AM, John Stoffel wrote:
 Hi Guys,
 
 I'm not a subscriber to the kvm mailing list, so please copy me in
 your replies.
 
 I've got an old image of an i486 disk running (I think!) Windows 3.1
 which I want to bring up and play with, just to make sure I can use
 it.  
 
  file hda-caroline486.image 
 hda-caroline486.image: x86 boot sector, Microsoft Windows XP mbr;
 partition 1: ID=0x6, active, starthead 1, startsector 17, 254983
 sectors, code offset 0x33
 
 I basically just dd'd the entire disk into a file.  I can mount it
 using a loop device and then examine the partitions, so I think it's
 fine from that viewpoint.
 
 My server is a Debian 5.0 box, running on AMD Quad Core CPU, 8gb of
 RAM.  

Avi What's the host kernel version?

John I thought I was running the stock Debian version, but I'm actually
John running my own 3.1.0-rc4 kernel.  A bit out of date, but I don't
John remember seeing any major KVM breakage in the later part of the
John 3.1-rc# series.

John   Linux version 3.1.0-rc4-custom (john@quad) (gcc version 4.4.5 (Debian
John   4.4.5-8) ) #1 SMP Wed Aug 31 12:12:32 EDT 2011

 I've got other KVM guests running just fine.  
 
 So I tried to use the following to boot the image:
 
  sudo qemu -no-acpi -no-hpet -cpu 486 -hda hda-caroline486.image -m \
 128 -vga std -vnc quad:44

Avi No need to sudo.

Ok, didn't do it this time.  Still no luck from what I see.

 And I get the following in a VNC screen:
 
 Starting SeaBIOS (version 0.5.1-20100616_222654-volta)
 
 Booting from Hard Disk...
 
 HIMEM: DOS XMS Driver, Version 3.07 - 02/14/92
 Extended Memory Specification (XMS) Version 3.0
 Copyright 1988-1992 Microsoft Corp.
 
 Installed A20 handler number 2.
 64K High Memory Area is available.
 
 
 MICROSOFT Expanded Memory Manager 386  Version 4.44
 Copyright Microsoft Corporation 1986, 1991
 
 _
 
 
 And that's it.  So it looks like I'm missing a driver or something
 here.  Do I need to define keyboard and mouse for this sucker?  The
 hardware is long gone, but I think it was a Gateway.  Total guess.
 

Avi It's probably a bug in kvm.  Try collecting a complete trace as
Avi in http://www.linux-kvm.org/page/Tracing and posting it here,
Avi maybe something will jump out.

I tried doing that, but it's huge, mostly because I've got three other
KVM sessions running.  Let me poke at the trace-command to see how I
can limit it to just a single process to make it smaller.

Avi Also try out -no-kvm.

This doesn't exist in my version of 'qemu' or should I really be doing
something else?  I'm running:

qemu --version
   QEMU PC emulator version 0.12.5 (Debian 0.12.5+dfsg-3squeeze1),
   Copyright (c) 2003-2008 Fabrice Bellard

Oh wait, I see, I should be doing:

 kvm -no-acpi -no-hpet -cpu 486 -hda hda-caroline486.image -m 128 -vga std 
-vnc quad:4
open /dev/kvm: Permission denied
Could not initialize KVM, will disable KVM support

instead.  And it looks like I don't have KVM support anyway when I run
as myself.

John Thanks, I'll give this a try tonight when I'm home again.

Ok, so I ssh'd into home from work and now I can see it starting up,
but it's horribly slow.  Heh.  

So I then tried bumping down the memory to just 4 megs (-m 4) to see
what would happen.  No change.  

Thanks for all your help guys.
John


-- 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Booting up an old Windows 3.1 harddisk image

2011-10-17 Thread John Stoffel


Hi Guys,

I'm not a subscriber to the kvm mailing list, so please copy me in
your replies.

I've got an old image of an i486 disk running (I think!) Windows 3.1
which I want to bring up and play with, just to make sure I can use
it.  

 file hda-caroline486.image 
hda-caroline486.image: x86 boot sector, Microsoft Windows XP mbr;
partition 1: ID=0x6, active, starthead 1, startsector 17, 254983
sectors, code offset 0x33

I basically just dd'd the entire disk into a file.  I can mount it
using a loop device and then examine the partitions, so I think it's
fine from that viewpoint.

My server is a Debian 5.0 box, running on AMD Quad Core CPU, 8gb of
RAM.  I've got other KVM guests running just fine.  

So I tried to use the following to boot the image:

   sudo qemu -no-acpi -no-hpet -cpu 486 -hda hda-caroline486.image -m \
  128 -vga std -vnc quad:44

And I get the following in a VNC screen:

Starting SeaBIOS (version 0.5.1-20100616_222654-volta)

Booting from Hard Disk...

HIMEM: DOS XMS Driver, Version 3.07 - 02/14/92
Extended Memory Specification (XMS) Version 3.0
Copyright 1988-1992 Microsoft Corp.

Installed A20 handler number 2.
64K High Memory Area is available.


MICROSOFT Expanded Memory Manager 386  Version 4.44
Copyright Microsoft Corporation 1986, 1991

_


And that's it.  So it looks like I'm missing a driver or something
here.  Do I need to define keyboard and mouse for this sucker?  The
hardware is long gone, but I think it was a Gateway.  Total guess.

Thanks for any hints,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM cpu limitations

2011-08-09 Thread John Paul Walters


On Aug 9, 2011, at 5:13 AM, Avi Kivity wrote:

 On 08/08/2011 10:18 PM, John Paul Walters wrote:
 On Jul 21, 2011, at 2:10 AM, Avi Kivity wrote:
 
   On 07/21/2011 02:20 AM, John Paul Walters wrote:
   Hi,
 
   We have a 256 core SGI Ultraviolet machine running RHEL 6.1 with 
  qemu-kvm 0.13, and we'd like to be able to start large guest VMs of up to 
  256 cores.  I see that x86 guests are currently limited to 64 VCPUs.  Is 
  there any reason for this hard limitation?  It appears that we can't get 
  around this limitation by simply redefining the kernel's KVM_MAX_VCPUS to 
  256.  Qemu-kvm and possibly SeaBIOS seem to require changes as well.  Can 
  anyone offer any suggestions as to how straightforward it would be to 
  increase the number of CPUs that we can allocate to KVM guests?
 
 
   And here I am on record saying no one wants this...
 
   kvm.git has patches increasing the limit to 254 (256 is not possible due 
  to the APIC ID being 8  bits and two IDs being reserved).
 
   Latest seabios appears to have no cpu limits; qemu is limited to 255.
 
 
 
 Hi again,
 
 I've applied the 254 core patches (below) from kvm.git on a RHEL 6.1 kernel. 
  The new modules build and insert fine.
 
 https://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=052fa7f4c5e79262cffcdc90bdd94172e00d45e3
 https://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=29a07f8e31980599c586ea7d1f84957bc7fe98ed
 
 However, whenever I try to boot a system with more than 83 CPUs, the system 
 fails to boot with:
 
 Booting from Hard Disk...
 Boot failed: could not read the boot disk
 
 I'm using qemu-kvm.git with the following command line:
 /opt/qemu.git/bin/qemu-system-x86_64 -smp 84 -hda big_image_2.qcow2 -m 8388 
 -redir tcp:52109::22
 
 Does anyone have any suggestions?
 
 
 
 Most likely a seabios failure.  Suggest you enable debugging in seabios and 
 see what's going on; also copy the seabios mailing list.
 
Hi Avi,

I've enabled debugging in seabios (#define DEBUG_BIOS) and get the output 
below.  Note that with the help of folks in the KVM irc channel I'm able to 
start a 254 core instance using the KVM tool, so the problem seems to be 
limited to qemu/seabios.

best,
JP

jwalters@uv /tmp/qemu_test_jp $ /opt/qemu.git/bin/qemu-system-x86_64 -smp 84 
-drive file=big_image_2.qcow2,if=virtio -m 8388
warning: subregion collision fffe/2 vs 0/12c40
VNC server running on `::1:5901'
Start bios (version pre-0.6.3-20110315_112143-titi)
Ram Size=0xe000 (0x00012c40 high)
Relocating init from 0x000e49d0 to 0xdffe1880 (size 58968)
CPU Mhz=2002
PCI: pci_bios_init_bus_rec bus = 0x0
PIIX3/PIIX4 init: elcr=00 0c
PCI: bus=0 devfn=0x00: vendor_id=0x8086 device_id=0x1237
PCI: bus=0 devfn=0x08: vendor_id=0x8086 device_id=0x7000
PCI: bus=0 devfn=0x09: vendor_id=0x8086 device_id=0x7010
region 4: 0xc000
PCI: bus=0 devfn=0x0b: vendor_id=0x8086 device_id=0x7113
PCI: bus=0 devfn=0x10: vendor_id=0x1013 device_id=0x00b8
region 0: 0xf000
region 1: 0xf200
region 6: 0xf201
PCI: bus=0 devfn=0x18: vendor_id=0x10ec device_id=0x8139
region 0: 0xc100
region 1: 0xf202
region 6: 0xf203
PCI: bus=0 devfn=0x20: vendor_id=0x1af4 device_id=0x1001
region 0: 0xc200
region 1: 0xf204
Found 84 cpu(s) max supported 84 cpu(s)
MP table addr=0x000fd4b0 MPC table addr=0x000fd4c0 size=1892
SMBIOS ptr=0x000fd490 table=0xd030
ACPI tables: RSDP=0x000fd460 RSDT=0xdfffa5c0
Scan for VGA option rom
Running option rom at c000:0003
VGABios $Id$
Turning on vga text mode console
SeaBIOS (version pre-0.6.3-20110315_112143-titi)

Found 1 lpt ports
Found 1 serial ports
ATA controller 0 at 1f0/3f4/0 (irq 14 dev 9)
ATA controller 1 at 170/374/0 (irq 15 dev 9)
found virtio-blk at 0:4
ebda moved from 9fc00 to 9dc00
WARNING - Unable to allocate resource at init_virtio_blk:107!
WARNING - Unable to allocate resource at init_atadrive:740!
PS2 keyboard initialized
All threads complete.
Scan for option roms
Running option rom at c900:0003
pmm call arg1=1
pmm call arg1=0
pmm call arg1=1
pmm call arg1=0
Searching bootorder for: /pci@i0cf8/*@3
Searching bootorder for: /rom@genroms/vapic.bin
Running option rom at ca00:0003
Returned 40960 bytes of ZoneHigh
e820 map has 8 items:
  0:  - 0009dc00 = 1
  1: 0009dc00 - 000a = 2
  2: 000f - 0010 = 2
  3: 0010 - dfffa000 = 1
  4: dfffa000 - e000 = 2
  5: feffc000 - ff00 = 2
  6: fffc - 0001 = 2
  7: 0001 - 00022c40 = 1
enter handle_19:
  NULL
Booting from ROM...
Booting from c900:0372
enter handle_18:
  NULL
Booting from Hard Disk...
Boot failed: could not read the boot disk

enter handle_18:
  NULL
Booting from Floppy...
Boot failed: could not read the boot disk

enter handle_18:
  NULL
No bootable device.




 -- 
 error compiling committee.c: too many arguments to function
 

--
To unsubscribe from this list: send the line

Re: KVM cpu limitations

2011-08-08 Thread John Paul Walters

On Jul 21, 2011, at 2:10 AM, Avi Kivity wrote:

On 07/21/2011 02:20 AM, John Paul Walters wrote:
Hi,

We have a 256 core SGI Ultraviolet machine running RHEL 6.1 with qemu-kvm
0.13, and we'd like to be able to start large guest VMs of up to 256 cores.
I see that x86 guests are currently limited to 64 VCPUs. Is there any
reason for this hard limitation? It appears that we can't get around this
limitation by simply redefining the kernel's KVM_MAX_VCPUS to 256. Qemu-kvm
and possibly SeaBIOS seem to require changes as well. Can anyone offer any
suggestions as to how straightforward it would be to increase the number of
CPUs that we can allocate to KVM guests?

And here I am on record saying no one wants this...

kvm.git has patches increasing the limit to 254 (256 is not possible due to
the APIC ID being 8 bits and two IDs being reserved).

Latest seabios appears to have no cpu limits; qemu is limited to 255.

Hi again,

I've applied the 254 core patches (below) from kvm.git on a RHEL 6.1 kernel.
The new modules build and insert fine.

https://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=052fa7f4c5e79262cffcdc90bdd94172e00d45e3
https://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=29a07f8e31980599c586ea7d1f84957bc7fe98ed

However, whenever I try to boot a system with more than 83 CPUs, the system
fails to boot with:

Booting from Hard Disk...
Boot failed: could not read the boot disk

I'm using qemu-kvm.git with the following command line:
/opt/qemu.git/bin/qemu-system-x86_64 -smp 84 -hda big_image_2.qcow2 -m 8388
-redir tcp:52109::22

Does anyone have any suggestions?

thanks,
JP

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH] V2 Handle guest access to BBL_CR_CTL3 MSR

2011-01-20 Thread john cooper

[Resubmit of prior version which contained a wayward
patch hunk.  Thanks Marcelo]

A correction to Intel cpu model CPUID data (patch queued)
caused winxp to BSOD when booted with a Penryn model.
This was traced to the CPUID model field correction from
6 - 23 (as is proper for a Penryn class of cpu).  Only in
this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3
MSR which is unsupported by current kvm, appears to be a
legacy MSR not fully characterized yet existing in current
silicon, and is apparently carried forward in MSR space to
accommodate vintage code as here.  It is not yet conclusive
whether this MSR implements any of its legacy functionality
or is just an ornamental dud for compatibility.  While I
found no silicon version specific documentation link to
this MSR, a general description exists in Intel's developer's
reference which agrees with the functional behavior of
other bootloader/kernel code I've examined accessing
BBL_CR_CTL3.  Regrettably winxp appears to be setting bit #19
called out as reserved in the above document.

So to minimally accommodate this MSR, kvm msr get will provide
the equivalent mock data and kvm msr write will simply toss the
guest passed data without interpretation.  While this treatment
of BBL_CR_CTL3 addresses the immediate problem, the approach may
be modified pending clarification from Intel.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4d0dfa0..5bfafb6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 
 #define MSR_MTRRcap0x00fe
 #define MSR_IA32_BBL_CR_CTL0x0119
+#define MSR_IA32_BBL_CR_CTL3   0x011e
 
 #define MSR_IA32_SYSENTER_CS   0x0174
 #define MSR_IA32_SYSENTER_ESP  0x0175
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..04d6c55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1592,6 +1592,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
} else
return set_msr_hyperv(vcpu, msr, data);
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* Drop writes to this legacy MSR -- see rdmsr
+* counterpart for further detail.
+*/
+   pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data);
+   break;
default:
if (msr  (msr == vcpu-kvm-arch.xen_hvm_config.msr))
return xen_hvm_config(vcpu, data);
@@ -1846,6 +1852,19 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 *pdata)
} else
return get_msr_hyperv(vcpu, msr, pdata);
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* This legacy MSR exists but isn't fully documented in current
+* silicon.  It is however accessed by winxp in very narrow
+* scenarios where it sets bit #19, itself documented as
+* a reserved bit.  Best effort attempt to source coherent
+* read data here should the balance of the register be
+* interpreted by the guest:
+*
+* L2 cache control register 3: 64GB range, 256KB size,
+* enabled, latency 0x1, configured
+*/ 
+   data = 0xbe702111;
+   break;
default:
if (!ignore_msrs) {
pr_unimpl(vcpu, unhandled rdmsr: 0x%x\n, msr);


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Handle guest access to BBL_CR_CTL3 MSR

2011-01-18 Thread john cooper

Marcelo Tosatti wrote:
 On Sat, Jan 08, 2011 at 12:05:14AM -0500, john cooper wrote:
   
 A correction to Intel cpu model CPUID data (patch queued)
 caused winxp-64 to BSOD when booted with a Penryn model.
 This was traced to the CPUID model field correction from
 6 - 23 (as is proper for a Penryn class of cpu).  Only in
 this case does the problem surface.

 The cause for this failure is winxp accessing the BBL_CR_CTL3
 MSR which is unsupported by current kvm, appears to be a
 legacy MSR not fully characterized yet existing in current
 silicon, and is apparently carried forward in MSR space to
 accommodate vintage code as here.  It is not yet conclusive
 whether this MSR implements any of its legacy functionality
 or is just an ornamental dud for compatibility.  While I
 found no silicon version specific documentation link to
 this MSR, a general description exists in Intel's developer's
 reference which agrees with the functional behavior of
 other bootloader/kernel code I've examined accessing
 BBL_CR_CTL3.  Regrettably winxp-64 appears to be setting bit #19
 called out as reserved in the above document.

 So to minimally accommodate this MSR, kvm msr get will provide
 the equivalent mock data and kvm msr write will simply toss the
 guest passed data without interpretation.  While this treatment
 of BBL_CR_CTL3 addresses the immediate problem, the approach may
 be modified pending clarification from Intel.

 Signed-off-by: john cooper john.coo...@redhat.com
 ---

 diff --git a/arch/x86/include/asm/msr-index.h 
 b/arch/x86/include/asm/msr-index.h
 index 6b89f5e..145cd60 100644
 --- a/arch/x86/include/asm/msr-index.h
 +++ b/arch/x86/include/asm/msr-index.h
 @@ -38,6 +38,7 @@
  
  #define MSR_MTRRcap 0x00fe
  #define MSR_IA32_BBL_CR_CTL 0x0119
 +#define MSR_IA32_BBL_CR_CTL30x011e
  
  #define MSR_IA32_SYSENTER_CS0x0174
  #define MSR_IA32_SYSENTER_ESP   0x0175
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index fa708c9..9a8331c 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1283,6 +1283,12 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 
 msr, u64 data)
  return -1;
  vcpu-arch.mcg_ctl = data;
  break;
 +case MSR_IA32_BBL_CR_CTL3:
 +/* Drop writes to this legacy MSR -- see rdmsr
 + * counterpart for further detail.
 + */
 +pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data);
 +break;
  default:
  if (msr = MSR_IA32_MC0_CTL 
  msr  MSR_IA32_MC0_CTL + 4 * bank_num) {
 @@ -1592,6 +1598,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 
 msr, u64 data)
  } else
  return set_msr_hyperv(vcpu, msr, data);
  break;
 +case MSR_IA32_BBL_CR_CTL3:
 +/* This legacy MSR exists but isn't fully documented in current
 + * silicon.  It is however accessed by winxp in very narrow
 + * scenarios where it sets bit #19, itself documented as
 + * a reserved bit.  Best effort attempt to source coherent
 + * read data here should the balance of the register be
 + * interpreted by the guest:
 + *
 + * L2 cache control register 3: 64GB range, 256KB size,
 + * enabled, latency 0x1, configured
 + */ 
 +data = 0xbe702111;
 +break;
  default:
  if (msr  (msr == vcpu-kvm-arch.xen_hvm_config.msr))
  return xen_hvm_config(vcpu, data);
 

 This is the MSR write path ?
   
The write path ignores the data. The MSR read returns a
mock-up of BBL_CR_CTL3 data. The above addresses the
narrow usage leading to the immediate bug. From the
feedback thus far from Intel I believe the above is sufficient
and minimal.

-john



-- 
john.coo...@third-harmonic.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Handle guest access to BBL_CR_CTL3 MSR

2011-01-07 Thread john cooper

A correction to Intel cpu model CPUID data (patch queued)
caused winxp-64 to BSOD when booted with a Penryn model.
This was traced to the CPUID model field correction from
6 - 23 (as is proper for a Penryn class of cpu).  Only in
this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3
MSR which is unsupported by current kvm, appears to be a
legacy MSR not fully characterized yet existing in current
silicon, and is apparently carried forward in MSR space to
accommodate vintage code as here.  It is not yet conclusive
whether this MSR implements any of its legacy functionality
or is just an ornamental dud for compatibility.  While I
found no silicon version specific documentation link to
this MSR, a general description exists in Intel's developer's
reference which agrees with the functional behavior of
other bootloader/kernel code I've examined accessing
BBL_CR_CTL3.  Regrettably winxp-64 appears to be setting bit #19
called out as reserved in the above document.

So to minimally accommodate this MSR, kvm msr get will provide
the equivalent mock data and kvm msr write will simply toss the
guest passed data without interpretation.  While this treatment
of BBL_CR_CTL3 addresses the immediate problem, the approach may
be modified pending clarification from Intel.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 6b89f5e..145cd60 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 
 #define MSR_MTRRcap0x00fe
 #define MSR_IA32_BBL_CR_CTL0x0119
+#define MSR_IA32_BBL_CR_CTL3   0x011e
 
 #define MSR_IA32_SYSENTER_CS   0x0174
 #define MSR_IA32_SYSENTER_ESP  0x0175
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..9a8331c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1283,6 +1283,12 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
return -1;
vcpu-arch.mcg_ctl = data;
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* Drop writes to this legacy MSR -- see rdmsr
+* counterpart for further detail.
+*/
+   pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data);
+   break;
default:
if (msr = MSR_IA32_MC0_CTL 
msr  MSR_IA32_MC0_CTL + 4 * bank_num) {
@@ -1592,6 +1598,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
} else
return set_msr_hyperv(vcpu, msr, data);
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* This legacy MSR exists but isn't fully documented in current
+* silicon.  It is however accessed by winxp in very narrow
+* scenarios where it sets bit #19, itself documented as
+* a reserved bit.  Best effort attempt to source coherent
+* read data here should the balance of the register be
+* interpreted by the guest:
+*
+* L2 cache control register 3: 64GB range, 256KB size,
+* enabled, latency 0x1, configured
+*/ 
+   data = 0xbe702111;
+   break;
default:
if (msr  (msr == vcpu-kvm-arch.xen_hvm_config.msr))
return xen_hvm_config(vcpu, data);


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API

2010-08-20 Thread john stultz

On Thu, 2010-08-19 at 22:07 -1000, Zachary Amsden wrote:
 Add a kernel call to get the number of nanoseconds since boot.  This
 is generally useful enough to make it a generic call.

Few comments here.

 Signed-off-by: Zachary Amsden zams...@redhat.com
 ---
  include/linux/time.h  |1 +
  kernel/time/timekeeping.c |   27 +++
  2 files changed, 28 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/time.h b/include/linux/time.h
 index ea3559f..5d04108 100644
 --- a/include/linux/time.h
 +++ b/include/linux/time.h
 @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
  extern void getrawmonotonic(struct timespec *ts);
  extern void getboottime(struct timespec *ts);
  extern void monotonic_to_bootbased(struct timespec *ts);
 +extern s64 getnsboottime(void);

So instead of converting the timespec from getboottime, why did you add
a new interface? Also if not a timespec, why did you pick a s64 instead
of a ktime_t?


  extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
  extern int timekeeping_valid_for_hres(void);
 diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
 index caf8d4d..d250f0a 100644
 --- a/kernel/time/timekeeping.c
 +++ b/kernel/time/timekeeping.c
 @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
  }
  EXPORT_SYMBOL_GPL(ktime_get_ts);
 
 +
 +/**
 + * getnsboottime - get the bootbased clock in nsec format
 + *
 + * The function calculates the bootbased clock from the realtime
 + * clock and the wall_to_monotonic offset and stores the result
 + * in normalized timespec format in the variable pointed to by @ts.
 + */
 +s64 getnsboottime(void)
 +{
 + unsigned int seq;
 + s64 secs, nsecs;
 +
 + WARN_ON(timekeeping_suspended);
 +
 + do {
 + seq = read_seqbegin(xtime_lock);
 + secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
 + secs += total_sleep_time.tv_sec;
 + nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
 + nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
 +
 + } while (read_seqretry(xtime_lock, seq));
 + return nsecs + (secs * NSEC_PER_SEC);
 +}
 +EXPORT_SYMBOL_GPL(getnsboottime);

You forgot to include the boottime.tv_sec/nsec offset in this. Take a
look again at getboottime()

thanks
-john


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API

2010-08-20 Thread john stultz

On Fri, 2010-08-20 at 13:37 -1000, Zachary Amsden wrote:
 On 08/20/2010 08:39 AM, john stultz wrote:
  On Thu, 2010-08-19 at 22:07 -1000, Zachary Amsden wrote:
 
  Add a kernel call to get the number of nanoseconds since boot.  This
  is generally useful enough to make it a generic call.
   
  Few comments here.
 
 
  Signed-off-by: Zachary Amsdenzams...@redhat.com
  ---
include/linux/time.h  |1 +
kernel/time/timekeeping.c |   27 +++
2 files changed, 28 insertions(+), 0 deletions(-)
 
  diff --git a/include/linux/time.h b/include/linux/time.h
  index ea3559f..5d04108 100644
  --- a/include/linux/time.h
  +++ b/include/linux/time.h
  @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
extern void getrawmonotonic(struct timespec *ts);
extern void getboottime(struct timespec *ts);
extern void monotonic_to_bootbased(struct timespec *ts);
  +extern s64 getnsboottime(void);
   
  So instead of converting the timespec from getboottime, why did you add
  a new interface? Also if not a timespec, why did you pick a s64 instead
  of a ktime_t?
 
 
 The new interface was suggested several times, so I'm proposing it.  I'm 
 indifferent to putting it the kernel API or making it internal to KVM.  
 KVM doesn't want to deal with conversions to / from ktime_t; this code 
 uses a lot (too much) math, and it's easy to get wrong when splitting 
 sec / nsec fields.  So s64 seems a natural type for ns values.  I 
 realize it's not entirely consistent with the kernel API, but s64 
 representation for ns seems to be creeping in.

I can understand wanting that, way back I was pushing for s64 ns
representations for most time values, but the ktime_t was considered a
reasonable compromise to avoid costly 64bit divides to split (sec,nsec)
on 32bit arches.

Maybe call it getboottime_ns() just to distinguish it from
getnstimeofday() which returns a timespec?


extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
extern int timekeeping_valid_for_hres(void);
  diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
  index caf8d4d..d250f0a 100644
  --- a/kernel/time/timekeeping.c
  +++ b/kernel/time/timekeeping.c
  @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
}
EXPORT_SYMBOL_GPL(ktime_get_ts);
 
  +
  +/**
  + * getnsboottime - get the bootbased clock in nsec format
  + *
  + * The function calculates the bootbased clock from the realtime
  + * clock and the wall_to_monotonic offset and stores the result
  + * in normalized timespec format in the variable pointed to by @ts.
  + */
  +s64 getnsboottime(void)
  +{
  +  unsigned int seq;
  +  s64 secs, nsecs;
  +
  +  WARN_ON(timekeeping_suspended);
  +
  +  do {
  +  seq = read_seqbegin(xtime_lock);
  +  secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
  +  secs += total_sleep_time.tv_sec;
  +  nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
  +  nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
  +
  +  } while (read_seqretry(xtime_lock, seq));
  +  return nsecs + (secs * NSEC_PER_SEC);
  +}
  +EXPORT_SYMBOL_GPL(getnsboottime);
   
  You forgot to include the boottime.tv_sec/nsec offset in this. Take a
  look again at getboottime()
 
 
 I don't think so... boottime is internal to getboottime, and it's just 
 wall_to_monotonic + total_sleep_time -- right?

Right, sorry, some architectures refine boot time even further,
providing an offset from when the machine was actually powered on to
when the timekeeping code was initialized. But that's already adjusted
into wall_to_monotonic at startup. I thought we kept it separately.


 Perhaps I've named the function badly.  What I want is the monotonic 
 clock, adjusted for sleep time - i.e. a clock that counts elapsed real 
 time without accounting for wall clock changes due to time zone, which 
 never goes backwards.

That looks fine then.  Its a little confusing since getboottime()
returns a timespec with the absolute time that the system booted. Where
as your interface is providing the time since boot. 

Maybe gettimefromboot_ns() would be clearer?

thanks
-john






--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API

2010-08-20 Thread john stultz

On Fri, 2010-08-20 at 14:52 -1000, Zachary Amsden wrote:
 I think gettimefromboot_ns() is a good descriptive name, but slightly 
 too long - it would ruin my indentation.  Perhaps getrealtime_ns()?

Sigh... So getrealtime_ns would probably be confused with
CLOCK_REALTIME, which is wall time.  :P

At this point it feels too nitpicky to suggest anything else, so go
ahead and use boottime_ns and we'll refine things if anyone actually
trips up on it.

thanks
-john


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: vhost-net unreleased?

2010-08-19 Thread John Bellessa

On Tue, Aug 17, 2010 at 3:13 AM, Avi Kivity a...@redhat.com wrote:
  On 08/17/2010 09:58 AM, Christian Theune wrote:

 Hi,

 I've been plugging through code and presentations trying to find out
 whether the KVM/qemu side of vhost-net has been released yet. The git
 archive seems to include the vhost-net code since about a year already, but
 I could not find any trace of it in 0.12.5.

 Can you confirm, that vhost-net on the kvm/qemu side is not released yet?
 And if it is so, does anyone have a gut feeling of when it will be?

 vhost-net will be supported by qemu-kvm 0.13 which is on track for release
 soon.

 --
 error compiling committee.c: too many arguments to function

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


I'm a little confused.  The vhost-net page
(http://www.linux-kvm.org/page/VhostNet) indicates that it is fully
functional.  Could you clarify that a bit?

Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: bad O_DIRECT read and write performance with small block sizes with virtio

2010-08-03 Thread John Leach

On Mon, 2010-08-02 at 21:50 +0100, Stefan Hajnoczi wrote:
 On Mon, Aug 2, 2010 at 6:46 PM, Anthony Liguori anth...@codemonkey.ws wrote:
  On 08/02/2010 12:15 PM, John Leach wrote:
 
  Hi,
 
  I've come across a problem with read and write disk IO performance when
  using O_DIRECT from within a kvm guest.  With O_DIRECT, reads and writes
  are much slower with smaller block sizes.  Depending on the block size
  used, I've seen 10 times slower.
 
  For example, with an 8k block size, reading directly from /dev/vdb
  without O_DIRECT I see 750 MB/s, but with O_DIRECT I see 79 MB/s.
 
  As a comparison, reading in O_DIRECT mode in 8k blocks directly from the
  backend device on the host gives 2.3 GB/s.  Reading in O_DIRECT mode
  from a xen guest on the same hardware manages 263 MB/s.
 
 
  Stefan has a few fixes for this behavior that help a lot.  One of them
  (avoiding memset) is already upstream but not in 0.12.x.

Anthony, that patch is already applied in the RHEL6 package I'm been
testing with - I've just manually confirmed that.  Thanks though.

 
  The other two are not done yet but should be on the ML in the next couple
  weeks.  They involve using ioeventfd for notification and unlocking the
  block queue lock while doing a kick notification.
 
 Thanks for mentioning those patches.  The ioeventfd patch will be sent
 this week, I'm checking that migration works correctly and then need
 to check that vhost-net still works.

I'll give them a test as soon as I can get hold of them, thanks Stefan!

  Writing is affected in the same way, and exhibits the same behaviour
  with O_SYNC too.
 
  Watching with vmstat on the host, I see the same number of blocks being
  read, but about 14 times the number of context switches in O_DIRECT mode
  (4500 cs vs. 63000 cs) and a little more cpu usage.
 
  The device I'm writing to is a device-mapper zero device that generates
  zeros on read and throws away writes, you can set it up
  at /dev/mapper/zero like this:
 
  echo 0 21474836480 zero | dmsetup create zero
 
  My libvirt config for the disk is:
 
  disk type='block' device='disk'
driver cache='none'/
source dev='/dev/mapper/zero'/
target dev='vdb' bus='virtio'/
address type='pci' domain='0x' bus='0x00' slot='0x06'
  function='0x0'/
  /disk
 
  which translates to the kvm arg:
 
  -device
  virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
  -drive file=/dev/mapper/zero,if=none,id=drive-virtio-disk1,cache=none
 
  I'm testing with dd:
 
  dd if=/dev/vdb of=/dev/null bs=8k iflag=direct
 
  As a side note, as you increase the block size read performance in
  O_DIRECT mode starts to overtake non O_DIRECT mode reads (from about
  150k block size). By 550k block size I'm seeing 1 GB/s reads with
  O_DIRECT and 770 MB/s without.
 
 Can you take QEMU out of the picture and run the same test on the host:
 
 dd if=/dev/vdb of=/dev/null bs=8k iflag=direct
 vs
 dd if=/dev/vdb of=/dev/null bs=8k
 
 This isn't quite the same because QEMU will use a helper thread doing
 preadv.  I'm not sure what syscall dd will use.
 
 It should be close enough to determine whether QEMU and device
 emulation are involved at all though, or whether these differences are
 due to the host kernel code path down to the device mapper zero device
 being different for normal vs O_DIRECT.


dd if=/dev/mapper/zero of=/dev/null bs=8k count=100 iflag=direct
819200 bytes (8.2 GB) copied, 3.46529 s, 2.4 GB/s

dd if=/dev/mapper/zero of=/dev/null bs=8k count=100
819200 bytes (8.2 GB) copied, 5.5741 s, 1.5 GB/s

dd is just using read.

Thanks,

John.




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: bad O_DIRECT read and write performance with small block sizes with virtio

2010-08-03 Thread John Leach

On Tue, 2010-08-03 at 09:35 +0300, Dor Laor wrote:
 On 08/02/2010 11:50 PM, Stefan Hajnoczi wrote:
  On Mon, Aug 2, 2010 at 6:46 PM, Anthony Liguorianth...@codemonkey.ws  
  wrote:
  On 08/02/2010 12:15 PM, John Leach wrote:
 
  Hi,
 
  I've come across a problem with read and write disk IO performance when
  using O_DIRECT from within a kvm guest.  With O_DIRECT, reads and writes
  are much slower with smaller block sizes.  Depending on the block size
  used, I've seen 10 times slower.
 
  For example, with an 8k block size, reading directly from /dev/vdb
  without O_DIRECT I see 750 MB/s, but with O_DIRECT I see 79 MB/s.
 
  As a comparison, reading in O_DIRECT mode in 8k blocks directly from the
  backend device on the host gives 2.3 GB/s.  Reading in O_DIRECT mode
  from a xen guest on the same hardware manages 263 MB/s.
 
 
  Stefan has a few fixes for this behavior that help a lot.  One of them
  (avoiding memset) is already upstream but not in 0.12.x.
 
  The other two are not done yet but should be on the ML in the next couple
  weeks.  They involve using ioeventfd for notification and unlocking the
  block queue lock while doing a kick notification.
 
  Thanks for mentioning those patches.  The ioeventfd patch will be sent
  this week, I'm checking that migration works correctly and then need
  to check that vhost-net still works.
 
  Writing is affected in the same way, and exhibits the same behaviour
  with O_SYNC too.
 
  Watching with vmstat on the host, I see the same number of blocks being
  read, but about 14 times the number of context switches in O_DIRECT mode
  (4500 cs vs. 63000 cs) and a little more cpu usage.
 
  The device I'm writing to is a device-mapper zero device that generates
  zeros on read and throws away writes, you can set it up
  at /dev/mapper/zero like this:
 
  echo 0 21474836480 zero | dmsetup create zero
 
  My libvirt config for the disk is:
 
  disk type='block' device='disk'
 driver cache='none'/
 source dev='/dev/mapper/zero'/
 target dev='vdb' bus='virtio'/
 address type='pci' domain='0x' bus='0x00' slot='0x06'
  function='0x0'/
  /disk
 
  which translates to the kvm arg:
 
  -device
  virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
  -drive file=/dev/mapper/zero,if=none,id=drive-virtio-disk1,cache=none
 
 aio=native and change the io scheduler on the host to deadline should 
 help as well.

No improvement in this case (I was already using deadline on the host,
and just tested with aio=native). Tried with a real disk backend too,
still no improvement.

I'll try with and without once I get Stefan's other patches too though.

Thanks,

John.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: bad O_DIRECT read and write performance with small block sizes with virtio

2010-08-03 Thread John Leach

On Tue, 2010-08-03 at 17:44 +0300, Avi Kivity wrote:
 On 08/03/2010 05:40 PM, John Leach wrote:
 
  dd if=/dev/mapper/zero of=/dev/null bs=8k count=100 iflag=direct
  819200 bytes (8.2 GB) copied, 3.46529 s, 2.4 GB/s
 
  dd if=/dev/mapper/zero of=/dev/null bs=8k count=100
  819200 bytes (8.2 GB) copied, 5.5741 s, 1.5 GB/s
 
  dd is just using read.
 
 
 What's /dev/mapper/zero?  A real volume or a zero target?
 

zero target:

echo 0 21474836480 zero | dmsetup create zero

The same performance penalty occurs when using real disks though, I just
moved to a zero target to rule out the variables of spinning metal and
raid controller caches.

John.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

bad O_DIRECT read and write performance with small block sizes with virtio

2010-08-02 Thread John Leach

Hi,

I've come across a problem with read and write disk IO performance when
using O_DIRECT from within a kvm guest.  With O_DIRECT, reads and writes
are much slower with smaller block sizes.  Depending on the block size
used, I've seen 10 times slower.

For example, with an 8k block size, reading directly from /dev/vdb
without O_DIRECT I see 750 MB/s, but with O_DIRECT I see 79 MB/s.

As a comparison, reading in O_DIRECT mode in 8k blocks directly from the
backend device on the host gives 2.3 GB/s.  Reading in O_DIRECT mode
from a xen guest on the same hardware manages 263 MB/s.

Writing is affected in the same way, and exhibits the same behaviour
with O_SYNC too.

Watching with vmstat on the host, I see the same number of blocks being
read, but about 14 times the number of context switches in O_DIRECT mode
(4500 cs vs. 63000 cs) and a little more cpu usage.

The device I'm writing to is a device-mapper zero device that generates
zeros on read and throws away writes, you can set it up
at /dev/mapper/zero like this:

echo 0 21474836480 zero | dmsetup create zero

My libvirt config for the disk is:

disk type='block' device='disk'
  driver cache='none'/
  source dev='/dev/mapper/zero'/
  target dev='vdb' bus='virtio'/
  address type='pci' domain='0x' bus='0x00' slot='0x06' function='0x0'/
/disk

which translates to the kvm arg:

-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
-drive file=/dev/mapper/zero,if=none,id=drive-virtio-disk1,cache=none

I'm testing with dd:

dd if=/dev/vdb of=/dev/null bs=8k iflag=direct

As a side note, as you increase the block size read performance in
O_DIRECT mode starts to overtake non O_DIRECT mode reads (from about
150k block size). By 550k block size I'm seeing 1 GB/s reads with
O_DIRECT and 770 MB/s without.

Of course I see this performance situation with real disks too, I just
wanted to rule out the variables of moving metal around.

I get the same situation on Centos 5.5 and the latest RHEL 6 beta (which
is kvm 0.12 and kernel 2.6.32).  Hardware is a Dell i510 with 64GB RAM
and 12 Intel(R) Xeon(R) L5640 2.27GHz CPUs (running only one kvm guest
with 1G ram).  Host disk scheduler is deadline, guest disk scheduler is
noop.

Guest distro is Ubuntu Lucid, 2.6.32-22-server. I've tried with both
32bit pae and 64bit guest kernels.

Anyone got any thoughts on this?

Thanks,

John.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Does KVM support VMDq?

2010-07-23 Thread John Bellessa

Hi all,

I have been trying to figure out whether or not KVM has support for
Intel's VMDq.  I've found some posts and mailing list threads relating
to it, but nothing stating specifically that there is support.  At
least, not in language that wasn't a bit over my head.  I've also seen
some documents from Intel mentioning KVM, but still nothing explicitly
saying one way or the other.

Can anyone help me out?

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Looking at using KVM for embedded product

2010-06-23 Thread john cooper

Tom Shoes wrote:
 Hi there,
 
 I am looking at using KVM for an embedded product. I am also new
 to Virtualization so pardon if
 I ask dumb questions. This is my first time posting to this forum.
 
The embedded product that need to run KVM has:
 
  a. Intel processor with VT
  b. BIOS supports enabling VT
  c. Linux kernel 2.6.26 (from kernel.org)
  d. No VGA adapter
  e. Serial console
  f. BusyBox

Busybox is an interesting requirement in that context.  If
you are constrained with userland size and linking against
other than glibc, use of qemu could be interesting.  Can't
say I've built it other than linked against glibc and an
extensive list of runtime libraries.  Although I've never
tried to configure-down that dependency. 

-john

-- 
john.coo...@third-harmonic.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread john cooper

Rusty Russell wrote:
 On Sat, 19 Jun 2010 04:08:02 am Ryan Harper wrote:
 Create a new attribute for virtio-blk devices that will fetch the serial 
 number
 of the block device.  This attribute can be used by udev to create disk/by-id
 symlinks for devices that don't have a UUID (filesystem) associated with 
 them.

 ATA_IDENTIFY strings are special in that they can be up to 20 chars long
 and aren't required to be NULL-terminated.  The buffer is also zero-padded
 meaning that if the serial is 19 chars or less that we get a NULL terminated
 string.  When copying this value into a string buffer, we must be careful to
 copy up to the NULL (if it present) and only 20 if it is longer and not to
 attempt to NULL terminate; this isn't needed.

 Signed-off-by: Ryan Harper ry...@us.ibm.com
 Signed-off-by: john cooper john.coo...@redhat.com
 ---
  drivers/block/virtio_blk.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 258bc2a..f1ef26f 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -281,6 +281,31 @@ static int index_to_minor(int index)
  return index  PART_BITS;
  }
  
 +/* Copy serial number from *s to *d.  Copy operation terminates on either
 + * encountering a nul in *s or after n bytes have been copied, whichever
 + * occurs first.  *d is not forcibly nul terminated.  Return # of bytes 
 copied.
 + */
 +static inline int serial_sysfs(char *d, char *s, int n)
 +{
 +char *di = d;
 +
 +while (*s  n--)
 +*d++ = *s++;
 +return d - di;
 +}
 +
 +static ssize_t virtblk_serial_show(struct device *dev,
 +struct device_attribute *attr, char *buf)
 +{
 +struct gendisk *disk = dev_to_disk(dev);
 +char id_str[VIRTIO_BLK_ID_BYTES];
 +
 +if (IS_ERR(virtblk_get_id(disk, id_str)))
 +return 0;
 
 0?  Really?  That doesn't seem very informative.

Propagating a prospective error from virtblk_get_id() should
be possible.  Unsure if doing so is more useful from the
user's perspective compared to just a nul id string.

 +return serial_sysfs(buf, id_str, min(VIRTIO_BLK_ID_BYTES, PAGE_SIZE));
 
 How about something like this:
 
   BUILD_BUG_ON(PAGE_SIZE  VIRTIO_BLK_ID_BYTES + 1);

Agreed, that's a better wrench in the gearworks.
Note padding buf[] by 1 isn't necessary as indicated
below.

   /* id_str is not necessarily nul-terminated! */
   buf[VIRTIO_BLK_ID_BYTES] = '\0';
   return virtblk_get_id(disk, buf);

The /sys file is rendered according to the length
returned from this function and the trailing nul
is not interpreted in this context.  In fact if a
nul is added and included in the byte count of the
string it will appear in the /sys file.

Thanks,

-john


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread john cooper

Ryan Harper wrote:
 * john cooper john.coo...@redhat.com [2010-06-21 01:11]:
 Rusty Russell wrote:
 On Sat, 19 Jun 2010 04:08:02 am Ryan Harper wrote:
 Create a new attribute for virtio-blk devices that will fetch the serial 
 number
 of the block device.  This attribute can be used by udev to create 
 disk/by-id
 symlinks for devices that don't have a UUID (filesystem) associated with 
 them.

 ATA_IDENTIFY strings are special in that they can be up to 20 chars long
 and aren't required to be NULL-terminated.  The buffer is also zero-padded
 meaning that if the serial is 19 chars or less that we get a NULL 
 terminated
 string.  When copying this value into a string buffer, we must be careful 
 to
 copy up to the NULL (if it present) and only 20 if it is longer and not to
 attempt to NULL terminate; this isn't needed.

 Signed-off-by: Ryan Harper ry...@us.ibm.com
 Signed-off-by: john cooper john.coo...@redhat.com
 ---
  drivers/block/virtio_blk.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 258bc2a..f1ef26f 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -281,6 +281,31 @@ static int index_to_minor(int index)
return index  PART_BITS;
  }
  
 +/* Copy serial number from *s to *d.  Copy operation terminates on either
 + * encountering a nul in *s or after n bytes have been copied, whichever
 + * occurs first.  *d is not forcibly nul terminated.  Return # of bytes 
 copied.
 + */
 +static inline int serial_sysfs(char *d, char *s, int n)
 +{
 +  char *di = d;
 +
 +  while (*s  n--)
 +  *d++ = *s++;
 +  return d - di;
 +}
 +
 +static ssize_t virtblk_serial_show(struct device *dev,
 +  struct device_attribute *attr, char *buf)
 +{
 +  struct gendisk *disk = dev_to_disk(dev);
 +  char id_str[VIRTIO_BLK_ID_BYTES];
 +
 +  if (IS_ERR(virtblk_get_id(disk, id_str)))
 +  return 0;
 0?  Really?  That doesn't seem very informative.
 Propagating a prospective error from virtblk_get_id() should
 be possible.  Unsure if doing so is more useful from the
 user's perspective compared to just a nul id string.
 
 I'm not sure we can do any thing else here; maybe printk a warning?
 
 Documentation/filesystems/sysfs.txt says that showing attributes should
 always return the number of chars put into the buffer; so when there is
 an error; zero is the right value to return since we're not filling the
 buffer.

So we return a nul string in the case the qemu user
didn't specify an id string and also in the case a
legacy qemu doesn't support retrieval of an id string.
Not too much difference and if needed going forward the
error return can be elaborated.

 /* id_str is not necessarily nul-terminated! */
 buf[VIRTIO_BLK_ID_BYTES] = '\0';
 return virtblk_get_id(disk, buf);
 The /sys file is rendered according to the length
 returned from this function and the trailing nul
 is not interpreted in this context.  In fact if a
 nul is added and included in the byte count of the
 string it will appear in the /sys file.
 
 Yeah; I like the simplicity; but we do need to know how long the string
 is so we can return that value. 

Which we're getting from serial_sysfs() without
having to accommodate an unused nul.  I'd hazard the
primary reason the sysfs calling code keys off a
return of byte count vs. traversing the string itself
is due to the called function almost always having the
byte count available.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread john cooper

Rusty Russell wrote:
 On Tue, 22 Jun 2010 02:13:21 am Ryan Harper wrote:
 * john cooper john.coo...@redhat.com [2010-06-21 01:11]:
 Rusty Russell wrote:
/* id_str is not necessarily nul-terminated! */
buf[VIRTIO_BLK_ID_BYTES] = '\0';
return virtblk_get_id(disk, buf);
 The /sys file is rendered according to the length
 returned from this function and the trailing nul
 is not interpreted in this context.  In fact if a
 nul is added and included in the byte count of the
 string it will appear in the /sys file.
 Yeah; I like the simplicity; but we do need to know how long the string
 is so we can return that value. 
 
 So we're looking at something like:
 
   /* id_str is not necessarily nul-terminated! */
   buf[VIRTIO_BLK_ID_BYTES] = '\0';
   err = virtblk_get_id(disk, buf);
   if (!err)
   return strlen(buf);
   if (err == -EIO) /* Unsupported?  Make it empty. */ 
   return 0;
   return err;

In my haste reading your prior mail, I'd glossed over
the fact you were copying direct to the sysfs buf.  So
in retrospect that (and the above) do make sense.

Thanks,

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] Remove virtio_blk VBID ioctl

2010-06-20 Thread john cooper

Rusty Russell wrote:
 On Sat, 19 Jun 2010 04:08:03 am Ryan Harper wrote:
 With the availablility of a sysfs device attribute for examining disk serial
 numbers the ioctl is no longer needed.  The user-space changes for this 
 aren't
 upstream yet so we don't have any users to worry about.
 
 If John Cooper acks this, I'll push it to Linus immediately.

Actually I'm the one who suggested removing it.
The code in question was only intended as example
usage of accessing the s/n data in the driver, for
the /sys interface under discussion back then.
That effort subsequently stalled and Ryan had
recently picked it up.  As such I believe this
overshadows the general need for an ioctl.  Even if
for some reason an ioctl would be justified going
forward, a more usage friendly form would be better.
So let's just drop it for now as the corresponding
qemu-side code hasn't been merged.
 
 Unfortunately we offered this interface in 2.6.34, and we're now removing it.
 That's unpleasant.

Indeed.  This entire effort, aside from being an exercise
in protracted agony, probably violates a Rube Goldberg
patent.

Thanks,

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 12/17] Add helper function get_kernel_ns

2010-06-15 Thread john stultz

On Tue, 2010-06-15 at 11:03 -1000, Zachary Amsden wrote:
 On 06/14/2010 10:41 PM, Avi Kivity wrote:
  On 06/15/2010 10:34 AM, Zachary Amsden wrote:
  Add a helper function for the multiple places this is used.  Note 
  that it
  must not be called in preemptible context, as that would mean the kernel
  could enter software suspend state, which would cause non-atomic 
  operation
  of the monotonic_to_bootbased computation.
 
  Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
  as well?  Currently, they are not bootbased (but perhaps should be).
 
  Signed-off-by: Zachary Amsdenzams...@redhat.com
  ---
arch/x86/kvm/x86.c |   26 +-
1 files changed, 13 insertions(+), 13 deletions(-)
 
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index 703ea43..15c7317 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t 
  scaled_khz, uint32_t base_khz,
 __func__, base_khz, scaled_khz, shift, *pmultiplier);
}
 
  +static inline u64 get_kernel_ns(void)
  +{
  +struct timespec ts;
  +
  +WARN_ON(preemptible());
  +ktime_get_ts(ts);
  +monotonic_to_bootbased(ts);
  +return timespec_to_ns(ts);
  +}
  +
static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 
 
  Isn't something like this a candidate for the time infrastructure?
 
 
 Should it be?  It certainly seems reasonable.

Yea, probably should move to timekeeping.c or time.h.

You might also want a more descriptive name, since get_kernel_ns()
doesn't really express that this is the bootbased monotonic time. 

The similar sounding current_kernel_time() returns a coarse tick
granular CLOCK_REALTIME, so it could lead to confusion.

thanks
-john


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread john cooper

Avi Kivity wrote:
 On 06/01/2010 07:38 PM, Andi Kleen wrote:
 Your new code would starve again, right?


 Yes, of course it may starve with unfair spinlock. Since vcpus are not
 always running there is much smaller chance then vcpu on remote memory
 node will starve forever. Old kernels with unfair spinlocks are running
 fine in VMs on NUMA machines with various loads.
  
 Try it on a NUMA system with unfair memory.

 
 We are running everything on NUMA (since all modern machines are now
 NUMA).  At what scale do the issues become observable?
 
 I understand that reason and do not propose to get back to old spinlock
 on physical HW! But with virtualization performance hit is unbearable.
  
 Extreme unfairness can be unbearable too.

 
 Well, the question is what happens first.  In our experience, vcpu
 overcommit is a lot more painful.  People will never see the NUMA
 unfairness issue if they can't use kvm due to the vcpu overcommit problem.

Gleb's observed performance hit seems to be a rather mild
throughput depression compared with creating a worst case by
enforcing vcpu overcommit.  Running a single guest with 2:1
overcommit on a 4 core machine I saw over an order of magnitude
slowdown vs. 1:1 commit with the same kernel build test.
Others have reported similar results.

How close you'll get to that scenario depends on host
scheduling dynamics, and statistically the number of opened
and stalled lock held paths waiting to be contended.  So
I'd expect to see quite variable numbers for guest-guest
aggravation of this problem.

 What I'd like to see eventually is a short-term-unfair, long-term-fair
 spinlock.  Might make sense for bare metal as well.  But it won't be
 easy to write.

Collecting the contention/usage statistics on a per spinlock
basis seems complex.  I believe a practical approximation
to this are adaptive mutexes where upon hitting a spin
time threshold, punt and let the scheduler reconcile fairness.

-john

-- 
john.coo...@third-harmonic.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 >

1 - 100 of 219 matches

Mail list logo