Re: [Qemu-devel] [PATCH 3/5] trace-cmd: Support trace-agent of virtio-trace

2012-08-22 Thread Yoshihiro YUNOMAE

Hi Steven,

(2012/08/22 22:51), Steven Rostedt wrote:

On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:

Add read path and control path to use trace-agent of virtio-trace.
When we use trace-agent, trace-cmd will be used as follows:
# AGENT_READ_DIR=/tmp/virtio-trace/tracing \
  AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
  TRACING_DIR=/tmp/virtio-trace/debugfs/tracing \


Ha! You used "TRACING_DIR" but patch one introduces TRACE_DIR. Let's
change this to DEBUG_TRACING_DIR instead anyway.


Oh, sorry for the confusion.


Also, I don't like the generic environment variables. Perhaps
VIRTIO_TRACE_DIR, or AGENT_TRACE_DIR and AGENT_TRACE_CTL. Let's try to
keep the environment namespace sparse.


OK, I'll change the names of these environment variables as follows:
AGENT_READ_DIR
AGENT_TRACE_CTL
GUEST_TRACING_DIR


  trace-cmd record -e "sched:*"
Here, AGENT_READ_DIR is the path for a reading directory of virtio-trace,
AGENT_CTL is a control path of trace-agent, and TRACING_DIR is a debugfs path
of a guest.

Signed-off-by: Yoshihiro YUNOMAE 
---

  trace-cmd.h  |1 +
  trace-recorder.c |   57 +-
  trace-util.c |   18 +
  3 files changed, 75 insertions(+), 1 deletions(-)

diff --git a/trace-cmd.h b/trace-cmd.h
index f904dc5..75506ed 100644
--- a/trace-cmd.h
+++ b/trace-cmd.h
@@ -72,6 +72,7 @@ static inline int tracecmd_host_bigendian(void)
  }

  char *tracecmd_find_tracing_dir(void);
+char *guest_agent_tracing_read_dir(void);

  /* --- Opening and Reading the trace.dat file --- */

diff --git a/trace-recorder.c b/trace-recorder.c
index 215affc..3b750e9 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -33,6 +33,7 @@
  #include 
  #include 
  #include 
+#include 

  #include "trace-cmd.h"

@@ -43,6 +44,8 @@ struct tracecmd_recorder {
int page_size;
int cpu;
int stop;
+   int ctl_fd;
+   bool agent_existing;


Thanks for the reminder. I need to convert a lot to use 'bool' instead.


I'll convert 'int' fields used only as flags to 'bool' wherever possible
after finishing this patch set.


  };

  void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
@@ -59,11 +62,29 @@ void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
free(recorder);
  }

+static char *use_trace_agent_dir(char *ctl_path,
+   struct tracecmd_recorder *recorder)
+{
+   ctl_path = strdup(ctl_path);
+   if (!ctl_path)
+   die("malloc");
+   warning("Use environmental control path: %s\n", ctl_path);


s/Use/Using/


OK, I'll correct this.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com





[Qemu-devel] [PATCH 3/5] trace-cmd: Support trace-agent of virtio-trace

2012-08-22 Thread Yoshihiro YUNOMAE
Add read path and control path to use trace-agent of virtio-trace.
When we use trace-agent, trace-cmd will be used as follows:
# AGENT_READ_DIR=/tmp/virtio-trace/tracing \
  AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
  TRACING_DIR=/tmp/virtio-trace/debugfs/tracing \
  trace-cmd record -e "sched:*"
Here, AGENT_READ_DIR is the path for a reading directory of virtio-trace,
AGENT_CTL is a control path of trace-agent, and TRACING_DIR is a debugfs path
of a guest.
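
As a rough illustration of the control path described above, the sketch below
sends a start or stop order over an already-open descriptor. The helper name
send_agent_order() is hypothetical. The "1"/"0" payload and the two-byte length
(the character plus its terminating NUL) mirror what this patch writes; the
return-value check and the EINTR retry are additions that the patch does not
have.

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: send a start ("1") or stop ("0") order to the
 * trace-agent through its control path.
 * Returns 0 on success, -1 on error or short write.
 */
int send_agent_order(int ctl_fd, bool run_agent)
{
	const char *order = run_agent ? "1" : "0";
	ssize_t ret;

	/* Two bytes: the order character and the terminating NUL. */
	do {
		ret = write(ctl_fd, order, 2);
	} while (ret < 0 && errno == EINTR);

	return ret == 2 ? 0 : -1;
}
```

A caller would open the path named by AGENT_CTL with O_WRONLY once, then call
send_agent_order(fd, true) before recording and send_agent_order(fd, false)
when stopping.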

Signed-off-by: Yoshihiro YUNOMAE 
---

 trace-cmd.h  |1 +
 trace-recorder.c |   57 +-
 trace-util.c |   18 +
 3 files changed, 75 insertions(+), 1 deletions(-)

diff --git a/trace-cmd.h b/trace-cmd.h
index f904dc5..75506ed 100644
--- a/trace-cmd.h
+++ b/trace-cmd.h
@@ -72,6 +72,7 @@ static inline int tracecmd_host_bigendian(void)
 }
 
 char *tracecmd_find_tracing_dir(void);
+char *guest_agent_tracing_read_dir(void);
 
 /* --- Opening and Reading the trace.dat file --- */
 
diff --git a/trace-recorder.c b/trace-recorder.c
index 215affc..3b750e9 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "trace-cmd.h"
 
@@ -43,6 +44,8 @@ struct tracecmd_recorder {
int page_size;
int cpu;
int stop;
+   int ctl_fd;
+   bool agent_existing;
 };
 
 void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
@@ -59,11 +62,29 @@ void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
free(recorder);
 }
 
+static char *use_trace_agent_dir(char *ctl_path,
+   struct tracecmd_recorder *recorder)
+{
+   ctl_path = strdup(ctl_path);
+   if (!ctl_path)
+   die("malloc");
+   warning("Use environmental control path: %s\n", ctl_path);
+
+   recorder->ctl_fd = open(ctl_path, O_WRONLY);
+   if (recorder->ctl_fd < 0)
+   return NULL;
+
+   recorder->agent_existing = true;
+
+   return guest_agent_tracing_read_dir();
+}
+
 struct tracecmd_recorder *tracecmd_create_recorder_fd(int fd, int cpu)
 {
struct tracecmd_recorder *recorder;
char *tracing = NULL;
char *path = NULL;
+   char *ctl_path = NULL;
int ret;
 
recorder = malloc_or_die(sizeof(*recorder));
@@ -76,12 +97,23 @@ struct tracecmd_recorder *tracecmd_create_recorder_fd(int fd, int cpu)
recorder->trace_fd = -1;
recorder->brass[0] = -1;
recorder->brass[1] = -1;
+   recorder->ctl_fd = -1;
+   recorder->agent_existing = false;
 
recorder->page_size = getpagesize();
 
recorder->fd = fd;
 
-   tracing = tracecmd_find_tracing_dir();
+   /*
+* The trace-agent on a guest is controlled to run or stop by a host,
+* so we need to assign the control path of the trace-agent to use
+* virtio-trace.
+*/
+   ctl_path = getenv("AGENT_CTL");
+   if (ctl_path)
+   tracing = use_trace_agent_dir(ctl_path, recorder);
+   else
+   tracing = tracecmd_find_tracing_dir();
if (!tracing) {
errno = ENODEV;
goto out_free;
@@ -182,6 +214,24 @@ long tracecmd_flush_recording(struct tracecmd_recorder *recorder)
return total;
 }
 
+static void operation_to_trace_agent(int ctl_fd, bool run_agent)
+{
+   if (run_agent == true)
+   write(ctl_fd, "1", 2);
+   else
+   write(ctl_fd, "0", 2);
+}
+
+static void run_operation_to_trace_agent(int ctl_fd)
+{
+   operation_to_trace_agent(ctl_fd, true);
+}
+
+static void stop_operation_to_trace_agent(int ctl_fd)
+{
+   operation_to_trace_agent(ctl_fd, false);
+}
+
 int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long sleep)
 {
struct timespec req;
@@ -189,6 +239,9 @@ int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long s
 
recorder->stop = 0;
 
+   if (recorder->agent_existing)
+   run_operation_to_trace_agent(recorder->ctl_fd);
+   
do {
if (sleep) {
req.tv_sec = sleep / 100;
@@ -214,6 +267,8 @@ void tracecmd_stop_recording(struct tracecmd_recorder *recorder)
if (!recorder)
return;
 
+   if (recorder->agent_existing)
+   stop_operation_to_trace_agent(recorder->ctl_fd);
recorder->stop = 1;
 }
 
diff --git a/trace-util.c b/trace-util.c
index d5a3eb4..ff639be 100644
--- a/trace-util.c
+++ b/trace-util.c
@@ -304,6 +304,24 @@ static int mount_debugfs(void)
return ret;
 }
 
+char *guest_agent_tracing_read_dir(void)
+{
+   char *tracing_read_dir;
+
+   tracing_read_d

[Qemu-devel] [PATCH 4/5] trace-cmd: Add non-blocking option for open() and splice_read()

2012-08-22 Thread Yoshihiro YUNOMAE
Add the non-blocking option to open() and splice_read() so that trace-cmd does
not block while reading a guest's trace data from a FIFO.

When a FIFO is assigned as the read I/F, reading normally blocks in
splice_read(), even when SIGINT arrives at the read/write processes from the
parent process. So, add the nonblock option to open() and splice_read().
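
The idea can be sketched in user space as follows. This is only a minimal
illustration, not the patch itself: it uses plain read() instead of splice()
to stay portable, and the helper names set_nonblocking() and read_nonblock()
are hypothetical. As in the patch, EAGAIN is treated as "buffer is empty"
rather than as an error.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch an already-open descriptor to non-blocking mode. */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Read without blocking. Returns the number of bytes read, 0 when no
 * data is available yet (or the writer closed), and -1 on a real error.
 */
ssize_t read_nonblock(int fd, void *buf, size_t len)
{
	ssize_t ret = read(fd, buf, len);

	if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0; /* Buffer is empty */
	return ret;
}
```

Because the call returns immediately, the caller gets a chance to check its
stop flag between reads instead of sleeping inside the kernel.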

Signed-off-by: Yoshihiro YUNOMAE 
---

 trace-recorder.c |   13 -
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/trace-recorder.c b/trace-recorder.c
index 3b750e9..6577fe8 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -124,7 +124,7 @@ struct tracecmd_recorder *tracecmd_create_recorder_fd(int fd, int cpu)
goto out_free;
 
sprintf(path, "%s/per_cpu/cpu%d/trace_pipe_raw", tracing, cpu);
-   recorder->trace_fd = open(path, O_RDONLY);
+   recorder->trace_fd = open(path, O_RDONLY | O_NONBLOCK);
if (recorder->trace_fd < 0)
goto out_free;
 
@@ -172,14 +172,17 @@ static long splice_data(struct tracecmd_recorder *recorder)
long ret;
 
ret = splice(recorder->trace_fd, NULL, recorder->brass[1], NULL,
-recorder->page_size, 1 /* SPLICE_F_MOVE */);
+recorder->page_size, SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
if (ret < 0) {
-   warning("recorder error in splice input");
-   return -1;
+   if (errno != EAGAIN) {
+   warning("recorder error in splice input");
+   return -1;
+   }
+   return 0; /* Buffer is empty */
}
 
ret = splice(recorder->brass[0], NULL, recorder->fd, NULL,
-recorder->page_size, 3 /* and NON_BLOCK */);
+recorder->page_size, SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
if (ret < 0) {
if (errno != EAGAIN) {
warning("recorder error in splice output");





[Qemu-devel] [PATCH 5/5] trace-cmd: Use polling function

2012-08-22 Thread Yoshihiro YUNOMAE
Use poll() to avoid a busy loop when reading a guest's trace data from a FIFO.
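
A minimal sketch of waiting with poll() is shown here for illustration. The
helper names wait_readable() and read_when_ready() are hypothetical, and
returning 0 on EINTR (so that the caller can re-check its stop flag after a
signal) is a design assumption, not something taken from this patch.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until fd becomes readable.
 * Returns 1 when data (or EOF) is ready, 0 on timeout or when interrupted
 * by a signal, and -1 on error. A negative timeout_ms waits indefinitely.
 */
int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return errno == EINTR ? 0 : -1;
	if (ret == 0)
		return 0;
	return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}

/*
 * Read only after poll() reports the descriptor as ready.
 * Note that 0 means either "nothing arrived in time" or end of file.
 */
ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
	int ready = wait_readable(fd, timeout_ms);

	if (ready <= 0)
		return ready;
	return read(fd, buf, len);
}
```

The process sleeps in the kernel until data arrives or the timeout expires, so
an idle FIFO costs no CPU time between wakeups.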

Signed-off-by: Yoshihiro YUNOMAE 
---

 trace-recorder.c |   42 --
 1 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/trace-recorder.c b/trace-recorder.c
index 6577fe8..bdf9798 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -34,9 +34,12 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "trace-cmd.h"
 
+#define WAIT_MSEC 1
+
 struct tracecmd_recorder {
int fd;
int trace_fd;
@@ -235,9 +238,37 @@ static void stop_operation_to_trace_agent(int ctl_fd)
operation_to_trace_agent(ctl_fd, false);
 }
 
-int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long sleep)
+static int wait_data(struct tracecmd_recorder *recorder, unsigned long sleep)
 {
+   struct pollfd poll_fd;
struct timespec req;
+   int ret = 0;
+
+   if (recorder->agent_existing) {
+   poll_fd.fd = recorder->trace_fd;
+   poll_fd.events = POLLIN;
+   while (1) {
+   ret = poll(&poll_fd, 1, WAIT_MSEC);
+
+   if(ret < 0) {
+   warning("polling error");
+   return ret;
+   }
+
+   if (ret)
+   break;
+   }
+   } else if (sleep) {
+   req.tv_sec = sleep / 100;
+   req.tv_nsec = (sleep % 100) * 1000;
+   nanosleep(&req, NULL);
+   }
+
+   return ret;
+}
+
+int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long sleep)
+{
long ret;
 
recorder->stop = 0;
@@ -246,11 +277,10 @@ int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long s
run_operation_to_trace_agent(recorder->ctl_fd);

do {
-   if (sleep) {
-   req.tv_sec = sleep / 100;
-   req.tv_nsec = (sleep % 100) * 1000;
-   nanosleep(&req, NULL);
-   }
+   ret = wait_data(recorder, sleep);
+   if (ret < 0)
+   return ret;
+
ret = splice_data(recorder);
if (ret < 0)
return ret;





[Qemu-devel] [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs

2012-08-22 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Count debugfs/tracing/per_cpu/cpu* to determine the
number of CPUs.
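
The counting step can be sketched as below. This is a simplified stand-alone
variant, not the code in the diff: it uses opendir()/readdir() instead of
scandir(), the function name count_tracing_cpus() is hypothetical, and it
additionally requires a digit after "cpu", which is stricter than the
three-character prefix match used in the patch.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count the "cpuN" entries under <tracing_dir>/per_cpu.
 * Returns the count, or -1 if the directory cannot be opened.
 */
int count_tracing_cpus(const char *tracing_dir)
{
	static const char suffix[] = "/per_cpu";
	size_t len = strlen(tracing_dir) + sizeof(suffix); /* includes NUL */
	char *percpu_dir = malloc(len);
	struct dirent *ent;
	DIR *dir;
	int count = 0;

	if (!percpu_dir)
		return -1;
	snprintf(percpu_dir, len, "%s%s", tracing_dir, suffix);

	dir = opendir(percpu_dir);
	free(percpu_dir);
	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		/* Assumption: only names of the form "cpu<digit>..." count. */
		if (strncmp(ent->d_name, "cpu", 3) == 0 &&
		    isdigit((unsigned char)ent->d_name[3]))
			count++;
	}
	closedir(dir);
	return count;
}
```

Counting directories rather than asking sysconf() matters when the tracing
directory belongs to a guest whose CPU count differs from the host's.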

Signed-off-by: Masami Hiramatsu 
Signed-off-by: Yoshihiro YUNOMAE 
---

 trace-record.c |   41 +
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/trace-record.c b/trace-record.c
index 9dc18a9..ed18951 100644
--- a/trace-record.c
+++ b/trace-record.c
@@ -1179,6 +1179,41 @@ static void expand_event_list(void)
}
 }
 
+static int count_tracingdir_cpus(void)
+{
+   char *tracing_dir = NULL;
+   char *percpu_dir = NULL;
+   struct dirent **namelist;
+   int count = 0, n;
+
+   /* Count cpus in per_cpu directory */
+   tracing_dir = tracecmd_find_tracing_dir();
+   if (!tracing_dir)
+   return 0;
+   percpu_dir = malloc_or_die(strlen(tracing_dir) + 9);
+   if (!percpu_dir)
+   goto err;
+
+   sprintf(percpu_dir, "%s/per_cpu", tracing_dir);
+
+   n = scandir(percpu_dir, &namelist, NULL, alphasort);
+   if (n > 0) {
+   while (n--) {
+   if (strncmp("cpu", namelist[n]->d_name, 3) == 0)
+   count++;
+   free(namelist[n]);
+   }
+   free(namelist);
+   }
+
+   if (percpu_dir)
+   free(percpu_dir);
+err:
+   if (tracing_dir)
+   free(tracing_dir);
+   return count;
+}
+
 static int count_cpus(void)
 {
FILE *fp;
@@ -1189,6 +1224,12 @@ static int count_cpus(void)
size_t n;
int r;
 
+   cpus = count_tracingdir_cpus();
+   if (cpus > 0)
+   return cpus;
+
+   warning("failed to use tracing_dir to determine number of CPUS");
+
cpus = sysconf(_SC_NPROCESSORS_CONF);
if (cpus > 0)
return cpus;





[Qemu-devel] [PATCH 1/5] trace-cmd: Use TRACE_DIR environment variable if defined

2012-08-22 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Use the TRACE_DIR environment variable to set the
debugfs/tracing directory if it is defined. This is
for controlling guest (or remote) ftrace.

Signed-off-by: Masami Hiramatsu 
Signed-off-by: Yoshihiro YUNOMAE 
---

 trace-util.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/trace-util.c b/trace-util.c
index e128188..d5a3eb4 100644
--- a/trace-util.c
+++ b/trace-util.c
@@ -311,6 +311,15 @@ char *tracecmd_find_tracing_dir(void)
char type[100];
FILE *fp;

+   tracing_dir = getenv("TRACE_DIR");
+   if (tracing_dir) {
+   tracing_dir = strdup(tracing_dir);
+   if (!tracing_dir)
+   die("malloc");
+   warning("Use environmental tracing directory: %s\n", tracing_dir);
+   return tracing_dir;
+   }
+
if ((fp = fopen("/proc/mounts","r")) == NULL) {
warning("Can't open /proc/mounts for read");
return NULL;





[Qemu-devel] [PATCH 0/5] trace-cmd: Add a recorder readable feature for virtio-trace

2012-08-22 Thread Yoshihiro YUNOMAE
Hi Steven,

The following patch set adds a feature to the recorder function of trace-cmd
that reads trace data of a guest using virtio-trace
(https://lkml.org/lkml/2012/8/9/210). This patch set depends on the trace-agent
running on a guest in the virtio-trace system.

To translate raw data of a guest to text data on a host, information from the
guest's debugfs is also needed on the host. In other words, the guest's debugfs
must be exposed (mounted) on the host via another serial line (we prefer not to
depend on a network connection). For this purpose, we'll use the DIOD 9pfs
server (http://code.google.com/p/diod/) as below.

***HOW TO USE***
This section explains how to translate raw data to text data on a host using
virtio-trace and trace-cmd with this patch set applied.

- Preparation
1. Make FIFOs on the host
 virtio-trace uses virtio-serial pipes as trace data paths, one per CPU, plus
a control path, so FIFOs (named pipes) should be created as follows:
# mkdir /tmp/virtio-trace/
# mkfifo /tmp/virtio-trace/trace-path-cpu{0,1,2,...,X}.{in,out}
# mkfifo /tmp/virtio-trace/agent-ctl-path.{in,out}

Here, if we assign 1 VCPU to a guest, then we create the following:
trace-path-cpu0.{in,out}
and
agent-ctl-path.{in,out}.
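
For illustration, the same FIFO layout can be created from C as sketched below.
The function names make_fifo_pair() and make_virtio_trace_fifos(), and the
0600/0700 modes, are assumptions for this example; the FIFO names themselves
follow the mkfifo commands above.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create "<dir>/<name>.in" and "<dir>/<name>.out"; existing files are kept. */
static int make_fifo_pair(const char *dir, const char *name)
{
	static const char *const suffixes[] = { "in", "out" };
	char path[4096];
	size_t i;

	for (i = 0; i < 2; i++) {
		int n = snprintf(path, sizeof(path), "%s/%s.%s",
				 dir, name, suffixes[i]);

		if (n < 0 || (size_t)n >= sizeof(path))
			return -1;
		if (mkfifo(path, 0600) < 0 && errno != EEXIST)
			return -1;
	}
	return 0;
}

/* Create the control FIFO pair and one data FIFO pair per guest CPU. */
int make_virtio_trace_fifos(const char *dir, int nr_cpus)
{
	char name[32];
	int cpu;

	if (mkdir(dir, 0700) < 0 && errno != EEXIST)
		return -1;
	if (make_fifo_pair(dir, "agent-ctl-path") < 0)
		return -1;
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		snprintf(name, sizeof(name), "trace-path-cpu%d", cpu);
		if (make_fifo_pair(dir, name) < 0)
			return -1;
	}
	return 0;
}
```

For the 1 VCPU case, make_virtio_trace_fifos("/tmp/virtio-trace", 1) would
produce the four FIFOs listed above. The permissions may need to be widened so
that the QEMU process can open them.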

2. Set up the virtio-serial pipes and a UNIX domain socket on the host
 Add qemu options to use virtio-serial pipes for tracing and a UNIX domain
socket for debugfs.

 ##virtio-serial device##
 -device virtio-serial-pci,id=virtio-serial0\
 ##control path##
 -chardev pipe,id=charchannel0,path=/tmp/virtio-trace/agent-ctl-path\
 -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,\
  id=channel0,name=agent-ctl-path\
 ##data path##
 -chardev pipe,id=charchannel1,path=/tmp/virtio-trace/trace-path-cpu0\
 -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,\
  id=channel1,name=trace-path-cpu0\
 ##9pfs path##
 -device virtio-serial \
 -chardev socket,path=/tmp/virtio-trace/trace-9pfs,server,nowait, \
  id=trace-9pfs \
 -device virtserialport,chardev=trace-9pfs,name=virtioserial

If you manage guests with libvirt, add the following tags to domain XML files.
Then, libvirt passes the same command option to qemu.


   
   
   


   
   
   


   
   
   

Here, chardev names are restricted to trace-path-cpu0 and agent-ctl-path. The
UNIX domain socket is automatically created on the host.

3. Boot the guest
 The character devices appear under /dev/virtio-ports/ in the guest.

4. Create symbolic link for trace-cmd on the host
# ln -s /tmp/virtio-trace/trace-path-cpu0.out \
  /tmp/virtio-trace/tracing/per_cpu/cpu0/trace_pipe_raw

5. Wait for 9pfs server on the host
# mount -t 9p -o trans=unix,access=any,uname=root, \
  aname=/sys/kernel/debug,version=9p2000.L \
  /tmp/virtio-trace/trace-9pfs /tmp/virtio-trace/debugfs

6. Run DIOD on the guest
# diod -E -Nn -u 0

7. Connect DIOD to virtio-console on the guest
# socat TCP4:127.0.0.1:564 /dev/virtio-ports/trace-9pfs

- Execution
1. Run trace-agent on the guest
# ./trace-agent

2. Execute trace-cmd on the host
# AGENT_READ_DIR=/tmp/virtio-trace/tracing \
  AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
  TRACE_DIR=/tmp/virtio-trace/debugfs/tracing \
  ./trace-cmd record -e "sched:*"

3. Translate raw data to text data on the host
# ./trace-cmd report trace.dat

***Just enhancement ideas***
 - Support for trace-cmd => done
 - Support for 9pfs protocol
 - Support for non-blocking mode in QEMU

Thank you,

---

Masami Hiramatsu (2):
  trace-cmd: Use tracing directory to count CPUs
  trace-cmd: Use TRACE_DIR environment variable if defined

Yoshihiro YUNOMAE (3):
  trace-cmd: Use polling function
  trace-cmd: Add non-blocking option for open() and splice_read()
  trace-cmd: Support trace-agent of virtio-trace


 trace-cmd.h  |1 
 trace-record.c   |   41 
 trace-recorder.c |  112 --
 trace-util.c |   27 +
 4 files changed, 169 insertions(+), 12 deletions(-)

-- 
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com




[Qemu-devel] [PATCH V2 6/6] tools: Add guest trace agent as a user tool

2012-08-09 Thread Yoshihiro YUNOMAE
This patch adds a user tool, "trace agent", for sending trace data of a guest
to a host with low overhead. This agent has the following functions:
 - splice a page of ring-buffer to read_pipe without memory copying
 - splice the page from write_pipe to virtio-console without memory copying
 - write trace data to stdout by using -o option
 - controlled by start/stop orders from a host

Changes in v2:
 - Cleanup (change fprintf() to pr_err() and an include guard)

Signed-off-by: Yoshihiro YUNOMAE 
---

 tools/virtio/virtio-trace/Makefile  |   14 +
 tools/virtio/virtio-trace/README|  118 
 tools/virtio/virtio-trace/trace-agent-ctl.c |  137 ++
 tools/virtio/virtio-trace/trace-agent-rw.c  |  192 +++
 tools/virtio/virtio-trace/trace-agent.c |  270 +++
 tools/virtio/virtio-trace/trace-agent.h |   75 
 6 files changed, 806 insertions(+), 0 deletions(-)
 create mode 100644 tools/virtio/virtio-trace/Makefile
 create mode 100644 tools/virtio/virtio-trace/README
 create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.h

diff --git a/tools/virtio/virtio-trace/Makefile b/tools/virtio/virtio-trace/Makefile
new file mode 100644
index 000..ef3adfc
--- /dev/null
+++ b/tools/virtio/virtio-trace/Makefile
@@ -0,0 +1,14 @@
+CC = gcc
+CFLAGS = -O2 -Wall
+LFLAG = -lpthread
+
+all: trace-agent
+
+.c.o:
+   $(CC) $(CFLAGS) $(LFLAG) -c $^ -o $@
+
+trace-agent: trace-agent.o trace-agent-ctl.o trace-agent-rw.o
+   $(CC) $(CFLAGS) $(LFLAG) -o $@ $^
+
+clean:
+   rm -f *.o trace-agent
diff --git a/tools/virtio/virtio-trace/README b/tools/virtio/virtio-trace/README
new file mode 100644
index 000..b64845b
--- /dev/null
+++ b/tools/virtio/virtio-trace/README
@@ -0,0 +1,118 @@
+Trace Agent for virtio-trace
+
+
+Trace agent is a user tool for sending trace data of a guest to a host with low
+overhead. Trace agent has the following functions:
+ - splice a page of ring-buffer to read_pipe without memory copying
+ - splice the page from write_pipe to virtio-console without memory copying
+ - write trace data to stdout by using -o option
+ - controlled by start/stop orders from a Host
+
+The trace agent operates as follows:
+ 1) Initialize all structures.
+ 2) Create a read/write thread per CPU. Each thread is bound to a CPU.
+The read/write threads hold it.
+ 3) A controller thread does poll() for a start order of a host.
+ 4) After the controller of the trace agent receives a start order from a host,
+the controller wakes the read/write threads.
+ 5) The read/write threads start to read trace data from ring-buffers and
+write the data to virtio-serial.
+ 6) If the controller receives a stop order from a host, the read/write threads
+stop reading trace data.
+
+
+Files
+=
+
+README: this file
+Makefile: Makefile of trace agent for virtio-trace
+trace-agent.c: includes main function, sets up for operating trace agent
+trace-agent.h: includes all structures and some macros
+trace-agent-ctl.c: includes controller function for read/write threads
+trace-agent-rw.c: includes read/write threads function
+
+
+Setup
+=
+
+To use this trace agent for virtio-trace, we need to prepare some virtio-serial
+I/Fs.
+
+1) Make FIFO in a host
+ virtio-trace uses virtio-serial pipe as trace data paths as to the number
+of CPUs and a control path, so FIFO (named pipe) should be created as follows:
+   # mkdir /tmp/virtio-trace/
+   # mkfifo /tmp/virtio-trace/trace-path-cpu{0,1,2,...,X}.{in,out}
+   # mkfifo /tmp/virtio-trace/agent-ctl-path.{in,out}
+
+For example, if a guest uses three CPUs, the names are
+   trace-path-cpu{0,1,2}.{in,out}
+and
+   agent-ctl-path.{in,out}.
+
+2) Set up of virtio-serial pipe in a host
+ Add qemu option to use virtio-serial pipe.
+
+ ##virtio-serial device##
+ -device virtio-serial-pci,id=virtio-serial0\
+ ##control path##
+ -chardev pipe,id=charchannel0,path=/tmp/virtio-trace/agent-ctl-path\
+ -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,\
+  id=channel0,name=agent-ctl-path\
+ ##data path##
+ -chardev pipe,id=charchannel1,path=/tmp/virtio-trace/trace-path-cpu0\
+ -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,\
+  id=channel1,name=trace-path-cpu0\
+  ...
+
+If you manage guests with libvirt, add the following tags to domain XML files.
+Then, libvirt passes the same command option to qemu.
+
+   
+  
+  
+  
+   
+   
+  
+  
+  
+   
+   ...
+Here, chardev names are restricted to trace-path-cpuX and agent-ctl-path. For
+example, if a guest uses three CPUs, chardev names should be trace-path-cpu0,
+trace-path-c

[Qemu-devel] [PATCH V2 4/6] ftrace: Allow stealing pages from pipe buffer

2012-08-09 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Use the generic steal operation on the pipe buffer to allow stealing
the ring buffer's read page from the pipe buffer.

Note that this could reduce the performance of the splice_write
side when no affinity is set. The ring buffer's read pages are
allocated on the tracing node, but the splice user does not always
execute the splice write side on the same node. In that case,
the page is accessed from another node.
Thus, it is strongly recommended to assign the splicing
thread to the corresponding node.

Signed-off-by: Masami Hiramatsu 
Acked-by: Steven Rostedt 
---

 kernel/trace/trace.c |8 +---
 1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a120f98..ae01930 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4194,12 +4194,6 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
buf->private = 0;
 }
 
-static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe,
-struct pipe_buffer *buf)
-{
-   return 1;
-}
-
 static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
 {
@@ -4215,7 +4209,7 @@ static const struct pipe_buf_operations buffer_pipe_buf_ops = {
.unmap  = generic_pipe_buf_unmap,
.confirm= generic_pipe_buf_confirm,
.release= buffer_pipe_buf_release,
-   .steal  = buffer_pipe_buf_steal,
+   .steal  = generic_pipe_buf_steal,
.get= buffer_pipe_buf_get,
 };
 





[Qemu-devel] [PATCH V2 5/6] virtio/console: Allocate scatterlist according to the current pipe size

2012-08-09 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Allocate scatterlist according to the current pipe size.
This allows splicing a bigger buffer if the pipe size has
been changed by fcntl.

Changes in v2:
 - Just a minor fix to avoid a conflict with the previous patch.

Signed-off-by: Masami Hiramatsu 
---

 drivers/char/virtio_console.c |   23 ---
 1 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index b2fc2ab..e88f843 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -229,7 +229,6 @@ struct port {
bool guest_connected;
 };
 
-#define MAX_SPLICE_PAGES   32
 /* This is the very early arch-specified put chars function. */
 static int (*early_put_chars)(u32, const char *, int);
 
@@ -482,15 +481,16 @@ struct buffer_token {
void *buf;
struct scatterlist *sg;
} u;
-   bool sgpages;
+   /* If sgpages == 0 then buf is used, else sg is used */
+   unsigned int sgpages;
 };
 
-static void reclaim_sg_pages(struct scatterlist *sg)
+static void reclaim_sg_pages(struct scatterlist *sg, unsigned int nrpages)
 {
int i;
struct page *page;
 
-   for (i = 0; i < MAX_SPLICE_PAGES; i++) {
+   for (i = 0; i < nrpages; i++) {
page = sg_page(&sg[i]);
if (!page)
break;
@@ -511,7 +511,7 @@ static void reclaim_consumed_buffers(struct port *port)
}
while ((tok = virtqueue_get_buf(port->out_vq, &len))) {
if (tok->sgpages)
-   reclaim_sg_pages(tok->u.sg);
+   reclaim_sg_pages(tok->u.sg, tok->sgpages);
else
kfree(tok->u.buf);
kfree(tok);
@@ -581,7 +581,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
if (!tok)
return -ENOMEM;
-   tok->sgpages = false;
+   tok->sgpages = 0;
tok->u.buf = in_buf;
 
sg_init_one(sg, in_buf, in_count);
@@ -597,7 +597,7 @@ static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents,
tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
if (!tok)
return -ENOMEM;
-   tok->sgpages = true;
+   tok->sgpages = nents;
tok->u.sg = sg;
 
return __send_to_port(port, sg, nents, in_count, tok, nonblock);
@@ -797,6 +797,7 @@ out:
 
 struct sg_list {
unsigned int n;
+   unsigned int size;
size_t len;
struct scatterlist *sg;
 };
@@ -807,7 +808,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct sg_list *sgl = sd->u.data;
unsigned int offset, len;
 
-   if (sgl->n == MAX_SPLICE_PAGES)
+   if (sgl->n == sgl->size)
return 0;
 
/* Try lock this page */
@@ -868,12 +869,12 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
 
sgl.n = 0;
sgl.len = 0;
-   sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,
-GFP_KERNEL);
+   sgl.size = pipe->nrbufs;
+   sgl.sg = kmalloc(sizeof(struct scatterlist) * sgl.size, GFP_KERNEL);
if (unlikely(!sgl.sg))
return -ENOMEM;
 
-   sg_init_table(sgl.sg, MAX_SPLICE_PAGES);
+   sg_init_table(sgl.sg, sgl.size);
ret = __splice_from_pipe(pipe, &sd, pipe_to_sg);
if (likely(ret > 0))
ret = send_pages(port, sgl.sg, sgl.n, sgl.len, true);





[Qemu-devel] [PATCH V2 3/6] virtio/console: Wait until the port is ready on splice

2012-08-09 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

On splice, wait if the port is not connected or is full,
as write already does.

Signed-off-by: Masami Hiramatsu 
---

 drivers/char/virtio_console.c |   39 +++
 1 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 22b7373..b2fc2ab 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -724,6 +724,26 @@ static ssize_t port_fops_read(struct file *filp, char __user *ubuf,
return fill_readbuf(port, ubuf, count, true);
 }
 
+static int wait_port_writable(struct port *port, bool nonblock)
+{
+   int ret;
+
+   if (will_write_block(port)) {
+   if (nonblock)
+   return -EAGAIN;
+
+   ret = wait_event_freezable(port->waitqueue,
+  !will_write_block(port));
+   if (ret < 0)
+   return ret;
+   }
+   /* Port got hot-unplugged. */
+   if (!port->guest_connected)
+   return -ENODEV;
+
+   return 0;
+}
+
 static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
   size_t count, loff_t *offp)
 {
@@ -740,18 +760,9 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 
nonblock = filp->f_flags & O_NONBLOCK;
 
-   if (will_write_block(port)) {
-   if (nonblock)
-   return -EAGAIN;
-
-   ret = wait_event_freezable(port->waitqueue,
-  !will_write_block(port));
-   if (ret < 0)
-   return ret;
-   }
-   /* Port got hot-unplugged. */
-   if (!port->guest_connected)
-   return -ENODEV;
+   ret = wait_port_writable(port, nonblock);
+   if (ret < 0)
+   return ret;
 
count = min((size_t)(32 * 1024), count);
 
@@ -851,6 +862,10 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
.u.data = &sgl,
};
 
+   ret = wait_port_writable(port, filp->f_flags & O_NONBLOCK);
+   if (ret < 0)
+   return ret;
+
sgl.n = 0;
sgl.len = 0;
sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,





[Qemu-devel] [PATCH V2 2/6] virtio/console: Add a fallback for unstealable pipe buffer

2012-08-09 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Add a fallback memcpy path for unstealable pipe buffers.
If buf->ops->steal() fails, virtio-serial tries to
copy the page contents to an allocated page instead
of just failing splice().
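
The length handling in the copy path can be illustrated in isolation. The
sketch below is a user-space restatement of the offset/len arithmetic used in
the fallback branch of this patch, not kernel code; the struct copy_span and
the function clamp_to_page() are hypothetical names, and page_size is assumed
to be a power of two.

```c
#include <stddef.h>

struct copy_span {
	size_t offset; /* where in the destination page the data starts */
	size_t len;    /* how many bytes fit in the rest of that page */
};

/*
 * Compute the in-page offset of pos and clamp len so that the copy
 * never crosses the end of the page. page_size must be a power of two.
 */
struct copy_span clamp_to_page(size_t pos, size_t len, size_t page_size)
{
	struct copy_span span;

	span.offset = pos & (page_size - 1);
	span.len = len;
	if (span.len + span.offset > page_size)
		span.len = page_size - span.offset;
	return span;
}
```

For example, with 4096-byte pages, a position of 4196 and a requested length of
5000 give an offset of 100 and a clamped length of 3996 bytes.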

Signed-off-by: Masami Hiramatsu 
---

 drivers/char/virtio_console.c |   28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 730816c..22b7373 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -794,7 +794,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct splice_desc *sd)
 {
struct sg_list *sgl = sd->u.data;
-   unsigned int len = 0;
+   unsigned int offset, len;
 
if (sgl->n == MAX_SPLICE_PAGES)
return 0;
@@ -807,9 +807,31 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 
len = min(buf->len, sd->len);
sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset);
-   sgl->n++;
-   sgl->len += len;
+   } else {
+   /* Failback to copying a page */
+   struct page *page = alloc_page(GFP_KERNEL);
+   char *src = buf->ops->map(pipe, buf, 1);
+   char *dst;
+
+   if (!page)
+   return -ENOMEM;
+   dst = kmap(page);
+
+   offset = sd->pos & ~PAGE_MASK;
+
+   len = sd->len;
+   if (len + offset > PAGE_SIZE)
+   len = PAGE_SIZE - offset;
+
+   memcpy(dst + offset, src + buf->offset, len);
+
+   kunmap(page);
+   buf->ops->unmap(pipe, buf, src);
+
+   sg_set_page(&(sgl->sg[sgl->n]), page, len, offset);
}
+   sgl->n++;
+   sgl->len += len;
 
return len;
 }





[Qemu-devel] [PATCH V2 1/6] virtio/console: Add splice_write support

2012-08-09 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Enable the use of splice_write from a pipe to a virtio-console port.
This steals pages from the pipe and sends them directly to the host.

Note that this may accelerate only the guest-to-host path.

Changes in v2:
 - Use GFP_KERNEL instead of GFP_ATOMIC in syscall context function.

Signed-off-by: Masami Hiramatsu 
---

 drivers/char/virtio_console.c |  136 +++--
 1 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index cdf2f54..730816c 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -24,6 +24,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -227,6 +229,7 @@ struct port {
bool guest_connected;
 };
 
+#define MAX_SPLICE_PAGES   32
 /* This is the very early arch-specified put chars function. */
 static int (*early_put_chars)(u32, const char *, int);
 
@@ -474,26 +477,52 @@ static ssize_t send_control_msg(struct port *port, unsigned int event,
return 0;
 }
 
+struct buffer_token {
+   union {
+   void *buf;
+   struct scatterlist *sg;
+   } u;
+   bool sgpages;
+};
+
+static void reclaim_sg_pages(struct scatterlist *sg)
+{
+   int i;
+   struct page *page;
+
+   for (i = 0; i < MAX_SPLICE_PAGES; i++) {
+   page = sg_page(&sg[i]);
+   if (!page)
+   break;
+   put_page(page);
+   }
+   kfree(sg);
+}
+
 /* Callers must take the port->outvq_lock */
 static void reclaim_consumed_buffers(struct port *port)
 {
-   void *buf;
+   struct buffer_token *tok;
unsigned int len;
 
if (!port->portdev) {
/* Device has been unplugged.  vqs are already gone. */
return;
}
-   while ((buf = virtqueue_get_buf(port->out_vq, &len))) {
-   kfree(buf);
+   while ((tok = virtqueue_get_buf(port->out_vq, &len))) {
+   if (tok->sgpages)
+   reclaim_sg_pages(tok->u.sg);
+   else
+   kfree(tok->u.buf);
+   kfree(tok);
port->outvq_full = false;
}
 }
 
-static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
-   bool nonblock)
+static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
+ int nents, size_t in_count,
+ struct buffer_token *tok, bool nonblock)
 {
-   struct scatterlist sg[1];
struct virtqueue *out_vq;
ssize_t ret;
unsigned long flags;
@@ -505,8 +534,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
 
reclaim_consumed_buffers(port);
 
-   sg_init_one(sg, in_buf, in_count);
-   ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC);
+   ret = virtqueue_add_buf(out_vq, sg, nents, 0, tok, GFP_ATOMIC);
 
/* Tell Host to go! */
virtqueue_kick(out_vq);
@@ -544,6 +572,37 @@ done:
return in_count;
 }
 
+static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
+   bool nonblock)
+{
+   struct scatterlist sg[1];
+   struct buffer_token *tok;
+
+   tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
+   if (!tok)
+   return -ENOMEM;
+   tok->sgpages = false;
+   tok->u.buf = in_buf;
+
+   sg_init_one(sg, in_buf, in_count);
+
+   return __send_to_port(port, sg, 1, in_count, tok, nonblock);
+}
+
+static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents,
+ size_t in_count, bool nonblock)
+{
+   struct buffer_token *tok;
+
+   tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
+   if (!tok)
+   return -ENOMEM;
+   tok->sgpages = true;
+   tok->u.sg = sg;
+
+   return __send_to_port(port, sg, nents, in_count, tok, nonblock);
+}
+
 /*
  * Give out the data that's requested from the buffer that we have
  * queued up.
@@ -725,6 +784,66 @@ out:
return ret;
 }
 
+struct sg_list {
+   unsigned int n;
+   size_t len;
+   struct scatterlist *sg;
+};
+
+static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+   struct splice_desc *sd)
+{
+   struct sg_list *sgl = sd->u.data;
+   unsigned int len = 0;
+
+   if (sgl->n == MAX_SPLICE_PAGES)
+   return 0;
+
+   /* Try lock this page */
+   if (buf->ops->steal(pipe, buf) == 0) {
+   /* Get reference and unlock page for moving */
+   get_page(buf->page);
+   unlock_page(buf->page);
+
+   len = min(buf->len, sd->len);
+   sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset);
+   sgl->n++;
+   sgl->len += len;
+   }
+
+   return len;
+}
+
+/* Faster zero-copy w

[Qemu-devel] [PATCH V2 0/6] virtio-trace: Support virtio-trace

2012-08-09 Thread Yoshihiro YUNOMAE
Use GFP_KERNEL instead of GFP_ATOMIC in syscall context function in 1/6
 - Just a minor fix to avoid a conflict with the previous patch in 5/6
 - Cleanup (change fprintf() to pr_err() and an include guard) in 6/6

Thank you,

---

Masami Hiramatsu (5):
  virtio/console: Allocate scatterlist according to the current pipe size
  ftrace: Allow stealing pages from pipe buffer
  virtio/console: Wait until the port is ready on splice
  virtio/console: Add a failback for unstealable pipe buffer
  virtio/console: Add splice_write support

Yoshihiro YUNOMAE (1):
  tools: Add guest trace agent as a user tool


 drivers/char/virtio_console.c   |  198 ++--
 kernel/trace/trace.c|8 -
 tools/virtio/virtio-trace/Makefile  |   14 +
 tools/virtio/virtio-trace/README|  118 
 tools/virtio/virtio-trace/trace-agent-ctl.c |  137 ++
 tools/virtio/virtio-trace/trace-agent-rw.c  |  192 +++
 tools/virtio/virtio-trace/trace-agent.c |  270 +++
 tools/virtio/virtio-trace/trace-agent.h |   75 
 8 files changed, 985 insertions(+), 27 deletions(-)
 create mode 100644 tools/virtio/virtio-trace/Makefile
 create mode 100644 tools/virtio/virtio-trace/README
 create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.h

-- 
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com




Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace

2012-07-30 Thread Yoshihiro YUNOMAE

Hi Amit,

Sorry for the late reply.

(2012/07/27 18:43), Amit Shah wrote:

On (Fri) 27 Jul 2012 [17:55:11], Yoshihiro YUNOMAE wrote:

Hi Amit,

Thank you for commenting on our work.

(2012/07/26 20:35), Amit Shah wrote:

On (Tue) 24 Jul 2012 [11:36:57], Yoshihiro YUNOMAE wrote:




[...]



***Just enhancement ideas***
  - Support for trace-cmd
  - Support for 9pfs protocol
  - Support for non-blocking mode in QEMU


There were patches long back (by me) to make chardevs non-blocking but
they didn't make it upstream.  Fedora carries them, if you want to try
out.  Though we want to converge on a reasonable solution that's
acceptable upstream as well.  Just that no one's working on it
currently.  Any help here will be appreciated.


Thanks! In this case, a guest stops running while the host reads the
guest's trace data, so the char device needs a non-blocking mode.
I'll read your patch series. Is the latest version 8?
http://lists.gnu.org/archive/html/qemu-devel/2010-12/msg00035.html


I suppose the latest version on-list is what you quote above.  The
objections to the patch series are mentioned in Anthony's mails.


I'll check the mails.


Hans maintains a rebased version of the patches in his tree at

http://cgit.freedesktop.org/~jwrdegoede/qemu/

those patches are included in Fedora's qemu-kvm, so you can try that
out if it improves performance for you.


Thanks. I'll check those patches.


  - Make "vhost-serial"


I need to understand a) why it's perf-critical, and b) why should the
host be involved at all, to comment on these.


a) To reduce the collection overhead seen by applications on a guest.
(see above)
b) The host kernel's trace data is not involved, even with this
patch set.


I see, so you suggested vhost-serial only because you saw the guest
stopping problem due to the absence of non-blocking code?  If so, it
now makes sense.  I don't think we need vhost-serial in any way yet.


Understood. We suggested vhost-serial as one of the ideas for
improving performance. Other features (trace-cmd, 9pfs, and
non-blocking chardev) should be supported first, I think.


BTW where do you parse the trace data obtained from guests?  On a
remote host?


Ideally, this tracing system should let us parse the data on a remote
host. Existing trace-cmd can already parse trace data on a remote
site. If we add a feature for collecting event-format data (the guest's
debugfs has it) from guests, we can parse tracing data on a remote
host as well as on the host running the guests.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com





Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace

2012-07-27 Thread Yoshihiro YUNOMAE

Hi Amit,

Thank you for commenting on our work.

(2012/07/26 20:35), Amit Shah wrote:

On (Tue) 24 Jul 2012 [11:36:57], Yoshihiro YUNOMAE wrote:


[...]



Therefore, we propose a new system "virtio-trace", which uses enhanced
virtio-serial and existing ring-buffer of ftrace, for collecting guest kernel
tracing data. In this system, there are 5 main components:
  (1) Ring-buffer of ftrace in a guest
  - When trace agent reads ring-buffer, a page is removed from ring-buffer.
  (2) Trace agent in the guest
  - Splice the page of ring-buffer to read_pipe using splice() without
memory copying. Then, the page is spliced from write_pipe to virtio
without memory copying.


I really like the splicing idea.


Thanks. We will improve this patch set.


  (3) Virtio-console driver in the guest
  - Pass the page to virtio-ring
  (4) Virtio-serial bus in QEMU
  - Copy the page to kernel pipe
  (5) Reader in the host
  - Read guest tracing data via FIFO(named pipe)


So will this be useful only if guest and host run the same kernel?

I'd like to see the host kernel not being used at all -- collect all
relevant info from the guest and send it out to qemu, where it can be
consumed directly by apps driving the tracing.


No, this patch set is used only for guest kernels, so guest and host
don't need to run the same kernel.


***Evaluation***
When a host collects tracing data of a guest, the performance of using
virtio-trace is compared with that of using native(just running ftrace),
IVRing, and virtio-serial(normal method of read/write).


Why is tracing performance-sensitive?  i.e. why try to optimise this
at all?


To minimize the effect on guest applications when a host collects
tracing data of guests.
For example, assume that guests A and B run on a host and share an
I/O device. An I/O delay problem occurs in guest A, while guest B
still meets its requirements. In this case, we need to collect tracing
data of both guests A and B, but the usual network-based method puts a
high load on the applications of guest B even though guest B is running
normally. Therefore, we try to decrease the load on guests.
We also use this feature for performance analysis on production
virtualization systems.

[...]



***Just enhancement ideas***
  - Support for trace-cmd
  - Support for 9pfs protocol
  - Support for non-blocking mode in QEMU


There were patches long back (by me) to make chardevs non-blocking but
they didn't make it upstream.  Fedora carries them, if you want to try
out.  Though we want to converge on a reasonable solution that's
acceptable upstream as well.  Just that no one's working on it
currently.  Any help here will be appreciated.


Thanks! In this case, a guest stops running while the host reads the
guest's trace data, so the char device needs a non-blocking mode.
I'll read your patch series. Is the latest version 8?
http://lists.gnu.org/archive/html/qemu-devel/2010-12/msg00035.html


  - Make "vhost-serial"


I need to understand a) why it's perf-critical, and b) why should the
host be involved at all, to comment on these.


a) To reduce the collection overhead seen by applications on a guest.
   (see above)
b) The host kernel's trace data is not involved, even with this
   patch set.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com





Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace

2012-07-25 Thread Yoshihiro YUNOMAE

Hi Stefan,

(2012/07/24 22:41), Stefan Hajnoczi wrote:

On Tue, Jul 24, 2012 at 12:19 PM, Yoshihiro YUNOMAE
 wrote:

Are you using text formatted ftrace?

No, currently using raw format, but we'd like to reformat it in text.


Capturing the info necessary to translate numbers into symbols is one
of the problems of host<->guest tracing so I'm curious how you handle
this :).


Right, that is a valid concern.


Apologies for my lack of ftrace knowledge but how useful is the raw
tracing data on the host?  How do you pretty-print it in
human-readable form?


perf and trace-cmd can translate raw trace data to text by using
kernel information or the event formats under the tracing/events
directory in debugfs. In the same way, if that information of a guest
is exported to a host, we can translate the guest's raw trace data to
text on the host. We will use 9pfs to export it.
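To illustrate what "using the event formats" means: each event has a format file whose field lines give the offset and size of every member of the raw record. The following is only a rough sketch of reading one such line, not code from perf or trace-cmd. The struct and function names are made up for this mail, and the line layout ("field:...; offset:N; size:N; signed:N;") is assumed to match what the guest kernel exports.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for one field description from a format file. */
struct event_field {
	char decl[64];		/* C declaration, e.g. "unsigned short common_type" */
	unsigned int offset;	/* byte offset inside the raw record */
	unsigned int size;	/* size of the field in bytes */
	int is_signed;		/* 1 if the field is a signed type */
};

/*
 * Parse one "field:...; offset:N; size:N; signed:N;" line.
 * Returns 0 on success, -1 if the line is not a field description.
 */
static int parse_format_line(const char *line, struct event_field *f)
{
	int n;

	assert(line != NULL && f != NULL);

	/* Whitespace in the pattern also matches the tabs used in the file. */
	n = sscanf(line, " field:%63[^;]; offset:%u; size:%u; signed:%d;",
		   f->decl, &f->offset, &f->size, &f->is_signed);

	return n == 4 ? 0 : -1;
}
```

With the offset and size of each field known on the host, the raw bytes of a guest record can be cut up and printed without running anything in the guest.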

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com





Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace

2012-07-24 Thread Yoshihiro YUNOMAE

Hi Stefan,

Thank you for commenting on our patch set.

(2012/07/24 20:03), Masami Hiramatsu wrote:

(2012/07/24 19:02), Stefan Hajnoczi wrote:

On Tue, Jul 24, 2012 at 3:36 AM, Yoshihiro YUNOMAE
 wrote:

The performance of each method is compared as follows:
  [1] Native
  - only recording trace data to ring-buffer on a guest
  [2] Virtio-trace
  - running a trace agent on a guest
  - a reader on a host opens FIFO using cat command
  [3] IVRing
  - A SystemTap script in a guest records trace data to IVRing.
-- probe points are same as ftrace.
  [4] Virtio-serial(normal)
  - A reader(using cat) on a guest output trace data to a host using
standard output via virtio-serial.


The first time I read this I thought you were adding a new virtio-trace
device.  But it looks like this series really adds splice support to
virtio-console and that yields a big performance improvement when
sending trace_pipe_raw.


Yes, sorry for the confusion. Actually this is an enhancement of
virtio-serial. I'm working with Yoshihiro on this feature.


Guest ftrace is useful and I like this.  Have you thought about
controlling ftrace from the host?  Perhaps a command could be added to
the QEMU guest agent which basically invokes trace-cmd/perf.


As you can see, guest trace-agent can be controlled via a
control channel. In our scenario, host tools can control that
instead of guest one.

We are considering exporting the tracing part of the guest's
debugfs to the host via another virtio-serial channel by using
9pfs, so that the host tools can refer to it.

(In this scenario, guest trace-agent will also provide 9pfs server.
Since it means that the agent can handle writing a special file,
trace-agent can be controlled via the special file on exported
debugfs.)

Of course, this also requires modifying trace-cmd/perf to accept
some options like guest-debugfs mount point, guest's serial
channel pipe (or unix socket?), etc. However, it will be a small
change.

Thank you,



>> Are you using text formatted ftrace?
No, currently using raw format, but we'd like to reformat it in text.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com





[Qemu-devel] [RFC PATCH 5/6] virtio/console: Allocate scatterlist according to the current pipe size

2012-07-23 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Allocate the scatterlist according to the current pipe size.
This allows splicing a bigger buffer if the pipe size has
been changed with fcntl().

Signed-off-by: Masami Hiramatsu 
Cc: Amit Shah 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
---
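For reference, the pipe size change mentioned above is done from user space with fcntl(F_SETPIPE_SZ). The following is a minimal sketch and not part of this patch. The helper name grow_pipe is made up for illustration, and the request can still be refused by the system-wide limit in /proc/sys/fs/pipe-max-size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to resize the pipe behind fd so that it can hold at
 * least nr_pages pages. Returns the capacity actually granted in
 * bytes, or -1 with errno set if the request was refused.
 */
static long grow_pipe(int fd, unsigned int nr_pages)
{
	long page_size = sysconf(_SC_PAGESIZE);

	assert(nr_pages > 0);

	if (page_size < 0)
		return -1;
	if (fcntl(fd, F_SETPIPE_SZ, (int)(nr_pages * page_size)) < 0)
		return -1;

	/* The kernel may round the size up, so read back the real value. */
	return fcntl(fd, F_GETPIPE_SZ);
}
```

A splicing program would call this once on the pipe it splices through, before the first splice().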

 drivers/char/virtio_console.c |   23 ---
 1 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index e49d435..f5063d5 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -229,7 +229,6 @@ struct port {
bool guest_connected;
 };
 
-#define MAX_SPLICE_PAGES   32
 /* This is the very early arch-specified put chars function. */
 static int (*early_put_chars)(u32, const char *, int);
 
@@ -482,15 +481,16 @@ struct buffer_token {
void *buf;
struct scatterlist *sg;
} u;
-   bool sgpages;
+   /* If sgpages == 0 then buf is used, else sg is used */
+   unsigned int sgpages;
 };
 
-static void reclaim_sg_pages(struct scatterlist *sg)
+static void reclaim_sg_pages(struct scatterlist *sg, unsigned int nrpages)
 {
int i;
struct page *page;
 
-   for (i = 0; i < MAX_SPLICE_PAGES; i++) {
+   for (i = 0; i < nrpages; i++) {
page = sg_page(&sg[i]);
if (!page)
break;
@@ -511,7 +511,7 @@ static void reclaim_consumed_buffers(struct port *port)
}
while ((tok = virtqueue_get_buf(port->out_vq, &len))) {
if (tok->sgpages)
-   reclaim_sg_pages(tok->u.sg);
+   reclaim_sg_pages(tok->u.sg, tok->sgpages);
else
kfree(tok->u.buf);
kfree(tok);
@@ -581,7 +581,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
if (!tok)
return -ENOMEM;
-   tok->sgpages = false;
+   tok->sgpages = 0;
tok->u.buf = in_buf;
 
sg_init_one(sg, in_buf, in_count);
@@ -597,7 +597,7 @@ static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents,
tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
if (!tok)
return -ENOMEM;
-   tok->sgpages = true;
+   tok->sgpages = nents;
tok->u.sg = sg;
 
return __send_to_port(port, sg, nents, in_count, tok, nonblock);
@@ -797,6 +797,7 @@ out:
 
 struct sg_list {
unsigned int n;
+   unsigned int size;
size_t len;
struct scatterlist *sg;
 };
@@ -807,7 +808,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct sg_list *sgl = sd->u.data;
unsigned int offset, len;
 
-   if (sgl->n == MAX_SPLICE_PAGES)
+   if (sgl->n == sgl->size)
return 0;
 
/* Try lock this page */
@@ -868,12 +869,12 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
 
sgl.n = 0;
sgl.len = 0;
-   sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,
-GFP_ATOMIC);
+   sgl.size = pipe->nrbufs;
+   sgl.sg = kmalloc(sizeof(struct scatterlist) * sgl.size, GFP_ATOMIC);
if (unlikely(!sgl.sg))
return -ENOMEM;
 
-   sg_init_table(sgl.sg, MAX_SPLICE_PAGES);
+   sg_init_table(sgl.sg, sgl.size);
ret = __splice_from_pipe(pipe, &sd, pipe_to_sg);
if (likely(ret > 0))
ret = send_pages(port, sgl.sg, sgl.n, sgl.len, true);





[Qemu-devel] [RFC PATCH 4/6] ftrace: Allow stealing pages from pipe buffer

2012-07-23 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Use generic steal operation on pipe buffer to allow stealing
ring buffer's read page from pipe buffer.

Note that without an affinity setting this could reduce the
performance of the splice_write side of a splice.
The ring buffer's read pages are allocated on the tracing node,
but the splice user does not always execute the splice write
side on the same node. In that case, the page is accessed
from another node.
Thus, it is strongly recommended to assign the splicing
thread to the corresponding node.

Signed-off-by: Masami Hiramatsu 
Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
---
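As a hint for users of this path, the recommended affinity setting can be done with sched_setaffinity(). The following is a minimal sketch and not part of this patch. The helper name pin_self_to_cpu is made up here, and it pins to a single CPU as a simple way to stay on the node of that CPU's ring buffer.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Restrict the calling thread to one CPU so that it runs on the same
 * CPU (and therefore the same node) as the per-CPU ring buffer it
 * splices from. Returns 0 on success, -1 with errno set otherwise.
 */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	/* A pid of 0 means "the calling thread". */
	return sched_setaffinity(0, sizeof(set), &set);
}
```

Each splicing thread would call this once, with the number of the CPU whose trace buffer it reads, before entering its splice loop.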

 kernel/trace/trace.c |8 +---
 1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a120f98..ae01930 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4194,12 +4194,6 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
buf->private = 0;
 }
 
-static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe,
-struct pipe_buffer *buf)
-{
-   return 1;
-}
-
 static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
 {
@@ -4215,7 +4209,7 @@ static const struct pipe_buf_operations buffer_pipe_buf_ops = {
.unmap  = generic_pipe_buf_unmap,
.confirm= generic_pipe_buf_confirm,
.release= buffer_pipe_buf_release,
-   .steal  = buffer_pipe_buf_steal,
+   .steal  = generic_pipe_buf_steal,
.get= buffer_pipe_buf_get,
 };
 





[Qemu-devel] [RFC PATCH 6/6] tools: Add guest trace agent as a user tool

2012-07-23 Thread Yoshihiro YUNOMAE
This patch adds a user tool, "trace agent", for sending trace data of a guest to
a host with low overhead. This agent has the following functions:
 - splice a page of ring-buffer to read_pipe without memory copying
 - splice the page from write_pipe to virtio-console without memory copying
 - write trace data to stdout by using -o option
 - controlled by start/stop orders from a Host

Signed-off-by: Yoshihiro YUNOMAE 
---

 tools/virtio/virtio-trace/Makefile  |   14 +
 tools/virtio/virtio-trace/README|  118 
 tools/virtio/virtio-trace/trace-agent-ctl.c |  137 ++
 tools/virtio/virtio-trace/trace-agent-rw.c  |  192 +++
 tools/virtio/virtio-trace/trace-agent.c |  270 +++
 tools/virtio/virtio-trace/trace-agent.h |   75 
 6 files changed, 806 insertions(+), 0 deletions(-)
 create mode 100644 tools/virtio/virtio-trace/Makefile
 create mode 100644 tools/virtio/virtio-trace/README
 create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.h

diff --git a/tools/virtio/virtio-trace/Makefile b/tools/virtio/virtio-trace/Makefile
new file mode 100644
index 000..ef3adfc
--- /dev/null
+++ b/tools/virtio/virtio-trace/Makefile
@@ -0,0 +1,14 @@
+CC = gcc
+CFLAGS = -O2 -Wall
+LFLAG = -lpthread
+
+all: trace-agent
+
+.c.o:
+   $(CC) $(CFLAGS) $(LFLAG) -c $^ -o $@
+
+trace-agent: trace-agent.o trace-agent-ctl.o trace-agent-rw.o
+   $(CC) $(CFLAGS) $(LFLAG) -o $@ $^
+
+clean:
+   rm -f *.o trace-agent
diff --git a/tools/virtio/virtio-trace/README b/tools/virtio/virtio-trace/README
new file mode 100644
index 000..b64845b
--- /dev/null
+++ b/tools/virtio/virtio-trace/README
@@ -0,0 +1,118 @@
+Trace Agent for virtio-trace
+
+
+Trace agent is a user tool for sending trace data of a guest to a host with low
+overhead. Trace agent has the following functions:
+ - splice a page of ring-buffer to read_pipe without memory copying
+ - splice the page from write_pipe to virtio-console without memory copying
+ - write trace data to stdout by using -o option
+ - controlled by start/stop orders from a Host
+
+The trace agent operates as follows:
+ 1) Initialize all structures.
+ 2) Create a read/write thread per CPU. Each thread is bound to a CPU.
+The read/write threads hold it.
+ 3) A controller thread does poll() for a start order of a host.
+ 4) After the controller of the trace agent receives a start order from a host,
+the controller wakes read/write threads.
+ 5) The read/write threads start to read trace data from ring-buffers and
+write the data to virtio-serial.
+ 6) If the controller receives a stop order from a host, the read/write threads
+stop reading trace data.
+
+
+Files
+=
+
+README: this file
+Makefile: Makefile of trace agent for virtio-trace
+trace-agent.c: includes main function, sets up for operating trace agent
+trace-agent.h: includes all structures and some macros
+trace-agent-ctl.c: includes controller function for read/write threads
+trace-agent-rw.c: includes read/write threads function
+
+
+Setup
+=
+
+To use this trace agent for virtio-trace, we need to prepare some virtio-serial
+I/Fs.
+
+1) Make FIFO in a host
+ virtio-trace uses one virtio-serial pipe per CPU as trace data paths, plus one
+control path, so FIFOs (named pipes) should be created as follows:
+   # mkdir /tmp/virtio-trace/
+   # mkfifo /tmp/virtio-trace/trace-path-cpu{0,1,2,...,X}.{in,out}
+   # mkfifo /tmp/virtio-trace/agent-ctl-path.{in,out}
+
+For example, if a guest uses three CPUs, the names are
+   trace-path-cpu{0,1,2}.{in,out}
+and
+   agent-ctl-path.{in,out}.
+
+2) Set up of virtio-serial pipe in a host
+ Add qemu option to use virtio-serial pipe.
+
+ ##virtio-serial device##
+ -device virtio-serial-pci,id=virtio-serial0\
+ ##control path##
+ -chardev pipe,id=charchannel0,path=/tmp/virtio-trace/agent-ctl-path\
+ -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,\
+  id=channel0,name=agent-ctl-path\
+ ##data path##
+ -chardev pipe,id=charchannel1,path=/tmp/virtio-trace/trace-path-cpu0\
+ -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,\
+  id=channel1,name=trace-path-cpu0\
+  ...
+
+If you manage guests with libvirt, add the following tags to domain XML files.
+Then, libvirt passes the same command option to qemu.
+
+   
+  
+  
+  
+   
+   
+  
+  
+  
+   
+   ...
+Here, chardev names are restricted to trace-path-cpuX and agent-ctl-path. For
+example, if a guest uses three CPUs, chardev names should be trace-path-cpu0,
+trace-path-cpu1, trace-path-cpu2, and agent-ctl-path.
+
+3) Boot the guest
+ You can find some 

[Qemu-devel] [RFC PATCH 3/6] virtio/console: Wait until the port is ready on splice

2012-07-23 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

On splice, wait if the port is not connected or is full,
as write already does.

Signed-off-by: Masami Hiramatsu 
Cc: Amit Shah 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
---
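For a user of a port opened with O_NONBLOCK, this change means splice() can now fail with EAGAIN just as write() does. The following is a minimal sketch of how a caller might handle that, not code from this series. The helper name splice_when_writable is made up here, and it assumes that pipe_rd already holds data and that out_fd supports poll() for POLLOUT.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Splice from a pipe to out_fd, which is assumed to be opened with
 * O_NONBLOCK. When the write side reports EAGAIN, sleep in poll()
 * until out_fd becomes writable and then try again.
 * Returns the number of bytes moved, or -1 with errno set.
 */
static ssize_t splice_when_writable(int pipe_rd, int out_fd, size_t len)
{
	struct pollfd pfd = { .fd = out_fd, .events = POLLOUT };
	ssize_t n;

	assert(len > 0);

	for (;;) {
		n = splice(pipe_rd, NULL, out_fd, NULL, len, SPLICE_F_MOVE);
		if (n >= 0 || errno != EAGAIN)
			return n;
		/* The port is full or not connected yet: wait for POLLOUT. */
		if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
			return -1;
	}
}
```

A blocking caller does not need this loop, because the kernel side then waits inside the splice call itself.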

 drivers/char/virtio_console.c |   39 +++
 1 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 911cb3e..e49d435 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -724,6 +724,26 @@ static ssize_t port_fops_read(struct file *filp, char __user *ubuf,
return fill_readbuf(port, ubuf, count, true);
 }
 
+static int wait_port_writable(struct port *port, bool nonblock)
+{
+   int ret;
+
+   if (will_write_block(port)) {
+   if (nonblock)
+   return -EAGAIN;
+
+   ret = wait_event_freezable(port->waitqueue,
+  !will_write_block(port));
+   if (ret < 0)
+   return ret;
+   }
+   /* Port got hot-unplugged. */
+   if (!port->guest_connected)
+   return -ENODEV;
+
+   return 0;
+}
+
 static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
   size_t count, loff_t *offp)
 {
@@ -740,18 +760,9 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 
nonblock = filp->f_flags & O_NONBLOCK;
 
-   if (will_write_block(port)) {
-   if (nonblock)
-   return -EAGAIN;
-
-   ret = wait_event_freezable(port->waitqueue,
-  !will_write_block(port));
-   if (ret < 0)
-   return ret;
-   }
-   /* Port got hot-unplugged. */
-   if (!port->guest_connected)
-   return -ENODEV;
+   ret = wait_port_writable(port, nonblock);
+   if (ret < 0)
+   return ret;
 
count = min((size_t)(32 * 1024), count);
 
@@ -851,6 +862,10 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
.u.data = &sgl,
};
 
+   ret = wait_port_writable(port, filp->f_flags & O_NONBLOCK);
+   if (ret < 0)
+   return ret;
+
sgl.n = 0;
sgl.len = 0;
sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,





[Qemu-devel] [RFC PATCH 1/6] virtio/console: Add splice_write support

2012-07-23 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Enable splice_write from a pipe to a virtio-console port.
This steals pages from the pipe and sends them directly to the host.

Note that this may accelerate only the guest to host path.

Signed-off-by: Masami Hiramatsu 
Cc: Amit Shah 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
---
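For readers who want to see the user-space side: with this patch a guest program can hand pipe contents to a port with splice(2). The following is a minimal sketch and not code from this series. The helper name splice_pipe_to_fd is made up here, and out_fd is assumed to be an already opened virtio-console port.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move up to len bytes from the read end of a pipe to out_fd without
 * copying them through a user-space buffer. Returns the number of
 * bytes moved, 0 if the writer closed the pipe, or -1 with errno set.
 */
static ssize_t splice_pipe_to_fd(int pipe_rd, int out_fd, size_t len)
{
	ssize_t n;

	assert(len > 0);

	/* Retry only when a signal interrupted the call. */
	do {
		n = splice(pipe_rd, NULL, out_fd, NULL, len, SPLICE_F_MOVE);
	} while (n < 0 && errno == EINTR);

	return n;
}
```

In the trace agent, pipe_rd would be the read end of the pipe that was filled from the ftrace ring-buffer by an earlier splice().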

 drivers/char/virtio_console.c |  136 +++--
 1 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index cdf2f54..fe31b2f 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -24,6 +24,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -227,6 +229,7 @@ struct port {
bool guest_connected;
 };
 
+#define MAX_SPLICE_PAGES   32
 /* This is the very early arch-specified put chars function. */
 static int (*early_put_chars)(u32, const char *, int);
 
@@ -474,26 +477,52 @@ static ssize_t send_control_msg(struct port *port, unsigned int event,
return 0;
 }
 
+struct buffer_token {
+   union {
+   void *buf;
+   struct scatterlist *sg;
+   } u;
+   bool sgpages;
+};
+
+static void reclaim_sg_pages(struct scatterlist *sg)
+{
+   int i;
+   struct page *page;
+
+   for (i = 0; i < MAX_SPLICE_PAGES; i++) {
+   page = sg_page(&sg[i]);
+   if (!page)
+   break;
+   put_page(page);
+   }
+   kfree(sg);
+}
+
 /* Callers must take the port->outvq_lock */
 static void reclaim_consumed_buffers(struct port *port)
 {
-   void *buf;
+   struct buffer_token *tok;
unsigned int len;
 
if (!port->portdev) {
/* Device has been unplugged.  vqs are already gone. */
return;
}
-   while ((buf = virtqueue_get_buf(port->out_vq, &len))) {
-   kfree(buf);
+   while ((tok = virtqueue_get_buf(port->out_vq, &len))) {
+   if (tok->sgpages)
+   reclaim_sg_pages(tok->u.sg);
+   else
+   kfree(tok->u.buf);
+   kfree(tok);
port->outvq_full = false;
}
 }
 
-static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
-   bool nonblock)
+static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
+ int nents, size_t in_count,
+ struct buffer_token *tok, bool nonblock)
 {
-   struct scatterlist sg[1];
struct virtqueue *out_vq;
ssize_t ret;
unsigned long flags;
@@ -505,8 +534,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
 
reclaim_consumed_buffers(port);
 
-   sg_init_one(sg, in_buf, in_count);
-   ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC);
+   ret = virtqueue_add_buf(out_vq, sg, nents, 0, tok, GFP_ATOMIC);
 
/* Tell Host to go! */
virtqueue_kick(out_vq);
@@ -544,6 +572,37 @@ done:
return in_count;
 }
 
+static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
+   bool nonblock)
+{
+   struct scatterlist sg[1];
+   struct buffer_token *tok;
+
+   tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
+   if (!tok)
+   return -ENOMEM;
+   tok->sgpages = false;
+   tok->u.buf = in_buf;
+
+   sg_init_one(sg, in_buf, in_count);
+
+   return __send_to_port(port, sg, 1, in_count, tok, nonblock);
+}
+
+static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents,
+ size_t in_count, bool nonblock)
+{
+   struct buffer_token *tok;
+
+   tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
+   if (!tok)
+   return -ENOMEM;
+   tok->sgpages = true;
+   tok->u.sg = sg;
+
+   return __send_to_port(port, sg, nents, in_count, tok, nonblock);
+}
+
 /*
  * Give out the data that's requested from the buffer that we have
  * queued up.
@@ -725,6 +784,66 @@ out:
return ret;
 }
 
+struct sg_list {
+   unsigned int n;
+   size_t len;
+   struct scatterlist *sg;
+};
+
+static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+   struct splice_desc *sd)
+{
+   struct sg_list *sgl = sd->u.data;
+   unsigned int len = 0;
+
+   if (sgl->n == MAX_SPLICE_PAGES)
+   return 0;
+
+   /* Try lock this page */
+   if (buf->ops->steal(pipe, buf) == 0) {
+   /* Get reference and unlock page for moving */
+   get_page(buf->page);
+   unlock_page(buf->page);
+
+   len = min(buf->len, sd->len);
+   sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset);
+   sgl->n++;
+   sgl->len += len;
+   }
+
+   return len;
+}
+
+/* Faster zero-copy write by splicing */
+static

[Qemu-devel] [RFC PATCH 2/6] virtio/console: Add a failback for unstealable pipe buffer

2012-07-23 Thread Yoshihiro YUNOMAE
From: Masami Hiramatsu 

Add a fallback memcpy path for an unstealable pipe buffer.
If buf->ops->steal() fails, virtio-serial tries to
copy the page contents to a newly allocated page, instead
of just failing splice().

Signed-off-by: Masami Hiramatsu 
Cc: Amit Shah 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
---

 drivers/char/virtio_console.c |   28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index fe31b2f..911cb3e 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -794,7 +794,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct splice_desc *sd)
 {
struct sg_list *sgl = sd->u.data;
-   unsigned int len = 0;
+   unsigned int offset, len;
 
if (sgl->n == MAX_SPLICE_PAGES)
return 0;
@@ -807,9 +807,31 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 
len = min(buf->len, sd->len);
sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset);
-   sgl->n++;
-   sgl->len += len;
+   } else {
+   /* Failback to copying a page */
+   struct page *page = alloc_page(GFP_KERNEL);
+   char *src = buf->ops->map(pipe, buf, 1);
+   char *dst;
+
+   if (!page)
+   return -ENOMEM;
+   dst = kmap(page);
+
+   offset = sd->pos & ~PAGE_MASK;
+
+   len = sd->len;
+   if (len + offset > PAGE_SIZE)
+   len = PAGE_SIZE - offset;
+
+   memcpy(dst + offset, src + buf->offset, len);
+
+   kunmap(page);
+   buf->ops->unmap(pipe, buf, src);
+
+   sg_set_page(&(sgl->sg[sgl->n]), page, len, offset);
}
+   sgl->n++;
+   sgl->len += len;
 
return len;
 }
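
For readers following the fallback path, the offset and length clamping in the hunk above can be sketched in plain userspace C. This is only an illustration under stated assumptions: SKETCH_PAGE_SIZE, struct copy_span and clamp_to_page() are hypothetical names that do not appear in the patch, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/* Hypothetical helper type: where the copy starts in the page and how long it is. */
struct copy_span {
	size_t offset;
	size_t len;
};

/*
 * Mirror of the clamping done in the fallback branch:
 *   offset = pos & ~PAGE_MASK;
 *   if (len + offset > PAGE_SIZE) len = PAGE_SIZE - offset;
 * so that the copy never crosses the end of the destination page.
 */
static struct copy_span clamp_to_page(size_t pos, size_t len)
{
	struct copy_span s;

	s.offset = pos & (SKETCH_PAGE_SIZE - 1);
	s.len = len;
	if (s.len + s.offset > SKETCH_PAGE_SIZE)
		s.len = SKETCH_PAGE_SIZE - s.offset;

	assert(s.offset + s.len <= SKETCH_PAGE_SIZE);
	return s;
}
```

A request that would run past the page end is shortened, so the caller is expected to come back for the remainder.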





[Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace

2012-07-23 Thread Yoshihiro YUNOMAE
Hi All,

The following patch set provides a low-overhead system that lets a host collect
kernel tracing data from its guests in a virtualization environment.

A guest OS generally shares some devices with other guests or with the host, so
a problem observed in one guest may be caused by another guest or by the host.
When such a problem occurs, tracing data from several guests and from the host
need to be collected together. One way to do this is to collect the guests'
tracing data on the host, and the network is generally used for that. However,
because the network stack has many layers, this puts a high load on guest
applications that use network I/O. Therefore, a communication method that
collects the data without using the network is needed.

In June we submitted a patch set for "IVRing", a ring-buffer driver built on
Inter-VM shared memory (IVShmem), to LKML: http://lwn.net/Articles/500304/
IVRing and the IVRing reader communicate through POSIX shared memory instead of
the network, which gives a low-overhead system for collecting guest tracing
data. However, that patch set has the following problems:
 - it uses IVShmem instead of virtio
 - it creates a new ring-buffer instead of using an existing kernel ring-buffer
 - scalability
   -- no support for SMP environments
   -- buffer size limitation
   -- no support for live migration (this may be difficult to realize)

Therefore, we propose a new system, "virtio-trace", which uses an enhanced
virtio-serial and the existing ring-buffer of ftrace to collect guest kernel
tracing data. This system has 5 main components:
 (1) Ring-buffer of ftrace in a guest
 - When the trace agent reads the ring-buffer, a page is removed from it.
 (2) Trace agent in the guest
 - Splices the ring-buffer page to read_pipe using splice(), without
   memory copying. The page is then spliced from write_pipe to virtio,
   again without memory copying.
 (3) Virtio-console driver in the guest
 - Passes the page to virtio-ring
 (4) Virtio-serial bus in QEMU
 - Copies the page to a kernel pipe
 (5) Reader in the host
 - Reads guest tracing data via a FIFO (named pipe)
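
The two splice() steps of component (2) can be sketched in userspace C as below. This is a minimal sketch, not the trace agent from this series: forward_once() and its parameters are hypothetical names, the caller is assumed to have created the intermediate pipe, and in the real setup in_fd would be a per-CPU ring-buffer file and out_fd a virtio-serial port.

```c
#define _GNU_SOURCE		/* needed for splice() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Forward up to len bytes from in_fd to out_fd through an existing pipe,
 * without copying the data through a userspace buffer.
 * Returns the number of bytes forwarded, 0 on end of input, or -1 on error.
 */
static ssize_t forward_once(int in_fd, int pipefd[2], int out_fd, size_t len)
{
	ssize_t in, out, done = 0;

	assert(len > 0);

	/* Step 1: source -> write end of the pipe */
	in = splice(in_fd, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
	if (in <= 0)
		return in;

	/* Step 2: read end of the pipe -> destination, retrying short splices */
	while (done < in) {
		out = splice(pipefd[0], NULL, out_fd, NULL, in - done,
			     SPLICE_F_MOVE);
		if (out < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (out == 0)
			break;
		done += out;
	}
	return done;
}
```

An agent would call this in a loop, one page at a time, until it is told to stop.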

***Evaluation***
When a host collects tracing data of a guest, the performance of
virtio-trace is compared with that of native (just running ftrace),
IVRing, and virtio-serial (the normal read/write method).


The overview of this evaluation is as follows:
 (a) A guest on KVM is prepared.
 - One physical CPU is dedicated to the guest as its virtual CPU (VCPU).

 (b) The guest starts to write tracing data to the ring-buffer of ftrace.
 - The probe points are all trace points of sched, timer, and kmem.

 (c) While trace data is being written, dhrystone 2 in UNIX bench is executed
 as a benchmark tool in the guest.
 - Dhrystone 2 measures system performance by repeating integer arithmetic
   and reports it as a score.
 - A higher score means better system performance, so a score lower than
   in the bare environment indicates that some operation disturbs the
   integer arithmetic. We therefore define the overhead of transporting
   trace data as follows:
OVERHEAD = (1 - SCORE_OF_A_METHOD/NATIVE_SCORE) * 100.

The performance of each method is compared as follows:
 [0] Native
 - only recording trace data to the ring-buffer on a guest
 [1] Virtio-trace
 - running a trace agent on a guest
 - a reader on a host opens the FIFO using the cat command
 [2] IVRing
 - A SystemTap script in a guest records trace data to IVRing.
   -- probe points are the same as for ftrace.
 [3] Virtio-serial (normal)
 - A reader (using cat) on a guest outputs trace data to a host using
   standard output via virtio-serial.

Other information is as follows:
 - host
   kernel: 3.3.7-1 (Fedora16)
   CPU: Intel Xeon x5660@2.80GHz(12core)
   Memory: 48GB

 - guest(only booting one guest)
   kernel: 3.5.0-rc4+ (Fedora16)
   CPU: 1VCPU(dedicated)
   Memory: 1GB


Results for the 3 methods, relative to the bare (native) environment, were as follows:
                     Scores        Overhead against [0] Native
[0] Native:          28807569.5    -
[1] Virtio-trace:    28685049.5    0.43%
[2] IVRing:          28418595.5    1.35%
[3] Virtio-serial:   13262258.7    53.96%
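
The overhead column follows directly from the OVERHEAD formula given in the evaluation overview. As a small sketch (overhead_percent() is a hypothetical helper name, not part of the patch set), it can be recomputed from the scores like this:

```c
#include <assert.h>

/* OVERHEAD = (1 - SCORE_OF_A_METHOD / NATIVE_SCORE) * 100, in percent. */
static double overhead_percent(double method_score, double native_score)
{
	assert(native_score > 0.0);
	return (1.0 - method_score / native_score) * 100.0;
}
```

For example, overhead_percent(28685049.5, 28807569.5) gives the virtio-trace figure, which rounds to the 0.43% shown in the table.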


***Just enhancement ideas***
 - Support for trace-cmd
 - Support for 9pfs protocol
 - Support for non-blocking mode in QEMU
 - Make "vhost-serial"

Thank you,

---

Masami Hiramatsu (5):
  virtio/console: Allocate scatterlist according to the current pipe size
  ftrace: Allow stealing pages from pipe buffer
  virtio/console: Wait until the port is ready on splice
  virtio/console: Add a failback for unstealable pipe buffer
  virtio/console: Add splice_write support

Yoshihiro YUNOMAE (1):
  tools: Add guest trace agent as a user tool


 drivers/char/virtio_console.c   |  198 

[Qemu-devel] [RFC PATCH 1/2] ivring: Add a ring-buffer driver on IVShmem

2012-06-05 Thread Yoshihiro YUNOMAE
This patch adds a ring-buffer driver for the IVShmem device, a virtual RAM
device in QEMU. The driver can be used as a ring-buffer for kernel logging or
tracing of a guest OS, with data recorded from kernel code or from SystemTap.

The ring-buffer implementation is very simple. The first 4kB of the shared
memory region is the control structure of the ring-buffer. It stores the values
used to manage the ring-buffer, such as the bits and mask of the whole memory
size, the writing position, and the threshold for notifying a reader on the
host OS. The reader uses this region to find the writing position. The usable
region for recording data is therefore "total memory size - 4kB".
The driver records data from the start to the end of this writable
memory region.
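
The layout described above can be sketched in userspace C as follows. This is an illustration under assumptions, not the layout from ivring.h: struct ring_ctrl, its field names, CTRL_SIZE, ring_data_size() and ring_write() are all hypothetical, and the sketch assumes writes simply wrap to the start of the data area when they reach its end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CTRL_SIZE 4096u	/* the first 4kB holds the control structure */

/* Hypothetical control header; field names are illustrative only. */
struct ring_ctrl {
	uint32_t total_bits;	/* log2 of the whole shared-memory size */
	uint32_t total_mask;	/* (1 << total_bits) - 1 */
	uint32_t pos;		/* next write offset inside the data area */
	uint32_t threshold;	/* bytes written before notifying the reader */
};

/* Usable size for recording: total memory size - 4kB. */
static size_t ring_data_size(const struct ring_ctrl *c)
{
	return ((size_t)1 << c->total_bits) - CTRL_SIZE;
}

/* Copy len bytes into the data area, wrapping at its end. */
static void ring_write(struct ring_ctrl *c, uint8_t *data,
		       const void *src, size_t len)
{
	size_t size = ring_data_size(c);
	size_t first;

	assert(len <= size);

	first = size - c->pos;		/* room left before the wrap point */
	if (first > len)
		first = len;
	memcpy(data + c->pos, src, first);
	memcpy(data, (const uint8_t *)src + first, len - first);
	c->pos = (uint32_t)((c->pos + len) % size);
}
```

A reader on the host would only need the pos field from the control area to know how far the writer has advanced.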

When the amount written exceeds the threshold, the driver can notify a reader
to read the data by using writel(). As of this patch, the reader does not have
any function for receiving the notification. This notification feature will be
used in the near future.

When a writer records data in the ring-buffer, a spinlock is used to keep
writers on multiple CPUs from competing. To avoid the spinlock, a lockless
ring-buffer like that of ftrace, with one ring-buffer per CPU, will be
implemented in the near future.

Signed-off-by: Yoshihiro YUNOMAE 
Signed-off-by: Masami Hiramatsu 
Signed-off-by: Akihiro Nagai 
Cc: Greg Kroah-Hartman 
Cc: Ohad Ben-Cohen 
Cc: Linus Walleij 
Cc: MyungJoo Ham 
Cc: Rusty Russell 
Cc: Joerg Roedel 
Cc: Grant Likely 
Cc: linux-ker...@vger.kernel.org
Cc: Cam Macdonell 
Cc: qemu-devel@nongnu.org
Cc: system...@sourceware.org
---

 drivers/Kconfig  |1 
 drivers/Makefile |1 
 drivers/ivshmem/Kconfig  |9 +
 drivers/ivshmem/Makefile |5 
 drivers/ivshmem/ivring.c |  551 ++
 drivers/ivshmem/ivring.h |   77 ++
 6 files changed, 644 insertions(+), 0 deletions(-)
 create mode 100644 drivers/ivshmem/Kconfig
 create mode 100644 drivers/ivshmem/Makefile
 create mode 100644 drivers/ivshmem/ivring.c
 create mode 100644 drivers/ivshmem/ivring.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index bfc9186..e01adcd 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -148,4 +148,5 @@ source "drivers/iio/Kconfig"
 
 source "drivers/vme/Kconfig"
 
+source "drivers/ivshmem/Kconfig"
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 2ba29ff..1ebdd03 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -23,6 +23,7 @@ obj-y += amba/
 # really early.
 obj-$(CONFIG_DMA_ENGINE)   += dma/
 
+obj-$(CONFIG_IVRING_MANAGER)   += ivshmem/
 obj-$(CONFIG_VIRTIO)   += virtio/
 obj-$(CONFIG_XEN)  += xen/
 
diff --git a/drivers/ivshmem/Kconfig b/drivers/ivshmem/Kconfig
new file mode 100644
index 000..e84364a
--- /dev/null
+++ b/drivers/ivshmem/Kconfig
@@ -0,0 +1,9 @@
+#
+# IVShmem support drivers
+#
+
+config IVRING_MANAGER
+   tristate "IVRing management driver"
+   help
+ This allows IVShmem, a virtual PCI RAM device in QEMU, to be used as a
+ ring-buffer for tracing a guest.
diff --git a/drivers/ivshmem/Makefile b/drivers/ivshmem/Makefile
new file mode 100644
index 000..e725f8c
--- /dev/null
+++ b/drivers/ivshmem/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for IVShmem drivers
+#
+
+obj-$(CONFIG_IVRING_MANAGER)   += ivring.o
diff --git a/drivers/ivshmem/ivring.c b/drivers/ivshmem/ivring.c
new file mode 100644
index 000..5cbcfb6
--- /dev/null
+++ b/drivers/ivshmem/ivring.c
@@ -0,0 +1,551 @@
+/*
+ * Ring buffer on IVShmem Driver
+ *
+ * (C) 2012 Hitachi, Ltd.
+ * Written by Hitachi Yokohama Research Laboratory.
+ *
+ * Created by Masami Hiramatsu 
+ *            Akihiro Nagai 
+ *            Yoshihiro Yunomae 
+ * based on UIOIVShmem Driver, http://www.gitorious.org/nahanni/guest-code,
+ *   (C) 2009 Cam Macdonell 
+ * based on Hilscher CIF card driver (C) 2007 Hans J. Koch 
+ *
+ * Licensed under GPL version 2 only.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "./ivring.h"
+
+
+#define IVSHM_OFFS_INTRMASK    0
+#define IVSHM_OFFS_INTRSTATUS  4
+#define IVSHM_OFFS_IVPOSITION  8
+#define IVSHM_OFFS_DOORBELL    12
+
+#define MSIX_NAMEBUF_SIZE  128
+#define DEFAULT_NR_VECTORS 4
+
+#define IVRING_DEVNAME "ivring"
+
+struct ivring_mem {
+   unsigned long   addr;
+   unsigned long   size;
+   void __iomem    *ioaddr;
+};
+
+struct ivring_info {
+   struct pci_dev      *dev;
+   int                 irq;
+   struct ivring_mem   mem[2]; /* 0:control, 1:shmem */
+   struct msix_entry   *msix_entries;
+   char                (*msix_names)[MSIX_NAMEBUF_SIZE];
+   int                 nvectors;
+   int                 posn;
+   struct ivring_hdr   *hdr;
+};
+
+#

Re: [Qemu-devel] [RFC PATCH 1/2] ivring: Add a ring-buffer driver on IVShmem

2012-06-05 Thread Yoshihiro YUNOMAE

(2012/06/05 22:10), Borislav Petkov wrote:

On Tue, Jun 05, 2012 at 10:01:17PM +0900, Yoshihiro YUNOMAE wrote:

This patch adds a ring-buffer driver for IVShmem device, a virtual RAM device in
QEMU. This driver can be used as a ring-buffer for kernel logging or tracing of
a guest OS by recording kernel programing or SystemTap.

This ring-buffer driver is implemented very simple. First 4kB of shared memory
region is control structure of a ring-buffer. In this region, some values for
managing the ring-buffer is stored such as bits and mask of whole memory size,
writing position, threshold value for notification to a reader on a host OS.
This region is used by the reader to know writing position. Then, "total
memory size - 4kB" equals to usable memory region for recording data.
This ring-buffer driver records any data from start to end of the writable
memory region.

When writing size exceeds a threshold value, this driver can notify a reader
to read data by using writel(). As this later patch, reader does not have any
function for receiving the notification. This notification feature will be used
near the future.

As a writer records data in this ring-buffer, spinlock function is used to
avoid competing by some writers in multi CPU environment. Not to use spinlock,
lockless ring-buffer like as ftrace and one ring-buffer one CPU will be
implemented near the future.


Yet another ring buffer?


Yes, unfortunately...


We already have an ftrace and perf ring buffer, can't you use one of those?


No, because those do not support allocating the buffer
from a PCI memory device, nor passing the control structure
over it.

However, we do understand your point.

This series is just an RFC, and we'd like to ask who is
interested in guest tracing and how it should be
implemented. One option is:

 - no new ring buffer; enhance the perf/ftrace ring buffer to
   enable allocating buffers on shared memory.

Other comments are welcome.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com





[Qemu-devel] [RFC PATCH 2/2] ivring: Add a ring-buffer reader tool

2012-06-05 Thread Yoshihiro YUNOMAE
This patch adds a reader tool for IVRing. The tool is used on a host OS and
reads data written by a guest. The reader reads data from the ring-buffer via
POSIX shared memory, so the data is read without memory copying between
the guest and the host. To read data written by a guest, the -s option must
specify the same shared memory object that IVShmem uses.

Some options are available as follows:
-f: output log file
-h: show usage
-m: shared memory size in MB
-s: shared memory object path
-N: number of log files
-S: log file size in MB

Example:
./ivring_reader -m 2 -f /tmp/log.txt -S 10 -N 2 -s /ivshmem
In this case, two log files of 10MB each are output as /tmp/log.txt.0 and
/tmp/log.txt.1.
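
How the -f, -S and -N options could combine into numbered log files is sketched below. This is only a guess at the scheme: needs_rotation() and next_log_name() are hypothetical helpers, and wrapping back to index 0 after the last file is an assumption that this description does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* True once the current log file has reached the -S limit (given in MB). */
static int needs_rotation(long long written_bytes, int size_mb)
{
	return written_bytes >= (long long)size_mb * 1024 * 1024;
}

/*
 * Build "<base>.<index>" for the log file that follows `current`
 * (pass -1 before the first file). Returns the new index.
 * Assumption: indices wrap around after num_files files.
 */
static int next_log_name(char *out, size_t outlen, const char *base,
			 int current, int num_files)
{
	int next;

	assert(num_files > 0);
	next = (current + 1) % num_files;
	snprintf(out, outlen, "%s.%d", base, next);
	return next;
}
```

With -f /tmp/log.txt and -N 2 this yields /tmp/log.txt.0 and then /tmp/log.txt.1, matching the example above.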

Signed-off-by: Yoshihiro YUNOMAE 
Signed-off-by: Masami Hiramatsu 
Signed-off-by: Akihiro Nagai 
Cc: Borislav Petkov 
Cc: Arnaldo Carvalho de Melo 
Cc: linux-ker...@vger.kernel.org
Cc: Cam Macdonell 
Cc: qemu-devel@nongnu.org
Cc: system...@sourceware.org
---

 tools/Makefile|1 
 tools/ivshmem/Makefile|   19 ++
 tools/ivshmem/ivring_reader.c |  516 +
 tools/ivshmem/ivring_reader.h |   15 +
 tools/ivshmem/pr_msg.c|  125 ++
 tools/ivshmem/pr_msg.h|   19 ++
 6 files changed, 695 insertions(+), 0 deletions(-)
 create mode 100644 tools/ivshmem/Makefile
 create mode 100644 tools/ivshmem/ivring_reader.c
 create mode 100644 tools/ivshmem/ivring_reader.h
 create mode 100644 tools/ivshmem/pr_msg.c
 create mode 100644 tools/ivshmem/pr_msg.h

diff --git a/tools/Makefile b/tools/Makefile
index 3ae4394..3edf16a 100644
--- a/tools/Makefile
+++ b/tools/Makefile
@@ -5,6 +5,7 @@ help:
@echo ''
@echo '  cpupower   - a tool for all things x86 CPU power'
@echo '  firewire   - the userspace part of nosy, an IEEE-1394 traffic sniffer'
+   @echo '  ivshmem   - the userspace tool for ivshmem device'
@echo '  lguest - a minimal 32-bit x86 hypervisor'
@echo '  perf   - Linux performance measurement and analysis tool'
@echo '  selftests  - various kernel selftests'
diff --git a/tools/ivshmem/Makefile b/tools/ivshmem/Makefile
new file mode 100644
index 000..287508e
--- /dev/null
+++ b/tools/ivshmem/Makefile
@@ -0,0 +1,19 @@
+CC = gcc
+CFLAGS = -O1 -Wall -Werror -g
+LIBS = -lrt
+
+# makefile to build ivshmem tools
+
+all: ivring_reader
+
+.c.o:
+   $(CC) $(CFLAGS) -c $^ -o $@
+
+ivring_reader: ivring_reader.o pr_msg.o
+   $(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+install: ivring_reader
+   install ivring_reader /usr/local/bin/
+
+clean:
+   rm -f *.o ivring_reader
diff --git a/tools/ivshmem/ivring_reader.c b/tools/ivshmem/ivring_reader.c
new file mode 100644
index 000..d61e9c9
--- /dev/null
+++ b/tools/ivshmem/ivring_reader.c
@@ -0,0 +1,516 @@
+/*
+ * A trace reader for inter-VM shared memory
+ *
+ * (C) 2012 Hitachi, Ltd.
+ * Written by Hitachi Yokohama Research Laboratory.
+ *
+ * Created by Masami Hiramatsu 
+ *            Akihiro Nagai 
+ *            Yoshihiro Yunomae 
+ * based on IVShmem Server, http://www.gitorious.org/nahanni/guest-code,
+ *   (C) 2009 Cam Macdonell 
+ *
+ * Licensed under GPL version 2 only.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../../drivers/ivshmem/ivring.h"
+#include "pr_msg.h"
+#include "ivring_reader.h"
+
+/* default values */
+#define DEFAULT_SHM_SIZE (1024*1024)
+#define BUFFER_SIZE 4096
+
+static int global_term;
+static int global_outfd;
+static char *global_log_basename;
+static ssize_t global_log_rotate_size;
+static int global_log_rotate_num;
+#define log_rotate_mode() (global_log_rotate_size && global_log_rotate_num)
+
+/* Handle SIGTERM/SIGINT/SIGQUIT to exit */
+void term_handler(int sig)
+{
+   global_term = sig;
+   pr_info("Receive an interrupt %d\n", sig);
+}
+
+/* Utilities */
+static void *zalloc(size_t size)
+{
+   void *ret = malloc(size);
+   if (ret)
+   memset(ret, 0, size);
+   else
+   pr_perror("malloc");
+   return ret;
+}
+
+static u32 __fls32(u32 word)
+{
+   int num = 31;
+   if (!(word & (~0ul << 16))) {
+   num -= 16;
+   word <<= 16;
+   }
+   if (!(word & (~0ul << (32-8)))) {
+   num -= 8;
+   word <<= 8;
+   }
+   if (!(word & (~0ul << (32-4)))) {
+   num -= 4;
+   word <<= 4;
+   }
+   if (!(word & (~0ul << (32-2)))) {
+   num -= 2;
+   word <<= 2;
+   }
+   if (!(word & (~0ul << (32-1))))
+   num -= 1;
+   return num;
+}
+
+/* IVR

[Qemu-devel] [RFC PATCH 0/2] ivring: Add IVRing driver

2012-06-05 Thread Yoshihiro YUNOMAE
Hi All,

The following patch set provides a new communication path, "IVRing", that lets
a host collect kernel log or tracing data from guests without using the network
in a virtualization environment. The network is generally used to collect log or
tracing data after the data has been written out as a file. However, since I/O
resources such as network or block devices are shared with other guests, they
should not be used for logging or tracing. Moreover, because the network stack
has many layers, this puts a high load on guest applications that use network
I/O. A communication method that collects the data without using I/O resources
is therefore needed.

There are two requirements for collecting kernel log or tracing data by a host:
 (1) Minimize the impact on user applications in a guest
 - do not use I/O resources
 (2) Implement the recording buffer as a ring
 - keep recording log data or trace data continuously
To meet these requirements, a ring-buffer implemented as a device driver for
guest OSs, called IVRing, is built on the Inter-VM shared memory (IVShmem)
device. IVShmem, implemented in QEMU, is a virtual PCI RAM device that uses
POSIX shared memory on the host. The device was originally intended for
low-overhead communication between two guests. Here, in contrast, IVShmem
is used as a communication path between a guest and the host for collecting
data. IVRing is a buffer for logging or tracing data in a guest, and the
IVRing-reader, which opens the shared memory as IVRing on the host, reads the
data without memory copying between the guest and the host. Thus, both
requirements for collecting kernel log or tracing data are met.

We will talk about IVRing at LinuxCon Japan 2012:
https://events.linuxfoundation.org/events/linuxcon-japan
Title: Low-Overhead Ring-Buffer of Kernel Tracing &
   Tracing Across Host OS and Guest OS
Speakers: Yoshihiro Yunomae and Akihiro Nagai
You can download our slides about IVRing from the schedule page.

***Evaluation***
When a host collects tracing data of a guest, the performance of IVRing
is compared with that of using the network.


The overview of this evaluation is as follows:
 (a) A guest on KVM is prepared.
 - One physical CPU is dedicated to the guest as its virtual CPU (VCPU).

 (b) The guest starts to write tracing data to a SystemTap buffer.
 - The probe points of SystemTap are all trace points of sched, timer,
   and kmem.

 (c) The tracing data are either recorded to IVRing, which shares memory with
 the host, or sent to the host via the network.
 - 3 patterns, IVRing, NFS, and SSH, are measured.
   Each method is explained later.

 (d) While trace data is being written, dhrystone 2 in UNIX bench is executed
 as a benchmark tool in the guest.
 - Dhrystone 2 measures system performance by repeating integer arithmetic
   and reports it as a score.
 - A higher score means better system performance, so a score lower than
   in the bare environment indicates that some operation disturbs the
   integer arithmetic. We therefore define the overhead of transporting
   trace data as follows:
OVERHEAD = (1 - SCORE_OF_A_METHOD/BARE_SCORE) * 100.

The performance of each method is compared as follows:
 [1] IVRing
 - A SystemTap script in a guest records trace data to IVRing.
 - An IVRing-reader on a host reads the data.
 [2] NFS
 - A directory in a guest is shared with a directory in a host via NFS.
 - A SystemTap script in a guest records trace data to a file
   in the directory.
 [3] SSH
 - A SystemTap script in a guest outputs trace data to a host using
   standard output via SSH.

Other information is as follows:
 - host
   kernel: 3.3.1-5 (Fedora16)
   CPU: Intel Xeon x5660@2.80GHz(6core)
   Memory: 50GB

 - guest(only booting one guest)
   kernel: 3.4.0+ (Fedora16)
   CPU: 1VCPU(dedicated)
   Memory: 2GB


Results for the 3 patterns, relative to the bare environment, were as follows:
               Scores     Overhead against [0] Bare
 [0] Bare      29043600   -
 [1] IVRing    28565398    1.6[%]
 [2] NFS       22000508   24.3[%]
 [3] SSH       10246792   64.7[%]
The overhead of IVRing is much lower than that of the network-based methods.
This is because the IVRing method only records trace data to a ring-buffer,
whereas the other methods read trace data from a SystemTap buffer into
userland and then send the data to a host via the network. Using IVRing
therefore minimizes the overhead of transporting trace data from a guest to
a host.

***How to use***
This section briefly describes how to use IVRing and IVRing-reader.

1. Prepare any distribution that includes a qemu-kvm binary after version 0.13.0.
 IVShmem was merged into qemu-kvm mainline after version 0.13.0.
 The latest Fedora or Ubuntu can be used.

2. Boot a guest that has the IVRing driver installed, passing the device option.
 A device option is needed as follows:
-