Re: [Qemu-devel] [PATCH 3/5] trace-cmd: Support trace-agent of virtio-trace
Hi Steven,

(2012/08/22 22:51), Steven Rostedt wrote:
> On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
>> Add read path and control path to use trace-agent of virtio-trace.
>> When we use trace-agent, trace-cmd will be used as follows:
>> 	# AGENT_READ_DIR=/tmp/virtio-trace/tracing \
>> 	  AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
>> 	  TRACING_DIR=/tmp/virtio-trace/debugfs/tracing \
>
> Ha! You used "TRACING_DIR" but patch one introduces TRACE_DIR. Lets
> change this to DEBUG_TRACING_DIR instead anyway.

Oh, sorry for the confusion.

> Also, I don't like the generic environment variables. Perhaps
> VIRTIO_TRACE_DIR, or AGENT_TRACE_DIR and AGENT_TRACE_CTL. Lets try to
> keep the environment namespace sparse.

OK, I'll change these name of environment variables as follows:
AGENT_READ_DIR
AGENT_TRACE_CTL
GUEST_TRACING_DIR

>> 	  trace-cmd record -e "sched:*"
>> Here, AGENT_READ_DIR is the path for a reading directory of virtio-trace,
>> AGENT_CTL is a control path of trace-agent, and TRACING_DIR is a debugfs path
>> of a guest.
>>
>> Signed-off-by: Yoshihiro YUNOMAE
>> ---
>>  trace-cmd.h |1 +
>>  trace-recorder.c | 57 +-
>>  trace-util.c | 18 +
>>  3 files changed, 75 insertions(+), 1 deletions(-)
>>
>> diff --git a/trace-cmd.h b/trace-cmd.h
>> index f904dc5..75506ed 100644
>> --- a/trace-cmd.h
>> +++ b/trace-cmd.h
>> @@ -72,6 +72,7 @@ static inline int tracecmd_host_bigendian(void)
>>  }
>>
>>  char *tracecmd_find_tracing_dir(void);
>> +char *guest_agent_tracing_read_dir(void);
>>
>>  /* --- Opening and Reading the trace.dat file --- */
>>
>> diff --git a/trace-recorder.c b/trace-recorder.c
>> index 215affc..3b750e9 100644
>> --- a/trace-recorder.c
>> +++ b/trace-recorder.c
>> @@ -33,6 +33,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include "trace-cmd.h"
>>
>> @@ -43,6 +44,8 @@ struct tracecmd_recorder {
>>  	int page_size;
>>  	int cpu;
>>  	int stop;
>> +	int ctl_fd;
>> +	bool agent_existing;
>
> Thanks for the reminder. I need to convert a lot to use 'bool' instead.

I'll change 'int' just for flag to use 'bool' as much as possible
after finishing this patch set.

>>  };
>>
>>  void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
>> @@ -59,11 +62,29 @@ void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
>>  	free(recorder);
>>  }
>>
>> +static char *use_trace_agent_dir(char *ctl_path,
>> +				struct tracecmd_recorder *recorder)
>> +{
>> +	ctl_path = strdup(ctl_path);
>> +	if (!ctl_path)
>> +		die("malloc");
>> +	warning("Use environmental control path: %s\n", ctl_path);
>
> s/Use/Using/

OK, I'll correct this.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com
[Qemu-devel] [PATCH 3/5] trace-cmd: Support trace-agent of virtio-trace
Add read path and control path to use trace-agent of virtio-trace.
When we use trace-agent, trace-cmd will be used as follows:
	# AGENT_READ_DIR=/tmp/virtio-trace/tracing \
	  AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
	  TRACING_DIR=/tmp/virtio-trace/debugfs/tracing \
	  trace-cmd record -e "sched:*"
Here, AGENT_READ_DIR is the path for a reading directory of virtio-trace,
AGENT_CTL is a control path of trace-agent, and TRACING_DIR is a debugfs path
of a guest.

Signed-off-by: Yoshihiro YUNOMAE
---
 trace-cmd.h |1 +
 trace-recorder.c | 57 +-
 trace-util.c | 18 +
 3 files changed, 75 insertions(+), 1 deletions(-)

diff --git a/trace-cmd.h b/trace-cmd.h
index f904dc5..75506ed 100644
--- a/trace-cmd.h
+++ b/trace-cmd.h
@@ -72,6 +72,7 @@ static inline int tracecmd_host_bigendian(void)
 }
 
 char *tracecmd_find_tracing_dir(void);
+char *guest_agent_tracing_read_dir(void);
 
 /* --- Opening and Reading the trace.dat file --- */
 
diff --git a/trace-recorder.c b/trace-recorder.c
index 215affc..3b750e9 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -33,6 +33,7 @@
 #include
 #include
 #include
+#include
 
 #include "trace-cmd.h"
 
@@ -43,6 +44,8 @@ struct tracecmd_recorder {
 	int page_size;
 	int cpu;
 	int stop;
+	int ctl_fd;
+	bool agent_existing;
 };
 
 void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
@@ -59,11 +62,29 @@ void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
 	free(recorder);
 }
 
+static char *use_trace_agent_dir(char *ctl_path,
+				struct tracecmd_recorder *recorder)
+{
+	ctl_path = strdup(ctl_path);
+	if (!ctl_path)
+		die("malloc");
+	warning("Use environmental control path: %s\n", ctl_path);
+
+	recorder->ctl_fd = open(ctl_path, O_WRONLY);
+	if (recorder->ctl_fd < 0)
+		return NULL;
+
+	recorder->agent_existing = true;
+
+	return guest_agent_tracing_read_dir();
+}
+
 struct tracecmd_recorder *tracecmd_create_recorder_fd(int fd, int cpu)
 {
 	struct tracecmd_recorder *recorder;
 	char *tracing = NULL;
 	char *path = NULL;
+	char *ctl_path = NULL;
 	int ret;
 
 	recorder = malloc_or_die(sizeof(*recorder));
@@ -76,12 +97,23 @@ struct tracecmd_recorder *tracecmd_create_recorder_fd(int fd, int cpu)
 	recorder->trace_fd = -1;
 	recorder->brass[0] = -1;
 	recorder->brass[1] = -1;
+	recorder->ctl_fd = -1;
+	recorder->agent_existing = false;
 
 	recorder->page_size = getpagesize();
 
 	recorder->fd = fd;
 
-	tracing = tracecmd_find_tracing_dir();
+	/*
+	 * The trace-agent on a guest is controlled to run or stop by a host,
+	 * so we need to assign the control path of the trace-agent to use
+	 * virtio-trace.
+	 */
+	ctl_path = getenv("AGENT_CTL");
+	if (ctl_path)
+		tracing = use_trace_agent_dir(ctl_path, recorder);
+	else
+		tracing = tracecmd_find_tracing_dir();
 	if (!tracing) {
 		errno = ENODEV;
 		goto out_free;
@@ -182,6 +214,24 @@ long tracecmd_flush_recording(struct tracecmd_recorder *recorder)
 	return total;
 }
 
+static void operation_to_trace_agent(int ctl_fd, bool run_agent)
+{
+	if (run_agent == true)
+		write(ctl_fd, "1", 2);
+	else
+		write(ctl_fd, "0", 2);
+}
+
+static void run_operation_to_trace_agent(int ctl_fd)
+{
+	operation_to_trace_agent(ctl_fd, true);
+}
+
+static void stop_operation_to_trace_agent(int ctl_fd)
+{
+	operation_to_trace_agent(ctl_fd, false);
+}
+
 int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long sleep)
 {
 	struct timespec req;
@@ -189,6 +239,9 @@ int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long s
 
 	recorder->stop = 0;
 
+	if (recorder->agent_existing)
+		run_operation_to_trace_agent(recorder->ctl_fd);
+
 	do {
 		if (sleep) {
 			req.tv_sec = sleep / 100;
@@ -214,6 +267,8 @@ void tracecmd_stop_recording(struct tracecmd_recorder *recorder)
 	if (!recorder)
 		return;
 
+	if (recorder->agent_existing)
+		stop_operation_to_trace_agent(recorder->ctl_fd);
 	recorder->stop = 1;
 }
 
diff --git a/trace-util.c b/trace-util.c
index d5a3eb4..ff639be 100644
--- a/trace-util.c
+++ b/trace-util.c
@@ -304,6 +304,24 @@ static int mount_debugfs(void)
 	return ret;
 }
 
+char *guest_agent_tracing_read_dir(void)
+{
+	char *tracing_read_dir;
+
+	tracing_read_d
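As a side note on the control path in this patch: the start/stop order is a single character ("1" or "0") written to the descriptor opened from AGENT_CTL. The sketch below is not part of the patch; agent_ctl_write() is a hypothetical helper name, and it assumes that one byte is enough for the agent (the patch itself writes two bytes, the character plus the trailing NUL, and does not check the return value of write()).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in trace-cmd): send a start ('1') or stop ('0')
 * order to an already opened control descriptor.
 * Returns 0 when the single order byte was written, -1 otherwise.
 */
static int agent_ctl_write(int ctl_fd, bool run_agent)
{
	const char order = run_agent ? '1' : '0';
	ssize_t ret;

	assert(ctl_fd >= 0);

	/* Retry if a signal interrupts the write before any byte is sent */
	do {
		ret = write(ctl_fd, &order, 1);
	} while (ret < 0 && errno == EINTR);

	return ret == 1 ? 0 : -1;
}
```

Checking the result lets the caller notice when the FIFO has no reader, which the unchecked write() in the patch would silently miss.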
[Qemu-devel] [PATCH 4/5] trace-cmd: Add non-blocking option for open() and splice_read()
Add a non-blocking option to open() and splice_read() so that reading trace
data of a guest from a FIFO does not block.

When a FIFO is assigned as the read I/F, splice_read() normally blocks, so the
read/write processes stay blocked even when SIGINT arrives from the parent
process. So, we added the nonblock option to open() and splice_read().

Signed-off-by: Yoshihiro YUNOMAE
---
 trace-recorder.c | 13 -
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/trace-recorder.c b/trace-recorder.c
index 3b750e9..6577fe8 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -124,7 +124,7 @@ struct tracecmd_recorder *tracecmd_create_recorder_fd(int fd, int cpu)
 		goto out_free;
 
 	sprintf(path, "%s/per_cpu/cpu%d/trace_pipe_raw", tracing, cpu);
-	recorder->trace_fd = open(path, O_RDONLY);
+	recorder->trace_fd = open(path, O_RDONLY | O_NONBLOCK);
 	if (recorder->trace_fd < 0)
 		goto out_free;
 
@@ -172,14 +172,17 @@ static long splice_data(struct tracecmd_recorder *recorder)
 	long ret;
 
 	ret = splice(recorder->trace_fd, NULL, recorder->brass[1], NULL,
-		     recorder->page_size, 1 /* SPLICE_F_MOVE */);
+		     recorder->page_size, SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
 	if (ret < 0) {
-		warning("recorder error in splice input");
-		return -1;
+		if (errno != EAGAIN) {
+			warning("recorder error in splice input");
+			return -1;
+		}
+		return 0; /* Buffer is empty */
 	}
 
 	ret = splice(recorder->brass[0], NULL, recorder->fd, NULL,
-		     recorder->page_size, 3 /* and NON_BLOCK */);
+		     recorder->page_size, SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
 	if (ret < 0) {
 		if (errno != EAGAIN) {
 			warning("recorder error in splice output");
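To illustrate the EAGAIN handling described in this patch outside of trace-cmd, here is a minimal Linux-only sketch. splice_nonblock() is a hypothetical helper, not a trace-cmd function; it assumes at least one of the two descriptors is a pipe, as splice() requires.

```c
#define _GNU_SOURCE		/* splice() and SPLICE_F_* are Linux extensions */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from in_fd to out_fd without
 * blocking. Returns the number of bytes moved, 0 when no data is available
 * yet (or at end of input), and -1 on any other error.
 */
static long splice_nonblock(int in_fd, int out_fd, size_t len)
{
	ssize_t ret;

	ret = splice(in_fd, NULL, out_fd, NULL, len,
		     SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
	if (ret < 0) {
		if (errno == EAGAIN)
			return 0;	/* buffer is empty, try again later */
		return -1;
	}
	return ret;
}
```

Because "empty" and "end of input" both come back as 0 here, a caller that needs to tell them apart would have to keep the raw splice() result instead.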
[Qemu-devel] [PATCH 5/5] trace-cmd: Use polling function
Use poll() to avoid a busy loop when reading trace data of a guest from a FIFO.

Signed-off-by: Yoshihiro YUNOMAE
---
 trace-recorder.c | 42 --
 1 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/trace-recorder.c b/trace-recorder.c
index 6577fe8..bdf9798 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -34,9 +34,12 @@
 #include
 #include
 #include
+#include
 
 #include "trace-cmd.h"
 
+#define WAIT_MSEC 1
+
 struct tracecmd_recorder {
 	int fd;
 	int trace_fd;
@@ -235,9 +238,37 @@ static void stop_operation_to_trace_agent(int ctl_fd)
 	operation_to_trace_agent(ctl_fd, false);
 }
 
-int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long sleep)
+static int wait_data(struct tracecmd_recorder *recorder, unsigned long sleep)
 {
+	struct pollfd poll_fd;
 	struct timespec req;
+	int ret = 0;
+
+	if (recorder->agent_existing) {
+		poll_fd.fd = recorder->trace_fd;
+		poll_fd.events = POLLIN;
+		while (1) {
+			ret = poll(&poll_fd, 1, WAIT_MSEC);
+
+			if(ret < 0) {
+				warning("polling error");
+				return ret;
+			}
+
+			if (ret)
+				break;
+		}
+	} else if (sleep) {
+		req.tv_sec = sleep / 100;
+		req.tv_nsec = (sleep % 100) * 1000;
+		nanosleep(&req, NULL);
+	}
+
+	return ret;
+}
+
+int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long sleep)
+{
 	long ret;
 
 	recorder->stop = 0;
@@ -246,11 +277,10 @@ int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long s
 		run_operation_to_trace_agent(recorder->ctl_fd);
 
 	do {
-		if (sleep) {
-			req.tv_sec = sleep / 100;
-			req.tv_nsec = (sleep % 100) * 1000;
-			nanosleep(&req, NULL);
-		}
+		ret = wait_data(recorder, sleep);
+		if (ret < 0)
+			return ret;
+
 		ret = splice_data(recorder);
 		if (ret < 0)
 			return ret;
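For reference, the waiting step can be tried on its own with an ordinary pipe. The following is only a sketch; wait_readable() is a hypothetical helper and not the wait_data() function of this patch. It differs in one respect: it returns on timeout instead of looping.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd becomes readable or timeout_ms expires.
 * Returns 1 when data (or a hang-up) is pending, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	/* Restart the wait if a signal interrupts poll() */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret <= 0)
		return ret;

	return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}
```

In the patch, wait_data() keeps polling until data arrives. Returning on timeout, as in this sketch, would let the caller re-check a stop flag between waits; whether that is wanted in trace-cmd is a design question for the patch author.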
[Qemu-devel] [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs
From: Masami Hiramatsu

Count debugfs/tracing/per_cpu/cpu* to determine the number of CPUs.

Signed-off-by: Masami Hiramatsu
Signed-off-by: Yoshihiro YUNOMAE
---
 trace-record.c | 41 +
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/trace-record.c b/trace-record.c
index 9dc18a9..ed18951 100644
--- a/trace-record.c
+++ b/trace-record.c
@@ -1179,6 +1179,41 @@ static void expand_event_list(void)
 	}
 }
 
+static int count_tracingdir_cpus(void)
+{
+	char *tracing_dir = NULL;
+	char *percpu_dir = NULL;
+	struct dirent **namelist;
+	int count = 0, n;
+
+	/* Count cpus in per_cpu directory */
+	tracing_dir = tracecmd_find_tracing_dir();
+	if (!tracing_dir)
+		return 0;
+	percpu_dir = malloc_or_die(strlen(tracing_dir) + 9);
+	if (!percpu_dir)
+		goto err;
+
+	sprintf(percpu_dir, "%s/per_cpu", tracing_dir);
+
+	n = scandir(percpu_dir, &namelist, NULL, alphasort);
+	if (n > 0) {
+		while (n--) {
+			if (strncmp("cpu", namelist[n]->d_name, 3) == 0)
+				count++;
+			free(namelist[n]);
+		}
+		free(namelist);
+	}
+
+	if (percpu_dir)
+		free(percpu_dir);
+err:
+	if (tracing_dir)
+		free(tracing_dir);
+	return count;
+}
+
 static int count_cpus(void)
 {
 	FILE *fp;
@@ -1189,6 +1224,12 @@ static int count_cpus(void)
 	size_t n;
 	int r;
 
+	cpus = count_tracingdir_cpus();
+	if (cpus > 0)
+		return cpus;
+
+	warning("failed to use tracing_dir to determine number of CPUS");
+
 	cpus = sysconf(_SC_NPROCESSORS_CONF);
 	if (cpus > 0)
 		return cpus;
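The counting idea can also be sketched as a stand-alone function. This is an illustration only, not the code of the patch: count_cpu_entries() is a hypothetical name, it uses opendir()/readdir() rather than scandir(), and it additionally requires a digit after "cpu", which the patch does not.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: count entries named cpu<N> directly under dir_path,
 * for example <tracing dir>/per_cpu.
 * Returns the count, or -1 when the directory cannot be opened.
 */
static int count_cpu_entries(const char *dir_path)
{
	struct dirent *ent;
	DIR *dir;
	int count = 0;

	assert(dir_path != NULL);

	dir = opendir(dir_path);
	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		const char *name = ent->d_name;

		/* Accept "cpu" followed by at least one digit, e.g. cpu0 */
		if (strncmp(name, "cpu", 3) == 0 &&
		    isdigit((unsigned char)name[3]))
			count++;
	}
	closedir(dir);

	return count;
}
```

Returning -1 for an unreadable directory keeps "no per_cpu directory" distinct from "zero CPUs found", so a caller can fall back to sysconf() in either case but still log the difference.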
[Qemu-devel] [PATCH 1/5] trace-cmd: Use TRACE_DIR environment variable if defined
From: Masami Hiramatsu

Use TRACE_DIR environment variable for setting debugfs/tracing directory if
defined. This is for controlling guest (or remote) ftrace.

Signed-off-by: Masami Hiramatsu
Signed-off-by: Yoshihiro YUNOMAE
---
 trace-util.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/trace-util.c b/trace-util.c
index e128188..d5a3eb4 100644
--- a/trace-util.c
+++ b/trace-util.c
@@ -311,6 +311,15 @@ char *tracecmd_find_tracing_dir(void)
 	char type[100];
 	FILE *fp;
 
+	tracing_dir = getenv("TRACE_DIR");
+	if (tracing_dir) {
+		tracing_dir = strdup(tracing_dir);
+		if (!tracing_dir)
+			die("malloc");
+		warning("Use environmental tracing directory: %s\n", tracing_dir);
+		return tracing_dir;
+	}
+
 	if ((fp = fopen("/proc/mounts","r")) == NULL) {
 		warning("Can't open /proc/mounts for read");
 		return NULL;
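The override pattern used here (environment variable first, normal lookup otherwise) can be shown in isolation. The sketch below is an assumption-laden illustration: dir_from_env() is a hypothetical helper, and the fixed fallback string stands in for the /proc/mounts search that trace-cmd actually performs.

```c
#define _POSIX_C_SOURCE 200809L	/* for strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'ed copy of the environment variable
 * `name` when it is set and non-empty, otherwise a copy of `fallback`.
 * Returns NULL on allocation failure. The caller frees the result.
 */
static char *dir_from_env(const char *name, const char *fallback)
{
	const char *val;

	assert(name != NULL && fallback != NULL);

	val = getenv(name);
	if (!val || val[0] == '\0')
		val = fallback;

	return strdup(val);
}
```

Copying the value, as the patch also does, matters because the string returned by getenv() belongs to the environment and may change on a later setenv().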
[Qemu-devel] [PATCH 0/5] trace-cmd: Add a recorder readable feature for virtio-trace
Hi Steven,

The following patch set provides a feature which can read trace data of a guest
using virtio-trace (https://lkml.org/lkml/2012/8/9/210) for a recorder
function of trace-cmd. This patch set depends on the trace-agent running on a
guest in the virtio-trace system.

To translate raw data of a guest to text data on a host, information of debugfs
in the guest is also needed on the host. In other words, the guest's debugfs
must be exposed (mounted) on the host via another serial line (we don't like to
depend on network connection). For this purpose, we'll use DIOD 9pfs server
(http://code.google.com/p/diod/) as below.

***HOW TO USE***
We explain how to translate raw data to text data on a host using trace-cmd
with this patch set applied and virtio-trace.

- Preparation
1. Make FIFO in a host
 virtio-trace uses virtio-serial pipes as trace data paths, one per CPU, plus a
 control path, so FIFOs (named pipes) should be created as follows:
	# mkdir /tmp/virtio-trace/
	# mkfifo /tmp/virtio-trace/trace-path-cpu{0,1,2,...,X}.{in,out}
	# mkfifo /tmp/virtio-trace/agent-ctl-path.{in,out}

 Here, if we assign 1VCPU for a guest, then we set as follows:
	trace-path-cpu0.{in,out}
 and
	agent-ctl-path.{in,out}.

2. Set up of virtio-serial pipe and unix in a host
 Add qemu option to use virtio-serial pipe for tracing and unix for debugfs.

 ##virtio-serial device##
     -device virtio-serial-pci,id=virtio-serial0\
 ##control path##
     -chardev pipe,id=charchannel0,path=/tmp/virtio-trace/agent-ctl-path\
     -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,\
      id=channel0,name=agent-ctl-path\
 ##data path##
     -chardev pipe,id=charchannel1,path=/tmp/virtio-trace/trace-path-cpu0\
     -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,\
      id=channel1,name=trace-path-cpu0\
 ##9pfs path##
     -device virtio-serial \
     -chardev socket,path=/tmp/virtio-trace/trace-9pfs,server,nowait, \
      id=trace-9pfs \
     -device virtserialport,chardev=trace-9pfs,name=virtioserial

 If you manage guests with libvirt, add the following tags to domain XML files.
 Then, libvirt passes the same command option to qemu.

 Here, chardev names are restricted to trace-path-cpu0 and agent-ctl-path. UNIX
 domain socket is automatically created on a host.

3. Boot the guest
 You can find some chardev in /dev/virtio-ports/ in the guest.

4. Create symbolic link for trace-cmd on the host
	# ln -s /tmp/virtio-trace/trace-path-cpu0.out \
	  /tmp/virtio-trace/tracing/per_cpu/cpu0/trace_pipe_raw

5. Wait for 9pfs server on the host
	# mount -t 9p -o trans=unix,access=any,uname=root, \
	  aname=/sys/kernel/debug,version=9p2000.L \
	  /tmp/virtio-trace/trace-9pfs /tmp/virtio-trace/debugfs

6. Run DIOD on the guest
	# diod -E -Nn -u 0

7. Connect DIOD to virtio-console on the guest
	# socat TCP4:127.0.0.1:564 /dev/virtio-ports/trace-9pfs

- Execution
1. Run trace-agent on the guest
	# ./trace-agent

2. Execute trace-cmd on the host
	# AGENT_READ_DIR=/tmp/virtio-trace/tracing \
	  AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
	  TRACE_DIR=/tmp/virtio-trace/debugfs/tracing \
	  ./trace-cmd record -e "sched:*"

3. Translate raw data to text data on the host
	# ./trace-cmd report trace.dat

***Just enhancement ideas***
 - Support for trace-cmd => done
 - Support for 9pfs protocol
 - Support for non-blocking mode in QEMU

Thank you,

---

Masami Hiramatsu (2):
      trace-cmd: Use tracing directory to count CPUs
      trace-cmd: Use TRACE_DIR environment variable if defined

Yoshihiro YUNOMAE (3):
      trace-cmd: Use polling function
      trace-cmd: Add non-blocking option for open() and splice_read()
      trace-cmd: Support trace-agent of virtio-trace

 trace-cmd.h |1
 trace-record.c | 41
 trace-recorder.c | 112 --
 trace-util.c | 27 +
 4 files changed, 169 insertions(+), 12 deletions(-)

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com
[Qemu-devel] [PATCH V2 6/6] tools: Add guest trace agent as a user tool
This patch adds a user tool, "trace agent" for sending trace data of a guest to
a Host in low overhead. This agent has the following functions:
 - splice a page of ring-buffer to read_pipe without memory copying
 - splice the page from write_pipe to virtio-console without memory copying
 - write trace data to stdout by using -o option
 - controlled by start/stop orders from a Host

Changes in v2:
 - Cleanup (change fprintf() to pr_err() and an include guard)

Signed-off-by: Yoshihiro YUNOMAE
---
 tools/virtio/virtio-trace/Makefile | 14 +
 tools/virtio/virtio-trace/README| 118
 tools/virtio/virtio-trace/trace-agent-ctl.c | 137 ++
 tools/virtio/virtio-trace/trace-agent-rw.c | 192 +++
 tools/virtio/virtio-trace/trace-agent.c | 270 +++
 tools/virtio/virtio-trace/trace-agent.h | 75
 6 files changed, 806 insertions(+), 0 deletions(-)
 create mode 100644 tools/virtio/virtio-trace/Makefile
 create mode 100644 tools/virtio/virtio-trace/README
 create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.h

diff --git a/tools/virtio/virtio-trace/Makefile b/tools/virtio/virtio-trace/Makefile
new file mode 100644
index 000..ef3adfc
--- /dev/null
+++ b/tools/virtio/virtio-trace/Makefile
@@ -0,0 +1,14 @@
+CC = gcc
+CFLAGS = -O2 -Wall
+LFLAG = -lpthread
+
+all: trace-agent
+
+.c.o:
+	$(CC) $(CFLAGS) $(LFLAG) -c $^ -o $@
+
+trace-agent: trace-agent.o trace-agent-ctl.o trace-agent-rw.o
+	$(CC) $(CFLAGS) $(LFLAG) -o $@ $^
+
+clean:
+	rm -f *.o trace-agent
diff --git a/tools/virtio/virtio-trace/README b/tools/virtio/virtio-trace/README
new file mode 100644
index 000..b64845b
--- /dev/null
+++ b/tools/virtio/virtio-trace/README
@@ -0,0 +1,118 @@
+Trace Agent for virtio-trace
+============================
+
+Trace agent is a user tool for sending trace data of a guest to a Host in low
+overhead. Trace agent has the following functions:
+ - splice a page of ring-buffer to read_pipe without memory copying
+ - splice the page from write_pipe to virtio-console without memory copying
+ - write trace data to stdout by using -o option
+ - controlled by start/stop orders from a Host
+
+The trace agent operates as follows:
+ 1) Initialize all structures.
+ 2) Create a read/write thread per CPU. Each thread is bound to a CPU.
+    The read/write threads hold it.
+ 3) A controller thread does poll() for a start order of a host.
+ 4) After the controller of the trace agent receives a start order from a host,
+    the controller wake read/write threads.
+ 5) The read/write threads start to read trace data from ring-buffers and
+    write the data to virtio-serial.
+ 6) If the controller receives a stop order from a host, the read/write threads
+    stop to read trace data.
+
+
+Files
+=====
+
+README: this file
+Makefile: Makefile of trace agent for virtio-trace
+trace-agent.c: includes main function, sets up for operating trace agent
+trace-agent.h: includes all structures and some macros
+trace-agent-ctl.c: includes controller function for read/write threads
+trace-agent-rw.c: includes read/write threads function
+
+
+Setup
+=====
+
+To use this trace agent for virtio-trace, we need to prepare some virtio-serial
+I/Fs.
+
+1) Make FIFO in a host
+ virtio-trace uses virtio-serial pipe as trace data paths as to the number
+of CPUs and a control path, so FIFO (named pipe) should be created as follows:
+	# mkdir /tmp/virtio-trace/
+	# mkfifo /tmp/virtio-trace/trace-path-cpu{0,1,2,...,X}.{in,out}
+	# mkfifo /tmp/virtio-trace/agent-ctl-path.{in,out}
+
+For example, if a guest use three CPUs, the names are
+	trace-path-cpu{0,1,2}.{in.out}
+and
+	agent-ctl-path.{in,out}.
+
+2) Set up of virtio-serial pipe in a host
+ Add qemu option to use virtio-serial pipe.
+
+ ##virtio-serial device##
+     -device virtio-serial-pci,id=virtio-serial0\
+ ##control path##
+     -chardev pipe,id=charchannel0,path=/tmp/virtio-trace/agent-ctl-path\
+     -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,\
+      id=channel0,name=agent-ctl-path\
+ ##data path##
+     -chardev pipe,id=charchannel1,path=/tmp/virtio-trace/trace-path-cpu0\
+     -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel0,\
+      id=channel1,name=trace-path-cpu0\
+      ...
+
+If you manage guests with libvirt, add the following tags to domain XML files.
+Then, libvirt passes the same command option to qemu.
+
+
+
+
+
+
+
+
+
+
+
+	...
+Here, chardev names are restricted to trace-path-cpuX and agent-ctl-path. For
+example, if a guest use three CPUs, chardev names should be trace-path-cpu0,
+trace-path-c
[Qemu-devel] [PATCH V2 4/6] ftrace: Allow stealing pages from pipe buffer
From: Masami Hiramatsu

Use generic steal operation on pipe buffer to allow stealing
ring buffer's read page from pipe buffer.

Note that this could reduce the performance of the splice_write side
operation when no affinity is set. The ring buffer's read pages are
allocated on the tracing node, but the splice user does not always execute
the splice write side operation on the same node. In that case, the page
will be accessed from another node.
Thus, it is strongly recommended to assign the splicing thread to the
corresponding node.

Signed-off-by: Masami Hiramatsu
Acked-by: Steven Rostedt
---
 kernel/trace/trace.c |8 +---
 1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a120f98..ae01930 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4194,12 +4194,6 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
 	buf->private = 0;
 }
 
-static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe,
-				 struct pipe_buffer *buf)
-{
-	return 1;
-}
-
 static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
@@ -4215,7 +4209,7 @@ static const struct pipe_buf_operations buffer_pipe_buf_ops = {
 	.unmap			= generic_pipe_buf_unmap,
 	.confirm		= generic_pipe_buf_confirm,
 	.release		= buffer_pipe_buf_release,
-	.steal			= buffer_pipe_buf_steal,
+	.steal			= generic_pipe_buf_steal,
 	.get			= buffer_pipe_buf_get,
 };
 
[Qemu-devel] [PATCH V2 5/6] virtio/console: Allocate scatterlist according to the current pipe size
From: Masami Hiramatsu

Allocate scatterlist according to the current pipe size.
This allows splicing a bigger buffer if the pipe size has
been changed by fcntl.

Changes in v2:
 - Just a minor fix to avoid a conflict with the previous patch.

Signed-off-by: Masami Hiramatsu
---
 drivers/char/virtio_console.c | 23 ---
 1 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index b2fc2ab..e88f843 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -229,7 +229,6 @@ struct port {
 	bool guest_connected;
 };
 
-#define MAX_SPLICE_PAGES	32
 /* This is the very early arch-specified put chars function. */
 static int (*early_put_chars)(u32, const char *, int);
 
@@ -482,15 +481,16 @@ struct buffer_token {
 		void *buf;
 		struct scatterlist *sg;
 	} u;
-	bool sgpages;
+	/* If sgpages == 0 then buf is used, else sg is used */
+	unsigned int sgpages;
 };
 
-static void reclaim_sg_pages(struct scatterlist *sg)
+static void reclaim_sg_pages(struct scatterlist *sg, unsigned int nrpages)
 {
 	int i;
 	struct page *page;
 
-	for (i = 0; i < MAX_SPLICE_PAGES; i++) {
+	for (i = 0; i < nrpages; i++) {
 		page = sg_page(&sg[i]);
 		if (!page)
 			break;
@@ -511,7 +511,7 @@ static void reclaim_consumed_buffers(struct port *port)
 	}
 	while ((tok = virtqueue_get_buf(port->out_vq, &len))) {
 		if (tok->sgpages)
-			reclaim_sg_pages(tok->u.sg);
+			reclaim_sg_pages(tok->u.sg, tok->sgpages);
 		else
 			kfree(tok->u.buf);
 		kfree(tok);
@@ -581,7 +581,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
 	tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
 	if (!tok)
 		return -ENOMEM;
-	tok->sgpages = false;
+	tok->sgpages = 0;
 	tok->u.buf = in_buf;
 
 	sg_init_one(sg, in_buf, in_count);
@@ -597,7 +597,7 @@ static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents,
 	tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
 	if (!tok)
 		return -ENOMEM;
-	tok->sgpages = true;
+	tok->sgpages = nents;
 	tok->u.sg = sg;
 
 	return __send_to_port(port, sg, nents, in_count, tok, nonblock);
@@ -797,6 +797,7 @@ out:
 
 struct sg_list {
 	unsigned int n;
+	unsigned int size;
 	size_t len;
 	struct scatterlist *sg;
 };
@@ -807,7 +808,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 	struct sg_list *sgl = sd->u.data;
 	unsigned int offset, len;
 
-	if (sgl->n == MAX_SPLICE_PAGES)
+	if (sgl->n == sgl->size)
 		return 0;
 
 	/* Try lock this page */
@@ -868,12 +869,12 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
 
 	sgl.n = 0;
 	sgl.len = 0;
-	sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,
-			 GFP_KERNEL);
+	sgl.size = pipe->nrbufs;
+	sgl.sg = kmalloc(sizeof(struct scatterlist) * sgl.size, GFP_KERNEL);
 	if (unlikely(!sgl.sg))
 		return -ENOMEM;
 
-	sg_init_table(sgl.sg, MAX_SPLICE_PAGES);
+	sg_init_table(sgl.sg, sgl.size);
 	ret = __splice_from_pipe(pipe, &sd, pipe_to_sg);
 	if (likely(ret > 0))
 		ret = send_pages(port, sgl.sg, sgl.n, sgl.len, true);
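For readers who want to see the user-space side of "the pipe size has been changed by fcntl": on Linux this is done with F_SETPIPE_SZ. The sketch below is not from the patch set; set_pipe_capacity() is a hypothetical helper, and it only resizes a pipe. Splicing into a virtio port afterwards needs a guest with the device, which is not shown.

```c
#define _GNU_SOURCE		/* F_SETPIPE_SZ and F_GETPIPE_SZ are Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to give the pipe behind fd a capacity
 * of at least `bytes`. Returns the capacity actually set (the kernel may
 * round the request up), or -1 on failure.
 */
static long set_pipe_capacity(int fd, int bytes)
{
	if (fcntl(fd, F_SETPIPE_SZ, bytes) < 0)
		return -1;

	return fcntl(fd, F_GETPIPE_SZ);
}
```

A larger pipe can hold more buffers at once, which is the case this patch prepares for by sizing the scatterlist from the pipe instead of a fixed MAX_SPLICE_PAGES.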
[Qemu-devel] [PATCH V2 3/6] virtio/console: Wait until the port is ready on splice
From: Masami Hiramatsu

On splice, wait if the port is not connected or is full, as write
already does.

Signed-off-by: Masami Hiramatsu
---
 drivers/char/virtio_console.c | 39 +++
 1 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 22b7373..b2fc2ab 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -724,6 +724,26 @@ static ssize_t port_fops_read(struct file *filp, char __user *ubuf,
 	return fill_readbuf(port, ubuf, count, true);
 }
 
+static int wait_port_writable(struct port *port, bool nonblock)
+{
+	int ret;
+
+	if (will_write_block(port)) {
+		if (nonblock)
+			return -EAGAIN;
+
+		ret = wait_event_freezable(port->waitqueue,
+					   !will_write_block(port));
+		if (ret < 0)
+			return ret;
+	}
+	/* Port got hot-unplugged. */
+	if (!port->guest_connected)
+		return -ENODEV;
+
+	return 0;
+}
+
 static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 			       size_t count, loff_t *offp)
 {
@@ -740,18 +760,9 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 
 	nonblock = filp->f_flags & O_NONBLOCK;
 
-	if (will_write_block(port)) {
-		if (nonblock)
-			return -EAGAIN;
-
-		ret = wait_event_freezable(port->waitqueue,
-					   !will_write_block(port));
-		if (ret < 0)
-			return ret;
-	}
-	/* Port got hot-unplugged. */
-	if (!port->guest_connected)
-		return -ENODEV;
+	ret = wait_port_writable(port, nonblock);
+	if (ret < 0)
+		return ret;
 
 	count = min((size_t)(32 * 1024), count);
 
@@ -851,6 +862,10 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
 		.u.data = &sgl,
 	};
 
+	ret = wait_port_writable(port, filp->f_flags & O_NONBLOCK);
+	if (ret < 0)
+		return ret;
+
 	sgl.n = 0;
 	sgl.len = 0;
 	sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,
[Qemu-devel] [PATCH V2 2/6] virtio/console: Add a failback for unstealable pipe buffer
From: Masami Hiramatsu

Add a failback memcpy path for unstealable pipe buffer.
If buf->ops->steal() fails, virtio-serial tries to
copy the page contents to an allocated page, instead
of just failing splice().

Signed-off-by: Masami Hiramatsu
---
 drivers/char/virtio_console.c | 28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 730816c..22b7373 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -794,7 +794,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 			struct splice_desc *sd)
 {
 	struct sg_list *sgl = sd->u.data;
-	unsigned int len = 0;
+	unsigned int offset, len;
 
 	if (sgl->n == MAX_SPLICE_PAGES)
 		return 0;
@@ -807,9 +807,31 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 
 		len = min(buf->len, sd->len);
 		sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset);
-		sgl->n++;
-		sgl->len += len;
+	} else {
+		/* Failback to copying a page */
+		struct page *page = alloc_page(GFP_KERNEL);
+		char *src = buf->ops->map(pipe, buf, 1);
+		char *dst;
+
+		if (!page)
+			return -ENOMEM;
+		dst = kmap(page);
+
+		offset = sd->pos & ~PAGE_MASK;
+
+		len = sd->len;
+		if (len + offset > PAGE_SIZE)
+			len = PAGE_SIZE - offset;
+
+		memcpy(dst + offset, src + buf->offset, len);
+
+		kunmap(page);
+		buf->ops->unmap(pipe, buf, src);
+
+		sg_set_page(&(sgl->sg[sgl->n]), page, len, offset);
 	}
+	sgl->n++;
+	sgl->len += len;
 
 	return len;
 }
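The length clamp in the copy path above is easy to check with plain arithmetic. The following user-space sketch mirrors only that calculation; clamp_to_page() and SKETCH_PAGE_SIZE are hypothetical names, and the 4096-byte page size is an assumption made for the example (the kernel code uses PAGE_SIZE and PAGE_MASK).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/*
 * Hypothetical user-space mirror of the clamp in the copy path:
 * the offset is the position within its page, and the length is cut
 * so that offset + length never crosses the end of that page.
 */
static size_t clamp_to_page(size_t pos, size_t len, size_t *offset_out)
{
	/* Same value as "pos & ~PAGE_MASK" in the kernel code */
	size_t offset = pos & (SKETCH_PAGE_SIZE - 1);

	if (len + offset > SKETCH_PAGE_SIZE)
		len = SKETCH_PAGE_SIZE - offset;

	if (offset_out)
		*offset_out = offset;

	return len;
}
```

For example, a 200-byte request at position 4000 is cut to 96 bytes, because only 96 bytes remain before the page boundary.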
[Qemu-devel] [PATCH V2 1/6] virtio/console: Add splice_write support
From: Masami Hiramatsu

Enable splice_write from a pipe to a virtio-console port.
This steals pages from the pipe and sends them directly to the host.

Note that this may accelerate only the guest to host path.

Changes in v2:
 - Use GFP_KERNEL instead of GFP_ATOMIC in syscall context function.

Signed-off-by: Masami Hiramatsu
---
 drivers/char/virtio_console.c | 136 +++--
 1 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index cdf2f54..730816c 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -24,6 +24,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -227,6 +229,7 @@ struct port {
 	bool guest_connected;
 };
 
+#define MAX_SPLICE_PAGES	32
 /* This is the very early arch-specified put chars function. */
 static int (*early_put_chars)(u32, const char *, int);
 
@@ -474,26 +477,52 @@ static ssize_t send_control_msg(struct port *port, unsigned int event,
 	return 0;
 }
 
+struct buffer_token {
+	union {
+		void *buf;
+		struct scatterlist *sg;
+	} u;
+	bool sgpages;
+};
+
+static void reclaim_sg_pages(struct scatterlist *sg)
+{
+	int i;
+	struct page *page;
+
+	for (i = 0; i < MAX_SPLICE_PAGES; i++) {
+		page = sg_page(&sg[i]);
+		if (!page)
+			break;
+		put_page(page);
+	}
+	kfree(sg);
+}
+
 /* Callers must take the port->outvq_lock */
 static void reclaim_consumed_buffers(struct port *port)
 {
-	void *buf;
+	struct buffer_token *tok;
 	unsigned int len;
 
 	if (!port->portdev) {
 		/* Device has been unplugged. vqs are already gone. */
 		return;
 	}
-	while ((buf = virtqueue_get_buf(port->out_vq, &len))) {
-		kfree(buf);
+	while ((tok = virtqueue_get_buf(port->out_vq, &len))) {
+		if (tok->sgpages)
+			reclaim_sg_pages(tok->u.sg);
+		else
+			kfree(tok->u.buf);
+		kfree(tok);
 		port->outvq_full = false;
 	}
 }
 
-static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
-			bool nonblock)
+static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
+			      int nents, size_t in_count,
+			      struct buffer_token *tok, bool nonblock)
 {
-	struct scatterlist sg[1];
 	struct virtqueue *out_vq;
 	ssize_t ret;
 	unsigned long flags;
@@ -505,8 +534,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
 
 	reclaim_consumed_buffers(port);
 
-	sg_init_one(sg, in_buf, in_count);
-	ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC);
+	ret = virtqueue_add_buf(out_vq, sg, nents, 0, tok, GFP_ATOMIC);
 
 	/* Tell Host to go! */
 	virtqueue_kick(out_vq);
@@ -544,6 +572,37 @@ done:
 	return in_count;
 }
 
+static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count,
+			bool nonblock)
+{
+	struct scatterlist sg[1];
+	struct buffer_token *tok;
+
+	tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
+	if (!tok)
+		return -ENOMEM;
+	tok->sgpages = false;
+	tok->u.buf = in_buf;
+
+	sg_init_one(sg, in_buf, in_count);
+
+	return __send_to_port(port, sg, 1, in_count, tok, nonblock);
+}
+
+static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents,
+			  size_t in_count, bool nonblock)
+{
+	struct buffer_token *tok;
+
+	tok = kmalloc(sizeof(*tok), GFP_ATOMIC);
+	if (!tok)
+		return -ENOMEM;
+	tok->sgpages = true;
+	tok->u.sg = sg;
+
+	return __send_to_port(port, sg, nents, in_count, tok, nonblock);
+}
+
 /*
  * Give out the data that's requested from the buffer that we have
  * queued up.
@@ -725,6 +784,66 @@ out:
 	return ret;
 }
 
+struct sg_list {
+	unsigned int n;
+	size_t len;
+	struct scatterlist *sg;
+};
+
+static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			struct splice_desc *sd)
+{
+	struct sg_list *sgl = sd->u.data;
+	unsigned int len = 0;
+
+	if (sgl->n == MAX_SPLICE_PAGES)
+		return 0;
+
+	/* Try lock this page */
+	if (buf->ops->steal(pipe, buf) == 0) {
+		/* Get reference and unlock page for moving */
+		get_page(buf->page);
+		unlock_page(buf->page);
+
+		len = min(buf->len, sd->len);
+		sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset);
+		sgl->n++;
+		sgl->len += len;
+	}
+
+	return len;
+}
+
+/* Faster zero-copy w
[Qemu-devel] [PATCH V2 0/6] virtio-trace: Support virtio-trace
- Use GFP_KERNEL instead of GFP_ATOMIC in syscall context function in 1/6
- Just a minor fix to avoid a conflict with the previous patch in 5/6
- Cleanup (change fprintf() to pr_err() and an include guard) in 6/6

Thank you,

---

Masami Hiramatsu (5):
      virtio/console: Allocate scatterlist according to the current pipe size
      ftrace: Allow stealing pages from pipe buffer
      virtio/console: Wait until the port is ready on splice
      virtio/console: Add a failback for unstealable pipe buffer
      virtio/console: Add splice_write support

Yoshihiro YUNOMAE (1):
      tools: Add guest trace agent as a user tool

 drivers/char/virtio_console.c               |  198 ++--
 kernel/trace/trace.c                        |    8 -
 tools/virtio/virtio-trace/Makefile          |   14 +
 tools/virtio/virtio-trace/README            |  118
 tools/virtio/virtio-trace/trace-agent-ctl.c |  137 ++
 tools/virtio/virtio-trace/trace-agent-rw.c  |  192 +++
 tools/virtio/virtio-trace/trace-agent.c     |  270 +++
 tools/virtio/virtio-trace/trace-agent.h     |   75
 8 files changed, 985 insertions(+), 27 deletions(-)
 create mode 100644 tools/virtio/virtio-trace/Makefile
 create mode 100644 tools/virtio/virtio-trace/README
 create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.c
 create mode 100644 tools/virtio/virtio-trace/trace-agent.h

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com
Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace
Hi Amit, Sorry for the late reply. (2012/07/27 18:43), Amit Shah wrote: On (Fri) 27 Jul 2012 [17:55:11], Yoshihiro YUNOMAE wrote: Hi Amit, Thank you for commenting on our work. (2012/07/26 20:35), Amit Shah wrote: On (Tue) 24 Jul 2012 [11:36:57], Yoshihiro YUNOMAE wrote: [...] ***Just enhancement ideas*** - Support for trace-cmd - Support for 9pfs protocol - Support for non-blocking mode in QEMU There were patches long back (by me) to make chardevs non-blocking but they didn't make it upstream. Fedora carries them, if you want to try out. Though we want to converge on a reasonable solution that's acceptable upstream as well. Just that no one's working on it currently. Any help here will be appreciated. Thanks! In this case, since a guest will stop to run when host reads trace data of the guest, char device is needed to add a non-blocking mode. I'll read your patch series. Is the latest version 8? http://lists.gnu.org/archive/html/qemu-devel/2010-12/msg00035.html I suppose the latest version on-list is what you quote above. The objections to the patch series are mentioned in Anthony's mails. I'll check the mails. Hans maintains a rebased version of the patches in his tree at http://cgit.freedesktop.org/~jwrdegoede/qemu/ those patches are included in Fedora's qemu-kvm, so you can try that out if it improves performance for you. Thanks. I'll check those patches. - Make "vhost-serial" I need to understand a) why it's perf-critical, and b) why should the host be involved at all, to comment on these. a) To make collecting overhead decrease for application on a guest. (see above) b) Trace data of host kernel is not involved even if we introduce this patch set. I see, so you suggested vhost-serial only because you saw the guest stopping problem due to the absence of non-blocking code? If so, it now makes sense. I don't think we need vhost-serial in any way yet. I understood. We suggested vhost-serial as one of the ideas for improving performances. 
Other features(trace-cmd, 9pfs, and non-blocking chardev) should be supported first, I think. BTW where do you parse the trace data obtained from guests? On a remote host? It is the best that we can parse the data on a remote host in this tracing system. Existing trace-cmd can already parse it on a remote site. If we add the feature collecting event-format data(guest's debugfs has that) from guests, we can parse tracing data on a remote host as well as on a host running guests. Thank you, -- Yoshihiro YUNOMAE Software Platform Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: yoshihiro.yunomae...@hitachi.com
Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace
Hi Amit, Thank you for commenting on our work. (2012/07/26 20:35), Amit Shah wrote: On (Tue) 24 Jul 2012 [11:36:57], Yoshihiro YUNOMAE wrote: [...] Therefore, we propose a new system "virtio-trace", which uses enhanced virtio-serial and existing ring-buffer of ftrace, for collecting guest kernel tracing data. In this system, there are 5 main components: (1) Ring-buffer of ftrace in a guest - When trace agent reads ring-buffer, a page is removed from ring-buffer. (2) Trace agent in the guest - Splice the page of ring-buffer to read_pipe using splice() without memory copying. Then, the page is spliced from write_pipe to virtio without memory copying. I really like the splicing idea. Thanks. We will improve this patch set. (3) Virtio-console driver in the guest - Pass the page to virtio-ring (4) Virtio-serial bus in QEMU - Copy the page to kernel pipe (5) Reader in the host - Read guest tracing data via FIFO(named pipe) So will this be useful only if guest and host run the same kernel? I'd like to see the host kernel not being used at all -- collect all relevant info from the guest and send it out to qemu, where it can be consumed directly by apps driving the tracing. No, this patch set is used only for guest kernels, so guest and host don't need to run the same kernel. ***Evaluation*** When a host collects tracing data of a guest, the performance of using virtio-trace is compared with that of using native(just running ftrace), IVRing, and virtio-serial(normal method of read/write). Why is tracing performance-sensitive? i.e. why try to optimise this at all? To minimize effects for applications on guests when a host collects tracing data of guests. For example, we assume the situation where guests A and B are running on a host sharing I/O device. An I/O delay problem occur in guest A, but it doesn't for the requirement in guest B. 
In this case, we need to collect tracing data of guests A and B, but a usual method using network takes high load for applications of guest B even if guest B is normally running. Therefore, we try to decrease the load on guests. We also use this feature for performance analysis on production virtualization systems. [...] ***Just enhancement ideas*** - Support for trace-cmd - Support for 9pfs protocol - Support for non-blocking mode in QEMU There were patches long back (by me) to make chardevs non-blocking but they didn't make it upstream. Fedora carries them, if you want to try out. Though we want to converge on a reasonable solution that's acceptable upstream as well. Just that no one's working on it currently. Any help here will be appreciated. Thanks! In this case, since a guest will stop to run when host reads trace data of the guest, char device is needed to add a non-blocking mode. I'll read your patch series. Is the latest version 8? http://lists.gnu.org/archive/html/qemu-devel/2010-12/msg00035.html - Make "vhost-serial" I need to understand a) why it's perf-critical, and b) why should the host be involved at all, to comment on these. a) To make collecting overhead decrease for application on a guest. (see above) b) Trace data of host kernel is not involved even if we introduce this patch set. Thank you, -- Yoshihiro YUNOMAE Software Platform Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: yoshihiro.yunomae...@hitachi.com
Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace
Hi Stefan,

(2012/07/24 22:41), Stefan Hajnoczi wrote:
> On Tue, Jul 24, 2012 at 12:19 PM, Yoshihiro YUNOMAE wrote:
>>> Are you using text formatted ftrace?
>> No, currently using raw format, but we'd like to reformat it in text.
>
> Capturing the info necessary to translate numbers into symbols is one
> of the problems of host<->guest tracing so I'm curious how you handle
> this :).

Right, that is a valid concern.

> Apologies for my lack of ftrace knowledge but how useful is the raw
> tracing data on the host? How do you pretty-print it in
> human-readable form?

perf and trace-cmd can already translate raw-formatted trace data into
text-formatted trace data by using kernel information or the trace
formats under the tracing/events directory in debugfs. In the same way,
if that information of a guest is exported to a host, we can translate
the raw trace data of the guest into text trace data on the host. We
will use 9pfs to export it.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com
Re: [Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace
Hi Stefan, Thank you for commenting on our patch set. (2012/07/24 20:03), Masami Hiramatsu wrote: (2012/07/24 19:02), Stefan Hajnoczi wrote: On Tue, Jul 24, 2012 at 3:36 AM, Yoshihiro YUNOMAE wrote: The performance of each method is compared as follows: [1] Native - only recording trace data to ring-buffer on a guest [2] Virtio-trace - running a trace agent on a guest - a reader on a host opens FIFO using cat command [3] IVRing - A SystemTap script in a guest records trace data to IVRing. -- probe points are same as ftrace. [4] Virtio-serial(normal) - A reader(using cat) on a guest output trace data to a host using standard output via virtio-serial. The first time I read this I thought you are adding a new virtio-trace device. But it looks like this series really add splice support to virtio-console and that yields a big performance improvement when sending trace_pipe_raw. Yes, sorry for the confusion. Actually this is an enhancement of virtio-serial. I'm working with Yoshihiro on this feature. Guest ftrace is useful and I like this. Have you thought about controlling ftrace from the host? Perhaps a command could be added to the QEMU guest agent which basically invokes trace-cmd/perf. As you can see, guest trace-agent can be controlled via a control channel. In our scenario, host tools can control that instead of guest one. We are considering that exporting the tracing part of guest's debugfs to host via another virtio-serial channel by using 9pfs, so that the host tools can refer that. (In this scenario, guest trace-agent will also provide 9pfs server. Since it means that the agent can handle writing a special file, trace-agent can be controlled via the special file on exported debugfs.) Of course, this also requires modifying trace-cmd/perf to accept some options like guest-debugfs mount point, guest's serial channel pipe (or unix socket?), etc. However, it will be a small change. Thank you, >> Are you using text formatted ftrace? 
No, currently using raw format, but we'd like to reformat it in text. Thank you, -- Yoshihiro YUNOMAE Software Platform Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: yoshihiro.yunomae...@hitachi.com
[Qemu-devel] [RFC PATCH 5/6] virtio/console: Allocate scatterlist according to the current pipe size
From: Masami Hiramatsu Allocate scatterlist according to the current pipe size. This allows splicing bigger buffer if the pipe size has been changed by fcntl. Signed-off-by: Masami Hiramatsu Cc: Amit Shah Cc: Arnd Bergmann Cc: Greg Kroah-Hartman --- drivers/char/virtio_console.c | 23 --- 1 files changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index e49d435..f5063d5 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -229,7 +229,6 @@ struct port { bool guest_connected; }; -#define MAX_SPLICE_PAGES 32 /* This is the very early arch-specified put chars function. */ static int (*early_put_chars)(u32, const char *, int); @@ -482,15 +481,16 @@ struct buffer_token { void *buf; struct scatterlist *sg; } u; - bool sgpages; + /* If sgpages == 0 then buf is used, else sg is used */ + unsigned int sgpages; }; -static void reclaim_sg_pages(struct scatterlist *sg) +static void reclaim_sg_pages(struct scatterlist *sg, unsigned int nrpages) { int i; struct page *page; - for (i = 0; i < MAX_SPLICE_PAGES; i++) { + for (i = 0; i < nrpages; i++) { page = sg_page(&sg[i]); if (!page) break; @@ -511,7 +511,7 @@ static void reclaim_consumed_buffers(struct port *port) } while ((tok = virtqueue_get_buf(port->out_vq, &len))) { if (tok->sgpages) - reclaim_sg_pages(tok->u.sg); + reclaim_sg_pages(tok->u.sg, tok->sgpages); else kfree(tok->u.buf); kfree(tok); @@ -581,7 +581,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, tok = kmalloc(sizeof(*tok), GFP_ATOMIC); if (!tok) return -ENOMEM; - tok->sgpages = false; + tok->sgpages = 0; tok->u.buf = in_buf; sg_init_one(sg, in_buf, in_count); @@ -597,7 +597,7 @@ static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents, tok = kmalloc(sizeof(*tok), GFP_ATOMIC); if (!tok) return -ENOMEM; - tok->sgpages = true; + tok->sgpages = nents; tok->u.sg = sg; return __send_to_port(port, sg, nents, in_count, tok, 
nonblock); @@ -797,6 +797,7 @@ out: struct sg_list { unsigned int n; + unsigned int size; size_t len; struct scatterlist *sg; }; @@ -807,7 +808,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct sg_list *sgl = sd->u.data; unsigned int offset, len; - if (sgl->n == MAX_SPLICE_PAGES) + if (sgl->n == sgl->size) return 0; /* Try lock this page */ @@ -868,12 +869,12 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe, sgl.n = 0; sgl.len = 0; - sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES, -GFP_ATOMIC); + sgl.size = pipe->nrbufs; + sgl.sg = kmalloc(sizeof(struct scatterlist) * sgl.size, GFP_ATOMIC); if (unlikely(!sgl.sg)) return -ENOMEM; - sg_init_table(sgl.sg, MAX_SPLICE_PAGES); + sg_init_table(sgl.sg, sgl.size); ret = __splice_from_pipe(pipe, &sd, pipe_to_sg); if (likely(ret > 0)) ret = send_pages(port, sgl.sg, sgl.n, sgl.len, true);
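The commit message above refers to the pipe size being changed by fcntl. As a rough user-space illustration of that step only (not part of the patch), the sketch below resizes a pipe with the Linux-specific F_SETPIPE_SZ command and reads the result back with F_GETPIPE_SZ. The helper name resize_pipe() is hypothetical, and the kernel may round the requested size up or refuse it depending on system limits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): ask the kernel to resize the
 * pipe behind pipe_fd and return the capacity it actually reports, in
 * bytes, or -1 on error. The kernel may round the request up.
 */
static int resize_pipe(int pipe_fd, int wanted_bytes)
{
	assert(wanted_bytes > 0);

	if (fcntl(pipe_fd, F_SETPIPE_SZ, wanted_bytes) < 0)
		return -1;

	/* Read back the capacity the kernel settled on. */
	return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```

A splicing thread could call something like resize_pipe(pipefd[1], 128 * 1024) before its first splice(), so that more pages can be queued in the pipe per call.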
[Qemu-devel] [RFC PATCH 4/6] ftrace: Allow stealing pages from pipe buffer
From: Masami Hiramatsu

Use the generic steal operation on the pipe buffer to allow stealing
the ring buffer's read page from the pipe buffer.

Note that, without an affinity setting, this could reduce the
performance of the splice_write side of splice. The ring buffer's read
pages are allocated on the tracing node, but the splice user does not
always execute the splice write side operation on the same node. In
that case, the page will be accessed from another node. Thus, it is
strongly recommended to assign the splicing thread to the
corresponding node.

Signed-off-by: Masami Hiramatsu
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Ingo Molnar
---
 kernel/trace/trace.c |    8 +---
 1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a120f98..ae01930 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4194,12 +4194,6 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
 	buf->private = 0;
 }
 
-static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe,
-				 struct pipe_buffer *buf)
-{
-	return 1;
-}
-
 static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
@@ -4215,7 +4209,7 @@ static const struct pipe_buf_operations buffer_pipe_buf_ops = {
 	.unmap			= generic_pipe_buf_unmap,
 	.confirm		= generic_pipe_buf_confirm,
 	.release		= buffer_pipe_buf_release,
-	.steal			= buffer_pipe_buf_steal,
+	.steal			= generic_pipe_buf_steal,
 	.get			= buffer_pipe_buf_get,
 };
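The recommendation above, to keep the splicing thread close to where the ring buffer's read pages live, can be approximated from user space by pinning the thread to one CPU. The following is only a minimal sketch under that assumption: it pins to a single CPU with sched_setaffinity() rather than to a whole NUMA node, and the helper names first_allowed_cpu() and pin_calling_thread() are hypothetical, not taken from the patch set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	if (sched_getaffinity(0, sizeof(set), &set) < 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/*
 * Restrict the calling thread to a single CPU (pid 0 means "the caller").
 * Returns 0 on success, -1 with errno set on failure.
 */
static int pin_calling_thread(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}
```

A per-CPU splicing thread would call pin_calling_thread() with its own CPU number before it starts to splice, so that the stolen pages are touched from the CPU that produced them.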
[Qemu-devel] [RFC PATCH 6/6] tools: Add guest trace agent as a user tool
This patch adds a user tool, "trace agent" for sending trace data of a guest to a Host in low overhead. This agent has the following functions: - splice a page of ring-buffer to read_pipe without memory copying - splice the page from write_pipe to virtio-console without memory copying - write trace data to stdout by using -o option - controlled by start/stop orders from a Host Signed-off-by: Yoshihiro YUNOMAE --- tools/virtio/virtio-trace/Makefile | 14 + tools/virtio/virtio-trace/README| 118 tools/virtio/virtio-trace/trace-agent-ctl.c | 137 ++ tools/virtio/virtio-trace/trace-agent-rw.c | 192 +++ tools/virtio/virtio-trace/trace-agent.c | 270 +++ tools/virtio/virtio-trace/trace-agent.h | 75 6 files changed, 806 insertions(+), 0 deletions(-) create mode 100644 tools/virtio/virtio-trace/Makefile create mode 100644 tools/virtio/virtio-trace/README create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c create mode 100644 tools/virtio/virtio-trace/trace-agent.c create mode 100644 tools/virtio/virtio-trace/trace-agent.h diff --git a/tools/virtio/virtio-trace/Makefile b/tools/virtio/virtio-trace/Makefile new file mode 100644 index 000..ef3adfc --- /dev/null +++ b/tools/virtio/virtio-trace/Makefile @@ -0,0 +1,14 @@ +CC = gcc +CFLAGS = -O2 -Wall +LFLAG = -lpthread + +all: trace-agent + +.c.o: + $(CC) $(CFLAGS) $(LFLAG) -c $^ -o $@ + +trace-agent: trace-agent.o trace-agent-ctl.o trace-agent-rw.o + $(CC) $(CFLAGS) $(LFLAG) -o $@ $^ + +clean: + rm -f *.o trace-agent diff --git a/tools/virtio/virtio-trace/README b/tools/virtio/virtio-trace/README new file mode 100644 index 000..b64845b --- /dev/null +++ b/tools/virtio/virtio-trace/README @@ -0,0 +1,118 @@ +Trace Agent for virtio-trace + + +Trace agent is a user tool for sending trace data of a guest to a Host in low +overhead. 
Trace agent has the following functions: + - splice a page of ring-buffer to read_pipe without memory copying + - splice the page from write_pipe to virtio-console without memory copying + - write trace data to stdout by using -o option + - controlled by start/stop orders from a Host + +The trace agent operates as follows: + 1) Initialize all structures. + 2) Create a read/write thread per CPU. Each thread is bound to a CPU. +The read/write threads hold it. + 3) A controller thread does poll() for a start order of a host. + 4) After the controller of the trace agent receives a start order from a host, +the controller wake read/write threads. + 5) The read/write threads start to read trace data from ring-buffers and +write the data to virtio-serial. + 6) If the controller receives a stop order from a host, the read/write threads +stop to read trace data. + + +Files += + +README: this file +Makefile: Makefile of trace agent for virtio-trace +trace-agent.c: includes main function, sets up for operating trace agent +trace-agent.h: includes all structures and some macros +trace-agent-ctl.c: includes controller function for read/write threads +trace-agent-rw.c: includes read/write threads function + + +Setup += + +To use this trace agent for virtio-trace, we need to prepare some virtio-serial +I/Fs. + +1) Make FIFO in a host + virtio-trace uses virtio-serial pipe as trace data paths as to the number +of CPUs and a control path, so FIFO (named pipe) should be created as follows: + # mkdir /tmp/virtio-trace/ + # mkfifo /tmp/virtio-trace/trace-path-cpu{0,1,2,...,X}.{in,out} + # mkfifo /tmp/virtio-trace/agent-ctl-path.{in,out} + +For example, if a guest use three CPUs, the names are + trace-path-cpu{0,1,2}.{in.out} +and + agent-ctl-path.{in,out}. + +2) Set up of virtio-serial pipe in a host + Add qemu option to use virtio-serial pipe. 
+ + ##virtio-serial device## + -device virtio-serial-pci,id=virtio-serial0\ + ##control path## + -chardev pipe,id=charchannel0,path=/tmp/virtio-trace/agent-ctl-path\ + -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,\ + id=channel0,name=agent-ctl-path\ + ##data path## + -chardev pipe,id=charchannel1,path=/tmp/virtio-trace/trace-path-cpu0\ + -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel0,\ + id=channel1,name=trace-path-cpu0\ + ... + +If you manage guests with libvirt, add the following tags to domain XML files. +Then, libvirt passes the same command option to qemu. + + + + + + + + + + + + ... +Here, chardev names are restricted to trace-path-cpuX and agent-ctl-path. For +example, if a guest use three CPUs, chardev names should be trace-path-cpu0, +trace-path-cpu1, trace-path-cpu2, and agent-ctl-path. + +3) Boot the guest + You can find some
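The read/write step described in the README above (splice from the ring-buffer into a pipe, then splice from the pipe to the virtio-serial port) can be sketched in user space roughly as follows. This is an illustration under assumptions, not the code of trace-agent-rw.c: the helper name relay_by_splice(), the RELAY_CHUNK value, and the error handling are all hypothetical, and the two file descriptors stand in for the per-CPU trace source and the virtio-serial port.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define RELAY_CHUNK (64 * 1024)	/* assumed per-call limit, hypothetical */

/*
 * Move up to `total` bytes from in_fd to out_fd through an intermediate
 * pipe using splice(2), so the data is not copied through user space.
 * Returns the number of bytes moved, or -1 with errno set.
 */
static ssize_t relay_by_splice(int in_fd, int out_fd, size_t total)
{
	int pipefd[2];
	ssize_t moved = 0;
	int saved_errno = 0;

	assert(total > 0);
	if (pipe(pipefd) < 0)
		return -1;

	while ((size_t)moved < total) {
		size_t want = total - (size_t)moved;
		ssize_t staged, sent;

		if (want > RELAY_CHUNK)
			want = RELAY_CHUNK;

		/* Stage 1: source -> write end of the pipe. */
		staged = splice(in_fd, NULL, pipefd[1], NULL, want,
				SPLICE_F_MOVE);
		if (staged == 0)
			break;			/* end of input */
		if (staged < 0) {
			saved_errno = errno;
			moved = -1;
			break;
		}

		/* Stage 2: read end of the pipe -> destination. */
		while (staged > 0) {
			sent = splice(pipefd[0], NULL, out_fd, NULL,
				      (size_t)staged, SPLICE_F_MOVE);
			if (sent <= 0) {
				saved_errno = sent < 0 ? errno : EIO;
				moved = -1;
				goto out;
			}
			staged -= sent;
			moved += sent;
		}
	}
out:
	close(pipefd[0]);
	close(pipefd[1]);
	if (moved < 0)
		errno = saved_errno;
	return moved;
}
```

In the agent, one such loop would run in each per-CPU read/write thread, started and stopped by the controller thread.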
[Qemu-devel] [RFC PATCH 3/6] virtio/console: Wait until the port is ready on splice
From: Masami Hiramatsu

On splice, wait if the port is not connected or is full, just as write
does.

Signed-off-by: Masami Hiramatsu
Cc: Amit Shah
Cc: Arnd Bergmann
Cc: Greg Kroah-Hartman
---
 drivers/char/virtio_console.c |   39 +++
 1 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 911cb3e..e49d435 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -724,6 +724,26 @@ static ssize_t port_fops_read(struct file *filp, char __user *ubuf,
 	return fill_readbuf(port, ubuf, count, true);
 }
 
+static int wait_port_writable(struct port *port, bool nonblock)
+{
+	int ret;
+
+	if (will_write_block(port)) {
+		if (nonblock)
+			return -EAGAIN;
+
+		ret = wait_event_freezable(port->waitqueue,
+					   !will_write_block(port));
+		if (ret < 0)
+			return ret;
+	}
+	/* Port got hot-unplugged. */
+	if (!port->guest_connected)
+		return -ENODEV;
+
+	return 0;
+}
+
 static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 			       size_t count, loff_t *offp)
 {
@@ -740,18 +760,9 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 
 	nonblock = filp->f_flags & O_NONBLOCK;
 
-	if (will_write_block(port)) {
-		if (nonblock)
-			return -EAGAIN;
-
-		ret = wait_event_freezable(port->waitqueue,
-					   !will_write_block(port));
-		if (ret < 0)
-			return ret;
-	}
-	/* Port got hot-unplugged. */
-	if (!port->guest_connected)
-		return -ENODEV;
+	ret = wait_port_writable(port, nonblock);
+	if (ret < 0)
+		return ret;
 
 	count = min((size_t)(32 * 1024), count);
 
@@ -851,6 +862,10 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
 		.u.data = &sgl,
 	};
 
+	ret = wait_port_writable(port, filp->f_flags & O_NONBLOCK);
+	if (ret < 0)
+		return ret;
+
 	sgl.n = 0;
 	sgl.len = 0;
 	sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,
[Qemu-devel] [RFC PATCH 1/6] virtio/console: Add splice_write support
From: Masami Hiramatsu Enable to use splice_write from pipe to virtio-console port. This steals pages from pipe and directly send it to host. Note that this may accelerate only the guest to host path. Signed-off-by: Masami Hiramatsu Cc: Amit Shah Cc: Arnd Bergmann Cc: Greg Kroah-Hartman --- drivers/char/virtio_console.c | 136 +++-- 1 files changed, 128 insertions(+), 8 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index cdf2f54..fe31b2f 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -24,6 +24,8 @@ #include #include #include +#include +#include #include #include #include @@ -227,6 +229,7 @@ struct port { bool guest_connected; }; +#define MAX_SPLICE_PAGES 32 /* This is the very early arch-specified put chars function. */ static int (*early_put_chars)(u32, const char *, int); @@ -474,26 +477,52 @@ static ssize_t send_control_msg(struct port *port, unsigned int event, return 0; } +struct buffer_token { + union { + void *buf; + struct scatterlist *sg; + } u; + bool sgpages; +}; + +static void reclaim_sg_pages(struct scatterlist *sg) +{ + int i; + struct page *page; + + for (i = 0; i < MAX_SPLICE_PAGES; i++) { + page = sg_page(&sg[i]); + if (!page) + break; + put_page(page); + } + kfree(sg); +} + /* Callers must take the port->outvq_lock */ static void reclaim_consumed_buffers(struct port *port) { - void *buf; + struct buffer_token *tok; unsigned int len; if (!port->portdev) { /* Device has been unplugged. vqs are already gone. 
*/ return; } - while ((buf = virtqueue_get_buf(port->out_vq, &len))) { - kfree(buf); + while ((tok = virtqueue_get_buf(port->out_vq, &len))) { + if (tok->sgpages) + reclaim_sg_pages(tok->u.sg); + else + kfree(tok->u.buf); + kfree(tok); port->outvq_full = false; } } -static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, - bool nonblock) +static ssize_t __send_to_port(struct port *port, struct scatterlist *sg, + int nents, size_t in_count, + struct buffer_token *tok, bool nonblock) { - struct scatterlist sg[1]; struct virtqueue *out_vq; ssize_t ret; unsigned long flags; @@ -505,8 +534,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, reclaim_consumed_buffers(port); - sg_init_one(sg, in_buf, in_count); - ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC); + ret = virtqueue_add_buf(out_vq, sg, nents, 0, tok, GFP_ATOMIC); /* Tell Host to go! */ virtqueue_kick(out_vq); @@ -544,6 +572,37 @@ done: return in_count; } +static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, + bool nonblock) +{ + struct scatterlist sg[1]; + struct buffer_token *tok; + + tok = kmalloc(sizeof(*tok), GFP_ATOMIC); + if (!tok) + return -ENOMEM; + tok->sgpages = false; + tok->u.buf = in_buf; + + sg_init_one(sg, in_buf, in_count); + + return __send_to_port(port, sg, 1, in_count, tok, nonblock); +} + +static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents, + size_t in_count, bool nonblock) +{ + struct buffer_token *tok; + + tok = kmalloc(sizeof(*tok), GFP_ATOMIC); + if (!tok) + return -ENOMEM; + tok->sgpages = true; + tok->u.sg = sg; + + return __send_to_port(port, sg, nents, in_count, tok, nonblock); +} + /* * Give out the data that's requested from the buffer that we have * queued up. 
@@ -725,6 +784,66 @@ out: return ret; } +struct sg_list { + unsigned int n; + size_t len; + struct scatterlist *sg; +}; + +static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, + struct splice_desc *sd) +{ + struct sg_list *sgl = sd->u.data; + unsigned int len = 0; + + if (sgl->n == MAX_SPLICE_PAGES) + return 0; + + /* Try lock this page */ + if (buf->ops->steal(pipe, buf) == 0) { + /* Get reference and unlock page for moving */ + get_page(buf->page); + unlock_page(buf->page); + + len = min(buf->len, sd->len); + sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset); + sgl->n++; + sgl->len += len; + } + + return len; +} + +/* Faster zero-copy write by splicing */ +static
[Qemu-devel] [RFC PATCH 2/6] virtio/console: Add a failback for unstealable pipe buffer
From: Masami Hiramatsu Add a failback memcpy path for unstealable pipe buffer. If buf->ops->steal() fails, virtio-serial tries to copy the page contents to an allocated page, instead of just failing splice(). Signed-off-by: Masami Hiramatsu Cc: Amit Shah Cc: Arnd Bergmann Cc: Greg Kroah-Hartman --- drivers/char/virtio_console.c | 28 +--- 1 files changed, 25 insertions(+), 3 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index fe31b2f..911cb3e 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -794,7 +794,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct splice_desc *sd) { struct sg_list *sgl = sd->u.data; - unsigned int len = 0; + unsigned int offset, len; if (sgl->n == MAX_SPLICE_PAGES) return 0; @@ -807,9 +807,31 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, len = min(buf->len, sd->len); sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset); - sgl->n++; - sgl->len += len; + } else { + /* Failback to copying a page */ + struct page *page = alloc_page(GFP_KERNEL); + char *src = buf->ops->map(pipe, buf, 1); + char *dst; + + if (!page) + return -ENOMEM; + dst = kmap(page); + + offset = sd->pos & ~PAGE_MASK; + + len = sd->len; + if (len + offset > PAGE_SIZE) + len = PAGE_SIZE - offset; + + memcpy(dst + offset, src + buf->offset, len); + + kunmap(page); + buf->ops->unmap(pipe, buf, src); + + sg_set_page(&(sgl->sg[sgl->n]), page, len, offset); } + sgl->n++; + sgl->len += len; return len; }
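The offset and length handling in the copy path above can be tried out in user space. The sketch below mirrors only that arithmetic: the in-page offset is taken from the position (pos & ~PAGE_MASK in the patch, which is the same as pos & (PAGE_SIZE - 1)), and the length is clamped so that the copy never runs past the end of the page. The 4096-byte page size and the names SKETCH_PAGE_SIZE, struct copy_span and clamp_to_page() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */

/* Where in the destination page the copy starts and how long it is. */
struct copy_span {
	size_t offset;
	size_t len;
};

/*
 * Mirror of the clamp in the copy path: keep the in-page part of `pos`
 * as the offset, then shorten `want` so offset + len stays in one page.
 */
static struct copy_span clamp_to_page(unsigned long long pos, size_t want)
{
	struct copy_span span;

	span.offset = (size_t)(pos & (SKETCH_PAGE_SIZE - 1));
	span.len = want;
	if (span.len > SKETCH_PAGE_SIZE - span.offset)
		span.len = SKETCH_PAGE_SIZE - span.offset;

	assert(span.offset + span.len <= SKETCH_PAGE_SIZE);
	return span;
}
```

For instance, a request for 200 bytes at position 4000 is cut down to the 96 bytes that remain in the page.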
[Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace
Hi All,

The following patch set provides a low-overhead system that lets a host collect kernel tracing data from guests in a virtualization environment.

A guest OS generally shares some devices with other guests or the host, so the cause of a problem seen in one guest may lie in another guest or in the host. When a problem occurs in a virtualization environment, we therefore need to collect tracing data from a number of guests and from the host. One way to do that is to collect the guests' tracing data on the host. The network is generally used for this. However, this puts a high load on guest applications that use network I/O, because the data has to pass through many network stack layers. A communication method that collects the data without using the network is therefore needed.

We submitted a patch set for "IVRing", a ring-buffer driver built on Inter-VM shared memory (IVShmem), to LKML http://lwn.net/Articles/500304/ this June. IVRing and the IVRing reader communicate through POSIX shared memory without using the network, which gives a low-overhead system for collecting guest tracing data. However, that patch set has the following problems:
 - it uses IVShmem instead of virtio
 - it creates a new ring-buffer instead of using the existing ring-buffer in the kernel
 - scalability
   -- no support for SMP environments
   -- buffer size limitation
   -- no support for live migration (this may be difficult to realize)

Therefore, we propose a new system "virtio-trace", which uses enhanced virtio-serial and the existing ring-buffer of ftrace, for collecting guest kernel tracing data. In this system, there are 5 main components:
 (1) Ring-buffer of ftrace in a guest
     - When the trace agent reads the ring-buffer, a page is removed from the ring-buffer.
 (2) Trace agent in the guest
     - Splice the page of the ring-buffer to read_pipe using splice() without memory copying. Then, the page is spliced from write_pipe to virtio without memory copying.
 (3) Virtio-console driver in the guest
     - Pass the page to virtio-ring
 (4) Virtio-serial bus in QEMU
     - Copy the page to kernel pipe
 (5) Reader in the host
     - Read guest tracing data via FIFO(named pipe)

***Evaluation***
When a host collects tracing data of a guest, the performance of virtio-trace is compared with that of native (just running ftrace), IVRing, and virtio-serial (the normal read/write method).

The overview of this evaluation is as follows:
 (a) A guest on KVM is prepared.
     - One physical CPU is dedicated to the guest as a virtual CPU (VCPU).
 (b) The guest starts to write tracing data to the ring-buffer of ftrace.
     - The probe points are all trace points of sched, timer, and kmem.
 (c) While trace data is being written, dhrystone 2 in UNIX bench is executed as a benchmark tool in the guest.
     - Dhrystone 2 measures system performance as a score by repeating integer arithmetic.
     - Since a higher score means better system performance, a score lower than in the bare environment indicates that some operation disturbs the integer arithmetic. We therefore define the overhead of transporting trace data as follows:
         OVERHEAD = (1 - SCORE_OF_A_METHOD/NATIVE_SCORE) * 100.

The performance of each method is compared as follows:
 [1] Native
     - only recording trace data to ring-buffer on a guest
 [2] Virtio-trace
     - running a trace agent on a guest
     - a reader on a host opens FIFO using cat command
 [3] IVRing
     - A SystemTap script in a guest records trace data to IVRing.
       -- probe points are same as ftrace.
 [4] Virtio-serial(normal)
     - A reader(using cat) on a guest outputs trace data to a host using standard output via virtio-serial.
Other information is as follows:
 - host
   kernel: 3.3.7-1 (Fedora16)
   CPU: Intel Xeon x5660@2.80GHz(12core)
   Memory: 48GB
 - guest (only booting one guest)
   kernel: 3.5.0-rc4+ (Fedora16)
   CPU: 1VCPU(dedicated)
   Memory: 1GB

The results of the 3 methods, relative to the native environment, are as follows:

                     Scores        overhead against [0] Native
 [0] Native:         28807569.5    -
 [1] Virtio-trace:   28685049.5    0.43%
 [2] IVRing:         28418595.5    1.35%
 [3] Virtio-serial:  13262258.7    53.96%

***Just enhancement ideas***
 - Support for trace-cmd
 - Support for 9pfs protocol
 - Support for non-blocking mode in QEMU
 - Make "vhost-serial"

Thank you,

---

Masami Hiramatsu (5):
      virtio/console: Allocate scatterlist according to the current pipe size
      ftrace: Allow stealing pages from pipe buffer
      virtio/console: Wait until the port is ready on splice
      virtio/console: Add a failback for unstealable pipe buffer
      virtio/console: Add splice_write support

Yoshihiro YUNOMAE (1):
      tools: Add guest trace agent as a user tool

 drivers/char/virtio_console.c | 198
[Qemu-devel] [RFC PATCH 1/2] ivring: Add a ring-buffer driver on IVShmem
This patch adds a ring-buffer driver for the IVShmem device, a virtual RAM device in QEMU. This driver can be used as a ring-buffer for kernel logging or tracing of a guest OS, with data recorded by kernel programming or SystemTap.

This ring-buffer driver is implemented very simply. The first 4kB of the shared memory region is the control structure of the ring-buffer. This region stores the values for managing the ring-buffer, such as the bits and mask of the whole memory size, the writing position, and the threshold value for notifying a reader on the host OS. The reader uses this region to find the writing position. The usable memory region for recording data is therefore "total memory size - 4kB".

This ring-buffer driver records any data from the start to the end of the writable memory region.

When the written size exceeds a threshold value, this driver can notify a reader to read the data by using writel(). As of the later patch in this series, the reader does not have any function for receiving the notification. This notification feature will be used in the near future.

When a writer records data in this ring-buffer, a spinlock is used to avoid contention among writers in a multi-CPU environment. To avoid the spinlock, a lockless ring-buffer like that of ftrace, with one ring-buffer per CPU, will be implemented in the near future.
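
The layout described above (a 4kB control region followed by the data region, a writing position, and a notification threshold) can be sketched in plain user-space C. This is a minimal sketch under stated assumptions, not the driver's code: the struct and function names (ring_ctrl, ring_init, ring_write) and the since_notify field are hypothetical, the real control structure lives in ivring.h, the wrap-around at the end of the data region is an assumption, and the spinlock and the writel() notification are left out.

```c
#include <stdint.h>
#include <string.h>

#define CTRL_SIZE 4096u /* the first 4kB holds the control structure */

/* Hypothetical control header; the real one is defined in ivring.h. */
struct ring_ctrl {
    uint32_t pos;          /* next writing position in the data region */
    uint32_t threshold;    /* written bytes that trigger a notification */
    uint32_t since_notify; /* bytes written since the last notification */
};

struct ring {
    struct ring_ctrl *ctrl; /* start of the shared memory region */
    uint8_t *data;          /* just past the 4kB control region */
    uint32_t data_size;     /* total memory size - 4kB */
};

void ring_init(struct ring *r, void *shmem, uint32_t total_size)
{
    r->ctrl = shmem;
    r->data = (uint8_t *)shmem + CTRL_SIZE;
    r->data_size = total_size - CTRL_SIZE;
    memset(r->ctrl, 0, sizeof(*r->ctrl));
}

/*
 * Copy len bytes into the ring, wrapping at the end of the data region
 * (the wrap-around is an assumption of this sketch).
 * Returns 1 when the threshold was reached (the point where the driver
 * would notify the reader), 0 otherwise, and -1 if len does not fit.
 */
int ring_write(struct ring *r, const void *buf, uint32_t len)
{
    const uint8_t *src = buf;
    uint32_t pos = r->ctrl->pos;
    uint32_t first;

    if (len > r->data_size)
        return -1;

    first = r->data_size - pos; /* room left before the wrap point */
    if (first > len)
        first = len;
    memcpy(r->data + pos, src, first);
    memcpy(r->data, src + first, len - first);
    r->ctrl->pos = (pos + len) % r->data_size;

    r->ctrl->since_notify += len;
    if (r->ctrl->threshold && r->ctrl->since_notify >= r->ctrl->threshold) {
        r->ctrl->since_notify = 0;
        return 1;
    }
    return 0;
}
```

A reader on the host side would only need ctrl->pos to know how far the writer has advanced, which matches the role of the control region described above.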
Signed-off-by: Yoshihiro YUNOMAE
Signed-off-by: Masami Hiramatsu
Signed-off-by: Akihiro Nagai
Cc: Greg Kroah-Hartman
Cc: Ohad Ben-Cohen
Cc: Linus Walleij
Cc: MyungJoo Ham
Cc: Rusty Russell
Cc: Joerg Roedel
Cc: Grant Likely
Cc: linux-ker...@vger.kernel.org
Cc: Cam Macdonell
Cc: qemu-devel@nongnu.org
Cc: system...@sourceware.org
---
 drivers/Kconfig          |    1
 drivers/Makefile         |    1
 drivers/ivshmem/Kconfig  |    9 +
 drivers/ivshmem/Makefile |    5
 drivers/ivshmem/ivring.c |  551 ++
 drivers/ivshmem/ivring.h |   77 ++
 6 files changed, 644 insertions(+), 0 deletions(-)
 create mode 100644 drivers/ivshmem/Kconfig
 create mode 100644 drivers/ivshmem/Makefile
 create mode 100644 drivers/ivshmem/ivring.c
 create mode 100644 drivers/ivshmem/ivring.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index bfc9186..e01adcd 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -148,4 +148,5 @@ source "drivers/iio/Kconfig"
 source "drivers/vme/Kconfig"
+source "drivers/ivshmem/Kconfig"
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 2ba29ff..1ebdd03 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -23,6 +23,7 @@ obj-y += amba/
 # really early.
 obj-$(CONFIG_DMA_ENGINE) += dma/
+obj-$(CONFIG_IVRING_MANAGER) += ivshmem/
 obj-$(CONFIG_VIRTIO) += virtio/
 obj-$(CONFIG_XEN) += xen/
diff --git a/drivers/ivshmem/Kconfig b/drivers/ivshmem/Kconfig
new file mode 100644
index 000..e84364a
--- /dev/null
+++ b/drivers/ivshmem/Kconfig
@@ -0,0 +1,9 @@
+#
+# IVShmem support drivers
+#
+
+config IVRING_MANAGER
+	tristate "IVRing management driver"
+	help
+	  It allows IVShmem, a virtual PCI RAM device in QEMU, to use as a
+	  ring-buffer for tracing of a guest.
diff --git a/drivers/ivshmem/Makefile b/drivers/ivshmem/Makefile
new file mode 100644
index 000..e725f8c
--- /dev/null
+++ b/drivers/ivshmem/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for IVShmem drivers
+#
+
+obj-$(CONFIG_IVRING_MANAGER) += ivring.o
diff --git a/drivers/ivshmem/ivring.c b/drivers/ivshmem/ivring.c
new file mode 100644
index 000..5cbcfb6
--- /dev/null
+++ b/drivers/ivshmem/ivring.c
@@ -0,0 +1,551 @@
+/*
+ * Ring buffer on IVShmem Driver
+ *
+ * (C) 2012 Hitachi, Ltd.
+ * Written by Hitachi Yokohama Research Laboratory.
+ *
+ * Created by Masami Hiramatsu
+ *            Akihiro Nagai
+ *            Yoshihiro Yunomae
+ * based on UIOIVShmem Driver, http://www.gitorious.org/nahanni/guest-code,
+ * (C) 2009 Cam Macdonell
+ * based on Hilscher CIF card driver (C) 2007 Hans J. Koch
+ *
+ * Licensed under GPL version 2 only.
+ *
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "./ivring.h"
+
+
+#define IVSHM_OFFS_INTRMASK	0
+#define IVSHM_OFFS_INTRSTATUS	4
+#define IVSHM_OFFS_IVPOSITION	8
+#define IVSHM_OFFS_DOORBELL	12
+
+#define MSIX_NAMEBUF_SIZE	128
+#define DEFAULT_NR_VECTORS	4
+
+#define IVRING_DEVNAME "ivring"
+
+struct ivring_mem {
+	unsigned long addr;
+	unsigned long size;
+	void __iomem *ioaddr;
+};
+
+struct ivring_info {
+	struct pci_dev *dev;
+	int irq;
+	struct ivring_mem mem[2]; /* 0:control, 1:shmem */
+	struct msix_entry *msix_entries;
+	char (*msix_names)[MSIX_NAMEBUF_SIZE];
+	int nvectors;
+	int posn;
+	struct ivring_hdr *hdr;
+};
+
+#
Re: [Qemu-devel] [RFC PATCH 1/2] ivring: Add a ring-buffer driver on IVShmem
(2012/06/05 22:10), Borislav Petkov wrote:
On Tue, Jun 05, 2012 at 10:01:17PM +0900, Yoshihiro YUNOMAE wrote:

This patch adds a ring-buffer driver for IVShmem device, a virtual RAM device in QEMU. This driver can be used as a ring-buffer for kernel logging or tracing of a guest OS by recording kernel programing or SystemTap.

This ring-buffer driver is implemented very simple. First 4kB of shared memory region is control structure of a ring-buffer. In this region, some values for managing the ring-buffer is stored such as bits and mask of whole memory size, writing position, threshold value for notification to a reader on a host OS. This region is used by the reader to know writing position. Then, "total memory size - 4kB" equals to usable memory region for recording data.

This ring-buffer driver records any data from start to end of the writable memory region.

When writing size exceeds a threshold value, this driver can notify a reader to read data by using writel(). As this later patch, reader does not have any function for receiving the notification. This notification feature will be used near the future.

As a writer records data in this ring-buffer, spinlock function is used to avoid competing by some writers in multi CPU environment. Not to use spinlock, lockless ring-buffer like as ftrace and one ring-buffer one CPU will be implemented near the future.

Yet another ring buffer?

Yes, unfortunately...

We already have an ftrace and perf ring buffer, can't you use one of those?

No, because those do not support allocating a buffer from a PCI memory device, nor passing the control structure over it.

However, indeed, we understand what you would like to say. This series is just an RFC, and we'd like to ask who is interested in guest tracing and how it should be implemented.

 - no more ring buffers: enhance the perf/ftrace ring buffer to enable allocating buffers on shared memory.

Other comments are welcome.

Thank you,

--
Yoshihiro YUNOMAE
Software Platform Research Dept.
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com
[Qemu-devel] [RFC PATCH 2/2] ivring: Add a ring-buffer reader tool
This patch adds a reader tool for IVRing. This tool is used on a host OS and reads data written by a guest. The reader reads data from the ring-buffer via POSIX shared memory, so the data is read without memory copying between the guest and the host. To read data written by a guest, the -s option must specify the same shared memory object as the one used by IVShmem.

The following options are available:
 -f: output log file
 -h: show usage
 -m: shared memory size in MB
 -s: shared memory object path
 -N: number of log files
 -S: log file size in MB

Example:
 ./ivring_reader -m 2 -f /tmp/log.txt -S 10 -N 2 -s /ivshmem

In this case, two log files of 10MB each are output as /tmp/log.txt.0 and /tmp/log.txt.1.

Signed-off-by: Yoshihiro YUNOMAE
Signed-off-by: Masami Hiramatsu
Signed-off-by: Akihiro Nagai
Cc: Borislav Petkov
Cc: Arnaldo Carvalho de Melo
Cc: linux-ker...@vger.kernel.org
Cc: Cam Macdonell
Cc: qemu-devel@nongnu.org
Cc: system...@sourceware.org
---
 tools/Makefile                |    1
 tools/ivshmem/Makefile        |   19 ++
 tools/ivshmem/ivring_reader.c |  516 +
 tools/ivshmem/ivring_reader.h |   15 +
 tools/ivshmem/pr_msg.c        |  125 ++
 tools/ivshmem/pr_msg.h        |   19 ++
 6 files changed, 695 insertions(+), 0 deletions(-)
 create mode 100644 tools/ivshmem/Makefile
 create mode 100644 tools/ivshmem/ivring_reader.c
 create mode 100644 tools/ivshmem/ivring_reader.h
 create mode 100644 tools/ivshmem/pr_msg.c
 create mode 100644 tools/ivshmem/pr_msg.h

diff --git a/tools/Makefile b/tools/Makefile
index 3ae4394..3edf16a 100644
--- a/tools/Makefile
+++ b/tools/Makefile
@@ -5,6 +5,7 @@ help:
 	@echo ''
 	@echo ' cpupower - a tool for all things x86 CPU power'
 	@echo ' firewire - the userspace part of nosy, an IEEE-1394 traffic sniffer'
+	@echo ' ivshmem - the userspace tool for ivshmem device'
 	@echo ' lguest - a minimal 32-bit x86 hypervisor'
 	@echo ' perf - Linux performance measurement and analysis tool'
 	@echo ' selftests - various kernel selftests'
diff --git a/tools/ivshmem/Makefile b/tools/ivshmem/Makefile
new file mode 100644
index 000..287508e
--- /dev/null
+++ b/tools/ivshmem/Makefile
@@ -0,0 +1,19 @@
+CC = gcc
+CFLAGS = -O1 -Wall -Werror -g
+LIBS = -lrt
+
+# makefile to build ivshmem tools
+
+all: ivring_reader
+
+.c.o:
+	$(CC) $(CFLAGS) -c $^ -o $@
+
+ivring_reader: ivring_reader.o pr_msg.o
+	$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+install: ivring_reader
+	install ivring_reader /usr/local/bin/
+
+clean:
+	rm -f *.o ivring_reader
diff --git a/tools/ivshmem/ivring_reader.c b/tools/ivshmem/ivring_reader.c
new file mode 100644
index 000..d61e9c9
--- /dev/null
+++ b/tools/ivshmem/ivring_reader.c
@@ -0,0 +1,516 @@
+/*
+ * A trace reader for inter-VM shared memory
+ *
+ * (C) 2012 Hitachi, Ltd.
+ * Written by Hitachi Yokohama Research Laboratory.
+ *
+ * Created by Masami Hiramatsu
+ *            Akihiro Nagai
+ *            Yoshihiro Yunomae
+ * based on IVShmem Server, http://www.gitorious.org/nahanni/guest-code,
+ * (C) 2009 Cam Macdonell
+ *
+ * Licensed under GPL version 2 only.
+ *
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "../../drivers/ivshmem/ivring.h"
+#include "pr_msg.h"
+#include "ivring_reader.h"
+
+/* default pathes */
+#define DEFAULT_SHM_SIZE (1024*1024)
+#define BUFFER_SIZE 4096
+
+static int global_term;
+static int global_outfd;
+static char *global_log_basename;
+static ssize_t global_log_rotate_size;
+static int global_log_rotate_num;
+#define log_rotate_mode() (global_log_rotate_size && global_log_rotate_num)
+
+/* Handle SIGTERM/SIGINT/SIGQUIT to exit */
+void term_handler(int sig)
+{
+	global_term = sig;
+	pr_info("Receive an interrupt %d\n", sig);
+}
+
+/* Utilities */
+static void *zalloc(size_t size)
+{
+	void *ret = malloc(size);
+	if (ret)
+		memset(ret, 0, size);
+	else
+		pr_perror("malloc");
+	return ret;
+}
+
+static u32 __fls32(u32 word)
+{
+	int num = 31;
+	if (!(word & (~0ul << 16))) {
+		num -= 16;
+		word <<= 16;
+	}
+	if (!(word & (~0ul << (32-8)))) {
+		num -= 8;
+		word <<= 8;
+	}
+	if (!(word & (~0ul << (32-4)))) {
+		num -= 4;
+		word <<= 4;
+	}
+	if (!(word & (~0ul << (32-2)))) {
+		num -= 2;
+		word <<= 2;
+	}
+	if (!(word & (~0ul << (32-1))))
+		num -= 1;
+	return num;
+}
+
+/* IVR
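
The -S and -N options described in the patch description above can be illustrated with a small sketch of how an output file could be chosen. This is not taken from the patch: the helper names log_file_index() and log_file_name() are hypothetical, and wrapping back to file 0 after the last file is an assumption. Only the "<basename>.<index>" naming follows the example in the description.

```c
#include <stdio.h>

/*
 * Hypothetical helper: index of the log file that receives the byte at
 * offset total_written, given the per-file size (-S, here in bytes) and
 * the number of files (-N). Wrapping to file 0 is an assumption.
 */
int log_file_index(unsigned long long total_written,
                   unsigned long long rotate_size, int rotate_num)
{
    return (int)((total_written / rotate_size) %
                 (unsigned long long)rotate_num);
}

/* Build "<basename>.<index>", e.g. /tmp/log.txt.0 and /tmp/log.txt.1. */
int log_file_name(char *out, size_t out_len, const char *basename, int index)
{
    return snprintf(out, out_len, "%s.%d", basename, index);
}
```

With -S 10 -N 2, the first 10MB would go to index 0 and the next 10MB to index 1 under these assumptions.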
[Qemu-devel] [RFC PATCH 0/2] ivring: Add IVRing driver
Hi All,

The following patch set provides a new communication path, "IVRing", which lets a host collect kernel log or tracing data of guests without using the network in a virtualization environment.

The network is generally used to collect log or tracing data after the data has been output as a file. However, since I/O resources such as network or block devices are shared with other guests, these resources should not be used for logging or tracing. Moreover, using the network puts a high load on guest applications that use network I/O, because the data passes through many network stack layers. A communication method for collecting the data without using I/O resources is therefore needed.

There are two requirements for collecting kernel log or tracing data by a host:
 (1) To minimize the overhead on user applications in a guest
     - not using I/O resources
 (2) To implement a recording buffer like a ring
     - keep on recording log data or trace data

To meet these requirements, a ring-buffer implemented as a device driver for guest OSs, called IVRing, is built on the Inter-VM shared memory (IVShmem) device. IVShmem, implemented in QEMU, is a virtual PCI RAM device and uses POSIX shared memory on a host. This device was originally used as a virtual device for low-overhead communication between two guests. Here, on the other hand, IVShmem is used as a communication path between a guest and a host for collecting data. IVRing is a buffer for logging or tracing data in a guest, and IVRing-reader, which opens the shared memory used as IVRing on a host, reads the data without memory copying between the guest and the host. Thus, the two requirements for collecting kernel log or tracing data are met.

We will talk about IVRing at LinuxCon Japan 2012:
https://events.linuxfoundation.org/events/linuxcon-japan
Title: Low-Overhead Ring-Buffer of Kernel Tracing & Tracing Across Host OS and Guest OS
Speakers: Yoshihiro Yunomae and Akihiro Nagai
You can download our slides about IVRing from the schedule page.
***Evaluation***
When a host collects tracing data of a guest, the performance of IVRing is compared with that of using the network.

The overview of this evaluation is as follows:
 (a) A guest on KVM is prepared.
     - One physical CPU is dedicated to the guest as a virtual CPU (VCPU).
 (b) The guest starts to write tracing data to a SystemTap buffer.
     - The probe points of SystemTap are all trace points of sched, timer, and kmem.
 (c) The tracing data is either recorded to IVRing, which shares memory with a host, or sent to a host via the network.
     - 3 patterns, IVRing, NFS, and SSH, are measured. Each method is explained later.
 (d) While trace data is being written, dhrystone 2 in UNIX bench is executed as a benchmark tool in the guest.
     - Dhrystone 2 measures system performance by repeating integer arithmetic and reports it as a score.
     - Since a higher score means better system performance, a score lower than that of the bare environment indicates that some operation disturbs the integer arithmetic. We therefore define the overhead of transporting trace data as follows:
         OVERHEAD = (1 - SCORE_OF_A_METHOD/BARE_SCORE) * 100.

The performance of each method is compared as follows:
 [1] IVRing
     - A SystemTap script in a guest records trace data to IVRing.
     - An IVRing-reader on a host reads the data.
 [2] NFS
     - A directory in a guest is shared with a directory in a host via NFS.
     - A SystemTap script in a guest records trace data to a file in the directory.
 [3] SSH
     - A SystemTap script in a guest outputs trace data to a host using standard output via SSH.
Other information is as follows:
 - host
   kernel: 3.3.1-5 (Fedora16)
   CPU: Intel Xeon x5660@2.80GHz(6core)
   Memory: 50GB
 - guest (only booting one guest)
   kernel: 3.4.0+ (Fedora16)
   CPU: 1VCPU(dedicated)
   Memory: 2GB

The results of the 3 patterns, relative to the bare environment, are as follows:

                 Scores      overhead against [0] Bare
 [0] Bare        29043600    -
 [1] IVRing      28565398    1.6[%]
 [2] NFS         22000508    24.3[%]
 [3] SSH         10246792    64.7[%]

The overhead of IVRing is much lower than that of the other methods, which use the network. This is because the IVRing method only records trace data to a ring-buffer. The other methods, on the other hand, read trace data from a SystemTap buffer into userland and send the data to a host via the network. Therefore, using IVRing minimizes the overhead of transporting trace data from a guest to a host.

***How to use***
Here, how to use IVRing and IVRing-reader is briefly described.

 1. Prepare any distribution that includes a qemu-kvm binary after version 0.13.0. IVShmem was pushed to qemu-kvm mainline after version 0.13.0. The latest Fedora or Ubuntu can be used.

 2. Boot a guest that has the IVRing driver installed, with a device option. The device option is needed as follows: -