Re: [PATCH RFC] perf: Go back to by-hand proc mmap parsing and kill timeout

2018-10-31 Thread Arnaldo Carvalho de Melo
Em Wed, Oct 31, 2018 at 12:19:10PM -0400, Liang, Kan escreveu:
> 
> 
> On 10/31/2018 9:21 AM, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Oct 30, 2018 at 10:54:16PM -0700, David Miller escreveu:
> > > 
> > > This goes back to by-hand parsing of the proc mmap file, and removes
> > > the timeout.
> > > 
> > > In my measurements this makes the parsing about twice as fast.
> > > 
> > > Profiling thread synthesizing shows that most of the time is spent in
> > > the sscanf() call.
> > > 
> 
> I did some performance test to evaluate the time cost on proc mmap parsing
> with different options/patches.
> 
> The benchmark I used is case-small-allocs from vm-scalability.
> 
> I did the test on two machines.
> M1: Skylake server (224 CPUs)
> M2: Knights Landing (272 CPUs)
> 
> With three different options for perf top.
> P1: baseline test. No David's patch. "perf top"
> P2: No David's patch. Set a big timeout. The proc map processing will not be
> truncated. "perf top --proc-map-timeout 1"
> P3: With David's patch. "perf top"
> 
> The elapsed time is the time cost for synthesize_threads.
> 
> On M1
>   Elapsed time (Seconds)
> P1~6
> P28-24*
> P36-11*
> *The time cost is not stable. I did several tests. Here lists a range of the
> results.
> 
> On M2
>   Elapsed time (Seconds)
> P141
> P2282
> P3176
> I only did test one time.
> 
> The conclusion,
> - David's patch significantly improve the time cost of proc mmap parsing.
> - On some platform, especially Knights Landing, the time cost is still huge
> even with David's patch.
> - I think timeout is still needed. We have to give user warning/hints if the
> processing time is too long. For example, with P3 on M2, I can only see a
> black screen during the ~3 mins processing time.
 

That is why I mentioned that:

> I'm going to split this patch into two, one that makes it go back to
> by-hand proc mmap parsing, and the other killing the mmap-timeout.

I.e. the improvement is welcome, the removal of the timeout is a
separate issue :-)

- Arnaldo

> Thanks,
> Kan
> 
> > > Processing samples is critical for perf top and perf record -a startup
> > > performance, and perf's ability to keep up with high sample rates.
> > 
> > I'm going to split this patch into two, one that makes it go back to
> > by-hand proc mmap parsing, and the other killing the mmap-timeout.
> > 
> > Thanks,
> > 
> > - Arnaldo
> > > Signed-off-by: David S. Miller 
> > > 
> > > diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
> > > index 2b1ef704169f..d4b5177c3c22 100644
> > > --- a/tools/perf/builtin-kvm.c
> > > +++ b/tools/perf/builtin-kvm.c
> > > @@ -1364,8 +1364,6 @@ static int kvm_events_live(struct perf_kvm_stat 
> > > *kvm,
> > >   "show events other than"
> > >   " HLT (x86 only) or Wait state (s390 only)"
> > >   " that take longer than duration usecs"),
> > > - OPT_UINTEGER(0, "proc-map-timeout", &kvm->opts.proc_map_timeout,
> > > - "per thread proc mmap processing timeout in 
> > > ms"),
> > >   OPT_END()
> > >   };
> > >   const char * const live_usage[] = {
> > > @@ -1394,7 +1392,6 @@ static int kvm_events_live(struct perf_kvm_stat 
> > > *kvm,
> > >   kvm->opts.target.uses_mmap = false;
> > >   kvm->opts.target.uid_str = NULL;
> > >   kvm->opts.target.uid = UINT_MAX;
> > > - kvm->opts.proc_map_timeout = 500;
> > >   symbol__init(NULL);
> > >   disable_buildid_cache();
> > > @@ -1453,8 +1450,7 @@ static int kvm_events_live(struct perf_kvm_stat 
> > > *kvm,
> > >   perf_session__set_id_hdr_size(kvm->session);
> > >   
> > > ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
> > >   machine__synthesize_threads(&kvm->session->machines.host, 
> > > &kvm->opts.target,
> > > - kvm->evlist->threads, false,
> > > - kvm->opts.proc_map_timeout, 1);
> > > + kvm->evlist->threads, false, 1);
> > >   err = kvm_live_open_events(kvm);
> > >   if (err)
> > >   goto out;
> > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > index 0980dfe3396b..4259ea42a8ef 100644
> > > --- a/tools/perf/builtin-record.c
> > > +++ b/tools/perf/builtin-record.c
> > > @@ -633,8 +633,7 @@ static int record__synthesize_workload(struct record 
> > > *rec, bool tail)
> > >   err = perf_event__synthesize_thread_map(&rec->tool, thread_map,
> > >
> > > process_synthesized_event,
> > >
> > > &rec->session->machines.host,
> > > -  rec->opts.sample_address,
> > > -  rec->opts.proc_map_timeout);
> > > +  

Re: [PATCH RFC] perf: Go back to by-hand proc mmap parsing and kill timeout

2018-10-31 Thread Liang, Kan




On 10/31/2018 9:21 AM, Arnaldo Carvalho de Melo wrote:

Em Tue, Oct 30, 2018 at 10:54:16PM -0700, David Miller escreveu:


This goes back to by-hand parsing of the proc mmap file, and removes
the timeout.

In my measurements this makes the parsing about twice as fast.

Profiling thread synthesizing shows that most of the time is spent in
the sscanf() call.



I did some performance test to evaluate the time cost on proc mmap 
parsing with different options/patches.


The benchmark I used is case-small-allocs from vm-scalability.

I did the test on two machines.
M1: Skylake server (224 CPUs)
M2: Knights Landing (272 CPUs)

With three different options for perf top.
P1: baseline test. No David's patch. "perf top"
P2: No David's patch. Set a big timeout. The proc map processing will 
not be truncated. "perf top --proc-map-timeout 1"

P3: With David's patch. "perf top"

The elapsed time is the time cost for synthesize_threads.

On M1
Elapsed time (Seconds)
P1  ~6
P2  8-24*
P3  6-11*
*The time cost is not stable. I did several tests. Here lists a range of 
the results.


On M2
Elapsed time (Seconds)
P1  41
P2  282
P3  176
I only did test one time.

The conclusion,
- David's patch significantly improve the time cost of proc mmap parsing.
- On some platform, especially Knights Landing, the time cost is still 
huge even with David's patch.
- I think timeout is still needed. We have to give user warning/hints if 
the processing time is too long. For example, with P3 on M2, I can only 
see a black screen during the ~3 mins processing time.


Thanks,
Kan


Processing samples is critical for perf top and perf record -a startup
performance, and perf's ability to keep up with high sample rates.


I'm going to split this patch into two, one that makes it go back to
by-hand proc mmap parsing, and the other killing the mmap-timeout.

Thanks,

- Arnaldo
  

Signed-off-by: David S. Miller 

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index 2b1ef704169f..d4b5177c3c22 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -1364,8 +1364,6 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
"show events other than"
" HLT (x86 only) or Wait state (s390 only)"
" that take longer than duration usecs"),
-   OPT_UINTEGER(0, "proc-map-timeout", &kvm->opts.proc_map_timeout,
-   "per thread proc mmap processing timeout in 
ms"),
OPT_END()
};
const char * const live_usage[] = {
@@ -1394,7 +1392,6 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
kvm->opts.target.uses_mmap = false;
kvm->opts.target.uid_str = NULL;
kvm->opts.target.uid = UINT_MAX;
-   kvm->opts.proc_map_timeout = 500;
  
  	symbol__init(NULL);

disable_buildid_cache();
@@ -1453,8 +1450,7 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
perf_session__set_id_hdr_size(kvm->session);
ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
machine__synthesize_threads(&kvm->session->machines.host, 
&kvm->opts.target,
-   kvm->evlist->threads, false,
-   kvm->opts.proc_map_timeout, 1);
+   kvm->evlist->threads, false, 1);
err = kvm_live_open_events(kvm);
if (err)
goto out;
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0980dfe3396b..4259ea42a8ef 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -633,8 +633,7 @@ static int record__synthesize_workload(struct record *rec, 
bool tail)
err = perf_event__synthesize_thread_map(&rec->tool, thread_map,
 process_synthesized_event,
 &rec->session->machines.host,
-rec->opts.sample_address,
-rec->opts.proc_map_timeout);
+rec->opts.sample_address);
thread_map__put(thread_map);
return err;
  }
@@ -848,8 +847,7 @@ static int record__synthesize(struct record *rec, bool tail)
}
  
  	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,

-   process_synthesized_event, 
opts->sample_address,
-   opts->proc_map_timeout, 1);
+   process_synthesized_event, 
opts->sample_address, 1);
  out:
return err;
  }
@@ -1521,7 +1519,6 @@ static struct record record = {
.uses_mmap   = true,
.default_per_cpu = true,
},
-   .proc_map_timeout = 500,
},

Re: [PATCH RFC] perf: Go back to by-hand proc mmap parsing and kill timeout

2018-10-31 Thread Arnaldo Carvalho de Melo
Em Tue, Oct 30, 2018 at 10:54:16PM -0700, David Miller escreveu:
> 
> This goes back to by-hand parsing of the proc mmap file, and removes
> the timeout.
> 
> In my measurements this makes the parsing about twice as fast.
> 
> Profiling thread synthesizing shows that most of the time is spent in
> the sscanf() call.
> 
> Processing samples is critical for perf top and perf record -a startup
> performance, and perf's ability to keep up with high sample rates.

I'm going to split this patch into two, one that makes it go back to
by-hand proc mmap parsing, and the other killing the mmap-timeout.

Thanks,

- Arnaldo
 
> Signed-off-by: David S. Miller 
> 
> diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
> index 2b1ef704169f..d4b5177c3c22 100644
> --- a/tools/perf/builtin-kvm.c
> +++ b/tools/perf/builtin-kvm.c
> @@ -1364,8 +1364,6 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
>   "show events other than"
>   " HLT (x86 only) or Wait state (s390 only)"
>   " that take longer than duration usecs"),
> - OPT_UINTEGER(0, "proc-map-timeout", &kvm->opts.proc_map_timeout,
> - "per thread proc mmap processing timeout in 
> ms"),
>   OPT_END()
>   };
>   const char * const live_usage[] = {
> @@ -1394,7 +1392,6 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
>   kvm->opts.target.uses_mmap = false;
>   kvm->opts.target.uid_str = NULL;
>   kvm->opts.target.uid = UINT_MAX;
> - kvm->opts.proc_map_timeout = 500;
>  
>   symbol__init(NULL);
>   disable_buildid_cache();
> @@ -1453,8 +1450,7 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
>   perf_session__set_id_hdr_size(kvm->session);
>   ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
>   machine__synthesize_threads(&kvm->session->machines.host, 
> &kvm->opts.target,
> - kvm->evlist->threads, false,
> - kvm->opts.proc_map_timeout, 1);
> + kvm->evlist->threads, false, 1);
>   err = kvm_live_open_events(kvm);
>   if (err)
>   goto out;
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 0980dfe3396b..4259ea42a8ef 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -633,8 +633,7 @@ static int record__synthesize_workload(struct record 
> *rec, bool tail)
>   err = perf_event__synthesize_thread_map(&rec->tool, thread_map,
>process_synthesized_event,
>&rec->session->machines.host,
> -  rec->opts.sample_address,
> -  rec->opts.proc_map_timeout);
> +  rec->opts.sample_address);
>   thread_map__put(thread_map);
>   return err;
>  }
> @@ -848,8 +847,7 @@ static int record__synthesize(struct record *rec, bool 
> tail)
>   }
>  
>   err = __machine__synthesize_threads(machine, tool, &opts->target, 
> rec->evlist->threads,
> - process_synthesized_event, 
> opts->sample_address,
> - opts->proc_map_timeout, 1);
> + process_synthesized_event, 
> opts->sample_address, 1);
>  out:
>   return err;
>  }
> @@ -1521,7 +1519,6 @@ static struct record record = {
>   .uses_mmap   = true,
>   .default_per_cpu = true,
>   },
> - .proc_map_timeout = 500,
>   },
>   .tool = {
>   .sample = process_sample_event,
> @@ -1651,8 +1648,6 @@ static struct option __record_options[] = {
>   parse_clockid),
>   OPT_STRING_OPTARG('S', "snapshot", &record.opts.auxtrace_snapshot_opts,
> "opts", "AUX area tracing Snapshot Mode", ""),
> - OPT_UINTEGER(0, "proc-map-timeout", &record.opts.proc_map_timeout,
> - "per thread proc mmap processing timeout in ms"),
>   OPT_BOOLEAN(0, "namespaces", &record.opts.record_namespaces,
>   "Record namespaces events"),
>   OPT_BOOLEAN(0, "switch-events", &record.opts.record_switch_events,
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index d21d8751e749..4eec44d815bd 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -1096,7 +1096,6 @@ static int __cmd_top(struct perf_top *top)
>  
>   machine__synthesize_threads(&top->session->machines.host, &opts->target,
>   top->evlist->threads, false,
> - opts->proc_map_timeout,
>   top->nr_threads_synthesize);
>  
>   if (top->nr_threads_synthesize > 1)
> @@ -1256

[PATCH RFC] perf: Go back to by-hand proc mmap parsing and kill timeout

2018-10-30 Thread David Miller


This goes back to by-hand parsing of the proc mmap file, and removes
the timeout.

In my measurements this makes the parsing about twice as fast.

Profiling thread synthesizing shows that most of the time is spent in
the sscanf() call.

Processing samples is critical for perf top and perf record -a startup
performance, and perf's ability to keep up with high sample rates.

Signed-off-by: David S. Miller 

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index 2b1ef704169f..d4b5177c3c22 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -1364,8 +1364,6 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
"show events other than"
" HLT (x86 only) or Wait state (s390 only)"
" that take longer than duration usecs"),
-   OPT_UINTEGER(0, "proc-map-timeout", &kvm->opts.proc_map_timeout,
-   "per thread proc mmap processing timeout in 
ms"),
OPT_END()
};
const char * const live_usage[] = {
@@ -1394,7 +1392,6 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
kvm->opts.target.uses_mmap = false;
kvm->opts.target.uid_str = NULL;
kvm->opts.target.uid = UINT_MAX;
-   kvm->opts.proc_map_timeout = 500;
 
symbol__init(NULL);
disable_buildid_cache();
@@ -1453,8 +1450,7 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
perf_session__set_id_hdr_size(kvm->session);
ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
machine__synthesize_threads(&kvm->session->machines.host, 
&kvm->opts.target,
-   kvm->evlist->threads, false,
-   kvm->opts.proc_map_timeout, 1);
+   kvm->evlist->threads, false, 1);
err = kvm_live_open_events(kvm);
if (err)
goto out;
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0980dfe3396b..4259ea42a8ef 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -633,8 +633,7 @@ static int record__synthesize_workload(struct record *rec, 
bool tail)
err = perf_event__synthesize_thread_map(&rec->tool, thread_map,
 process_synthesized_event,
 &rec->session->machines.host,
-rec->opts.sample_address,
-rec->opts.proc_map_timeout);
+rec->opts.sample_address);
thread_map__put(thread_map);
return err;
 }
@@ -848,8 +847,7 @@ static int record__synthesize(struct record *rec, bool tail)
}
 
err = __machine__synthesize_threads(machine, tool, &opts->target, 
rec->evlist->threads,
-   process_synthesized_event, 
opts->sample_address,
-   opts->proc_map_timeout, 1);
+   process_synthesized_event, 
opts->sample_address, 1);
 out:
return err;
 }
@@ -1521,7 +1519,6 @@ static struct record record = {
.uses_mmap   = true,
.default_per_cpu = true,
},
-   .proc_map_timeout = 500,
},
.tool = {
.sample = process_sample_event,
@@ -1651,8 +1648,6 @@ static struct option __record_options[] = {
parse_clockid),
OPT_STRING_OPTARG('S', "snapshot", &record.opts.auxtrace_snapshot_opts,
  "opts", "AUX area tracing Snapshot Mode", ""),
-   OPT_UINTEGER(0, "proc-map-timeout", &record.opts.proc_map_timeout,
-   "per thread proc mmap processing timeout in ms"),
OPT_BOOLEAN(0, "namespaces", &record.opts.record_namespaces,
"Record namespaces events"),
OPT_BOOLEAN(0, "switch-events", &record.opts.record_switch_events,
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index d21d8751e749..4eec44d815bd 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1096,7 +1096,6 @@ static int __cmd_top(struct perf_top *top)
 
machine__synthesize_threads(&top->session->machines.host, &opts->target,
top->evlist->threads, false,
-   opts->proc_map_timeout,
top->nr_threads_synthesize);
 
if (top->nr_threads_synthesize > 1)
@@ -1256,7 +1255,6 @@ int cmd_top(int argc, const char **argv)
.target = {
.uses_mmap   = true,
},
-   .proc_map_timeout= 500,
.overwrite  = 1,
},
.max_sta