On Wednesday, August 30, 2017 at 7:28:41 PM UTC-5, Jakob Kummerow wrote:
>
> On Wed, Aug 30, 2017 at 2:04 PM, <[email protected]> wrote:
>
>> valgrind might be more accurate, as it measures the actual number of 
>> executed instructions, whereas perf counts samples
>>
>
> Well, that means it measures something different. Ultimately the number of 
> executed instructions doesn't matter; what matters is how long it took.
>

Agreed, ultimately what matters is how long it takes to run a program. 
Valgrind is a tool I use to find the most time-consuming parts of the 
code. 
 

>  
>
>> Are you aware of any --prof options that will give me more detail about 
>> the parsing in V8? 
>>
>
> No. What kind of detail are you looking for? --prof gives you 
> per-function tick counts, is that not detailed enough?
>

Yes, per-function tick counts are great. But the benchmark I am using 
(Octane) does not really represent real-world scenarios. Ideally, I want 
to look at per-function tick counts when running Speedometer 2.0, but I 
have not yet gotten Speedometer 2.0 running from the command line.

I also think --runtime-call-stats only measures the time for the functions 
that V8 "wants" to measure. For example, when I run 
Octane/typescript-compiler, it reports ParseFunctionLiteral as the 
"hottest" function, taking 270ms (34.4%) and being called 1385 times. It 
looks like its measurements are based on counters (RuntimeCallStats) 
placed at the entry of the functions it chooses to instrument, e.g. 
ParseFunctionLiteral. If a function is not instrumented (say 
ParseAssignmentExpression), we will never see that it is hot (e.g. 
ParseAssignmentExpression never shows up as hot with 
--runtime-call-stats). Maybe the V8 developers have already done this 
analysis and concluded that ParseAssignmentExpression is not hot enough 
to be worth collecting call stats for. 

Here's the output that I get from --prof --runtime-call-stats for 
Octane/typescript-compiler:

                      Runtime Function/C++ Builtin        Time              Count
========================================================================================
                              ParseFunctionLiteral    270.44ms  34.30%    1385  15.15%
                            StoreIC_Premonomorphic    130.69ms  16.58%    1224  13.39%
                 API_ScriptCompiler_CompileUnbound     74.92ms   9.50%       3   0.03%
                    PreParseWithVariableResolution     46.62ms   5.91%     393   4.30%
                         GC_Custom_SlowAllocateRaw     36.87ms   4.68%      14   0.15%


However, tools/linux-tick-processor does not show ParseFunctionLiteral as 
being that "hot". If it were taking 34% of the execution time, it would 
get far more ticks.
Here's the output from tools/linux-tick-processor:
Statistical profiling result from v8.log, (801 ticks, 224 unaccounted, 0 
excluded).

 [Shared libraries]:
   ticks  total  nonlib   name
      7    0.9%          /lib/x86_64-linux-gnu/libc-2.23.so
      1    0.1%          [vdso]

 [JavaScript]:
   ticks  total  nonlib   name
      1    0.1%    0.1%  Builtin: LoadIC_Uninitialized

 [C++]:
   ticks  total  nonlib   name
    341   42.6%   43.0%  v8::Shell::Load(v8::FunctionCallbackInfo<v8::Value> const&)
     13    1.6%    1.6%  _init
      9    1.1%    1.1%  v8::internal::MemoryAllocator::ZapBlock(unsigned char*, unsigned long)
      9    1.1%    1.1%  std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >* v8::base::CheckEQImpl<bool, bool>(bool, bool, char const*)
      8    1.0%    1.0%  v8::base::Acquire_Load(long const volatile*)
      7    0.9%    0.9%  v8::internal::SmiTagging<8ul>::SmiToInt(v8::internal::Object const*)


So I am a little confused about how to correlate performance data from the 
same run between --runtime-call-stats and linux-tick-processor. Is there 
anything I am missing?

Because these data were inconsistent, I went back to measuring V8 with 
valgrind and got the hottest function from it. All I did was this:
valgrind --tool=callgrind ./d8 run.js
kcachegrind callgrind.out.XXXX

It does show that ParseFunctionLiteral is hot, but the hottest one is 
ParseAssignmentExpression. I understand you have not used valgrind to 
measure performance and have not seen this data. So now I don't know 
which is the hottest function in parsing, but I do know that parsing in 
V8 is very expensive. Hence, I wanted to know if there are specific 
options that measure parsing in a little more detail.

Sirish 
  

>  
>
>>
>> Thanks.
>>
>> Sirish 
>>
>>
>> On Monday, August 14, 2017 at 1:29:20 PM UTC-5, Jakob Kummerow wrote:
>>>
>>> Measuring and investigating performance is indeed difficult, and there 
>>> is no single answer to how best to do it. I haven't heard of valgrind being 
>>> used for this purpose, and don't know how to make sense of its output. I 
>>> mostly use V8's builtin --prof and tools/linux-tick-processor, or when 
>>> that's too coarse, the Linux perf tool (for the latter, see 
>>> instructions in V8's wiki). --runtime-call-stats can also be highly 
>>> useful for investigating certain situations.
>>>
>>> On Mon, Aug 14, 2017 at 9:16 AM, <[email protected]> wrote:
>>>
>>>> Hi, 
>>>>
>>>> I am trying to look at performance of V8. At this time, this is all on 
>>>> X86. To build V8, I am using NDK compiler that comes when I download V8. 
>>>> That is a pretty straightforward process as documented. Once I build a 
>>>> "release" or "debug" V8, I ran the mandreel and codeload Octane 
>>>> benchmarks with valgrind. 
>>>>
>>>
>>> Be aware that Release and Debug mode have vastly different performance 
>>> profiles. Only Release mode is representative of real-world performance.
>>>  
>>>
>>>> Then I use cg_annotate to look at Instruction count, data read/write, 
>>>> branches etc.  I am just curious where the bottlenecks are. I get the 
>>>> following numbers for mandreel. 
>>>>
>>>
>>> Also, note that "the bottlenecks" can be *very* different depending on 
>>> what test/benchmark you run.
>>>  
>>>
>>>>
>>>>
>>>> --------------------------------------------------------------------------------
>>>>            Ir            Dr            Dw         Bi          Bc
>>>>
>>>> --------------------------------------------------------------------------------
>>>> 6,737,556,200 1,975,238,610 1,015,483,012 65,020,470 941,081,112 
>>>>  PROGRAM TOTALS
>>>>
>>>>
>>>> --------------------------------------------------------------------------------
>>>>            Ir          Dr          Dw         Bi          Bc 
>>>>  file:function
>>>>
>>>> --------------------------------------------------------------------------------
>>>> 2,750,360,898 769,151,058 269,297,660 45,068,095 410,353,595  ???:???
>>>>   165,387,529  59,780,937  25,440,838      9,340  27,302,136 
>>>>  ???:v8::internal::Scanner::ScanIdentifierOrKeyword()
>>>>   156,426,896  43,240,550  25,588,801          0  17,909,648 
>>>>  
>>>> ???:v8::internal::ExpressionClassifier<v8::internal::ParserTypes<v8::internal::Parser>
>>>>  
>>>> >::Accumulate(v8::internal::ExpressionClassifier<v8::internal::ParserTypes<v8::
>>>> > 
>>>>         internal::Parser> >*, unsigned int, bool)
>>>>   150,476,281  44,181,365  28,594,154  3,145,871   9,250,952 
>>>>  ???:v8::internal::Scanner::Scan()
>>>> ..
>>>> ..
>>>>
>>>>
>>>> It shows that there are 2.7 billion instructions for ???:???. I am 
>>>> guessing that these are the instructions responsible for hidden 
>>>> classes, loading/creating/deleting/accessing, etc. If my guess is not 
>>>> correct, please let me know. And then there are the usual 165 million for 
>>>> v8::internal::Scanner::ScanIdentifierOrKeyword(), 150 million for Scan(), 
>>>> etc.
>>>>
>>>> However, what is interesting is this - on the subsequent run on the 
>>>> same machine, I see the following numbers:
>>>>
>>>> --------------------------------------------------------------------------------
>>>>             Ir            Dr            Dw          Bi            Bc
>>>>
>>>> --------------------------------------------------------------------------------
>>>> 12,369,840,202 3,844,013,654 1,701,470,472 248,336,756 1,605,561,615 
>>>>  PROGRAM TOTALS
>>>>
>>>>
>>>> --------------------------------------------------------------------------------
>>>>            Ir            Dr          Dw          Bi          Bc 
>>>>  file:function
>>>>
>>>> --------------------------------------------------------------------------------
>>>> 6,361,147,238 2,029,054,548 684,679,708 228,306,958 762,170,252  ???:???
>>>>   690,157,426   260,436,764 130,218,388           0  91,152,866 
>>>>  ???:v8::internal::Runtime_TryInstallOptimizedCode(int, 
>>>> v8::internal::Object**, v8::internal::Isolate*)
>>>>   470,287,848   104,517,164  26,109,566           0  91,422,886 
>>>>  
>>>> /build/glibc-bfm8X4/glibc-2.23/nptl/../nptl/pthread_mutex_lock.c:pthread_mutex_lock
>>>>   365,612,760   104,464,484  13,067,997           0 104,438,144 
>>>>  
>>>> /build/glibc-bfm8X4/glibc-2.23/nptl/pthread_mutex_unlock.c:pthread_mutex_unlock
>>>>   364,611,492    91,152,878 104,174,716           0  13,021,838 
>>>>  
>>>> ???:v8::internal::StackGuard::CheckAndClearInterrupt(v8::internal::StackGuard::InterruptFlag)
>>>>   165,387,529    59,780,937  25,440,838       9,340  27,302,136 
>>>>  ???:v8::internal::Scanner::ScanIdentifierOrKeyword()
>>>>   156,426,896    43,240,550  25,588,801           0  17,909,648 
>>>>  
>>>> ???:v8::internal::ExpressionClassifier<v8::internal::ParserTypes<v8::internal::Parser>
>>>>  
>>>> >::Accumulate(v8::internal::ExpressionClassifier<v8::internal::ParserTypes<v8::
>>>> > 
>>>>      internal::Parser> >*, unsigned int, bool)
>>>>   150,476,281    44,181,365  28,594,154   3,145,871   9,250,952 
>>>>  ???:v8::internal::Scanner::Scan()
>>>>
>>>>
>>>> All the numbers for the V8 functions remained the same; however, all the 
>>>> hidden-class numbers and calls to the library went off the chart. For 
>>>> example, instructions dealing with hidden classes went from 2.7 billion 
>>>> to 6.3 billion. I don't see how using the same benchmark could make such 
>>>> a huge difference in the numbers. Can anyone please explain? 
>>>>
>>>> Also, how does V8 community look at performance numbers if not using 
>>>> standard performance monitoring tools like valgrind? Is there any other 
>>>> way 
>>>> to look at performance numbers?
>>>>
>>>> Thanks.
>>>> Sirish
>>>>
>>>> -- 
>>>> -- 
>>>> v8-dev mailing list
>>>> [email protected]
>>>> http://groups.google.com/group/v8-dev
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "v8-dev" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>
>
