Re: [Qemu-devel] [PATCH v2] rtc: placing RTC memory region outside BQL
> > > $ cat strace_c.sh
> > > strace -tt -p $1 -c -o result_$1.log &
> > > sleep $2
> > > pid=$(pidof strace)
> > > kill $pid
> > > cat result_$1.log
> > >
> > > Before applying this change:
> > > $ ./strace_c.sh 10528 30
> > > % time     seconds  usecs/call     calls    errors syscall
> > > ------ ----------- ----------- --------- --------- ------------
> > >  93.87    0.119070          30      4000           ppoll
> > >   3.27    0.004148           2      2038           ioctl
> > >   2.66    0.003370           2      2014           futex
> > >   0.09    0.000113           1       106           read
> > >   0.09    0.000109           1       104           io_getevents
> > >   0.02    0.000029           1        30           poll
> > >   0.00    0.000000           0         1           write
> > > ------ ----------- ----------- --------- --------- ------------
> > > 100.00    0.126839                  8293           total
> > >
> > > After applying the change:
> > > $ ./strace_c.sh 23829 30
> > > % time     seconds  usecs/call     calls    errors syscall
> > > ------ ----------- ----------- --------- --------- ------------
> > >  92.86    0.067441          16      4094           ppoll
> > >   4.85    0.003522           2      2136           ioctl
> > >   1.17    0.000850           4       189           futex
> > >   0.54    0.000395           2       202           read
> > >   0.52    0.000379           2       202           io_getevents
> > >   0.05    0.000037           1        30           poll
> > > ------ ----------- ----------- --------- --------- ------------
> > > 100.00    0.072624                  6853           total
> > >
> > > The futex call count decreases by ~90.6% on an idle Windows 7 guest.
> >
> > These are the same figures as from v1 -- it would be interesting
> > to check whether the additional locking that v2 adds has affected
> > the results.
>
> Oh, yes. The futex count of v2 doesn't decline as much relative to v1
> because it now takes the BQL before raising the outbound IRQ line.
> Before applying v2:
> # ./strace_c.sh 8776 30
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  78.01    0.164188          26      6436           ppoll
>   8.39    0.017650           5      3700        39 futex
>   7.68    0.016157           6      2758           ioctl
>   5.48    0.011530           3      4586      1113 read
>   0.30    0.000640          20        32           io_submit
>   0.15    0.000317           4        89           write
> ------ ----------- ----------- --------- --------- ------------
> 100.00    0.210482               17601      1152 total
>
> After applying v2:
> # ./strace_c.sh 15968 30
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  78.28    0.171117          27      6272           ppoll
>   8.50    0.018571           5      3663        21 futex
>   7.76    0.016973           6      2732           ioctl
>   4.85    0.010597           3      4115       853 read
>   0.31    0.000672          11        63           io_submit
>   0.30    0.000659           4       180           write
> ------ ----------- ----------- --------- --------- ------------
> 100.00    0.218589               17025       874 total
>
> > Does the patch improve performance in a more interesting use
> > case than "the guest is just idle" ?
>
> I think so; after all, the scope of the locking is reduced.
> Besides this, can we optimize the rtc timer so that it avoids
> holding the BQL, for example by using a separate thread?

Hi Peter, Paolo,

I tested PCMark 8 (https://www.futuremark.com/benchmarks/pcmark) in a
win7 guest and got the results below:

Guest: 2U2G

Before applying v2:
Your Work 2.0 score: 2000
  Web Browsing - JunglePin: 0.334 s
  Web Browsing - Amazonia: 0.132 s
  Writing: 3.59 s
  Spreadsheet: 70.13 s
  Video Chat v2 / Video Chat playback 1 v2: 22.8 fps
  Video Chat v2 / Video Chat encoding v2: 307.0 ms
Benchmark duration: 1h 35min 46s

After applying v2:
Your Work 2.0 score: 2040
  Web Browsing - JunglePin: 0.345 s
  Web Browsing - Amazonia: 0.132 s
  Writing: 3.56 s
  Spreadsheet: 67.83 s
  Video Chat v2 / Video Chat playback 1 v2: 28.7 fps
  Video Chat v2 / Video Chat encoding v2: 324.7 ms
Benchmark duration: 1h 32min 5s

The test results show that the optimization is effective under load.

Thanks,
-Gonglei
Re: [Qemu-devel] [PATCH v2] rtc: placing RTC memory region outside BQL
> -----Original Message-----
> From: Peter Maydell [mailto:peter.mayd...@linaro.org]
> Sent: Tuesday, February 06, 2018 10:36 PM
> To: Gonglei (Arei)
> Cc: QEMU Developers; Paolo Bonzini; Huangweidong (C)
> Subject: Re: [PATCH v2] rtc: placing RTC memory region outside BQL
>
> On 6 February 2018 at 14:07, Gonglei wrote:
> > As Windows guests use the RTC as the clock source device
> > and access it frequently, let's move the RTC memory
> > region outside the BQL to decrease overhead for Windows guests.
> > Meanwhile, add a new lock to prevent different vCPUs from
> > accessing the RTC at the same time.
> >
> > $ cat strace_c.sh
> > strace -tt -p $1 -c -o result_$1.log &
> > sleep $2
> > pid=$(pidof strace)
> > kill $pid
> > cat result_$1.log
> >
> > Before applying this change:
> > $ ./strace_c.sh 10528 30
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ------------
> >  93.87    0.119070          30      4000           ppoll
> >   3.27    0.004148           2      2038           ioctl
> >   2.66    0.003370           2      2014           futex
> >   0.09    0.000113           1       106           read
> >   0.09    0.000109           1       104           io_getevents
> >   0.02    0.000029           1        30           poll
> >   0.00    0.000000           0         1           write
> > ------ ----------- ----------- --------- --------- ------------
> > 100.00    0.126839                  8293           total
> >
> > After applying the change:
> > $ ./strace_c.sh 23829 30
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ------------
> >  92.86    0.067441          16      4094           ppoll
> >   4.85    0.003522           2      2136           ioctl
> >   1.17    0.000850           4       189           futex
> >   0.54    0.000395           2       202           read
> >   0.52    0.000379           2       202           io_getevents
> >   0.05    0.000037           1        30           poll
> > ------ ----------- ----------- --------- --------- ------------
> > 100.00    0.072624                  6853           total
> >
> > The futex call count decreases by ~90.6% on an idle Windows 7 guest.
>
> These are the same figures as from v1 -- it would be interesting
> to check whether the additional locking that v2 adds has affected
> the results.

Oh, yes. The futex count of v2 doesn't decline as much relative to v1
because it now takes the BQL before raising the outbound IRQ line.
Before applying v2:
# ./strace_c.sh 8776 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 78.01    0.164188          26      6436           ppoll
  8.39    0.017650           5      3700        39 futex
  7.68    0.016157           6      2758           ioctl
  5.48    0.011530           3      4586      1113 read
  0.30    0.000640          20        32           io_submit
  0.15    0.000317           4        89           write
------ ----------- ----------- --------- --------- ------------
100.00    0.210482               17601      1152 total

After applying v2:
# ./strace_c.sh 15968 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 78.28    0.171117          27      6272           ppoll
  8.50    0.018571           5      3663        21 futex
  7.76    0.016973           6      2732           ioctl
  4.85    0.010597           3      4115       853 read
  0.31    0.000672          11        63           io_submit
  0.30    0.000659           4       180           write
------ ----------- ----------- --------- --------- ------------
100.00    0.218589               17025       874 total

> Does the patch improve performance in a more interesting use
> case than "the guest is just idle" ?

I think so; after all, the scope of the locking is reduced.
Besides this, can we optimize the rtc timer so that it avoids
holding the BQL, for example by using a separate thread?

> > +static void rtc_rasie_irq(RTCState *s)
>
> Typo: should be "raise".

Good catch. :)

Thanks,
-Gonglei
Re: [Qemu-devel] [PATCH v2] rtc: placing RTC memory region outside BQL
On 6 February 2018 at 14:07, Gonglei wrote:
> As Windows guests use the RTC as the clock source device
> and access it frequently, let's move the RTC memory
> region outside the BQL to decrease overhead for Windows guests.
> Meanwhile, add a new lock to prevent different vCPUs from
> accessing the RTC at the same time.
>
> $ cat strace_c.sh
> strace -tt -p $1 -c -o result_$1.log &
> sleep $2
> pid=$(pidof strace)
> kill $pid
> cat result_$1.log
>
> Before applying this change:
> $ ./strace_c.sh 10528 30
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  93.87    0.119070          30      4000           ppoll
>   3.27    0.004148           2      2038           ioctl
>   2.66    0.003370           2      2014           futex
>   0.09    0.000113           1       106           read
>   0.09    0.000109           1       104           io_getevents
>   0.02    0.000029           1        30           poll
>   0.00    0.000000           0         1           write
> ------ ----------- ----------- --------- --------- ------------
> 100.00    0.126839                  8293           total
>
> After applying the change:
> $ ./strace_c.sh 23829 30
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  92.86    0.067441          16      4094           ppoll
>   4.85    0.003522           2      2136           ioctl
>   1.17    0.000850           4       189           futex
>   0.54    0.000395           2       202           read
>   0.52    0.000379           2       202           io_getevents
>   0.05    0.000037           1        30           poll
> ------ ----------- ----------- --------- --------- ------------
> 100.00    0.072624                  6853           total
>
> The futex call count decreases by ~90.6% on an idle Windows 7 guest.

These are the same figures as from v1 -- it would be interesting
to check whether the additional locking that v2 adds has affected
the results.

Does the patch improve performance in a more interesting use
case than "the guest is just idle" ?

> +static void rtc_rasie_irq(RTCState *s)

Typo: should be "raise".

thanks
-- PMM
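As a quick sanity check, the ~90.6% figure quoted in the commit message can be recomputed from the futex rows of the two strace tables (2014 calls before the patch, 189 after); only those two numbers from the tables are used:

```shell
# Recompute the futex-call reduction from the idle-guest strace tables.
awk 'BEGIN { before = 2014; after = 189;
             printf "futex calls reduced by %.1f%%\n",
                    (before - after) / before * 100 }'
```

This prints "futex calls reduced by 90.6%", matching the claim above.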