Re: [Qemu-devel] 答复: Expansion Ratio Issue

2014-06-05 Thread Alex Bennée

Chaos Shu writes:

 Hi

 I'm running SPEC CPU2006 on three kinds of situation, native aarch64 binary 
 and emulator x86_64 system running SPEC CPU2006 and linux user mode level 
 running x86_64 SPEC CPU2006 binary.

 To find where the performance lose, translator ? or execution of instruction 
 after TCG? Or something else

 I guess most of  time, up to 90% should be spent on exec the
 instruction of TCG, does that mean the quality of translating lead to
 the performance lost directly ?

It really depends on the type of code you are executing but yes most of
the time should be spent in TCG generated code. However if you are
running a lot of FP heavy code you'll find it spends a lot of time in
helper routines calling the internal softfloat code.

I posted some patches a few months ago that enabled output to help the
Linux perf tool track this. I haven't got time to re-work at the
moment but it might give you a head start to instrumentation:

https://patches.linaro.org/27229/


 Thanks
 Chaos

 On 29.05.2014 13:04, Peter Maydell wrote:
 No, we don't in general have any benchmarking of TCG codegen. I think 
 if we did do benchmarking we'd be interested in performance 
 benchmarking -- code expansion ratio doesn't seem like a very 
 interesting thing to measure to me.

 Hi,

 I have a plan to play with TCG performance benchmarking. And then try to 
 implement some optimizations. So maybe there would be some suggestions on how 
 to perform such benchmarking? What tests seems to be appropriate for this 
 task? I think the benchmarking should reflect real TCG use cases. So what the 
 most typical use cases for TCG are there? Seems that system and user modes 
 may be different from this point.

 Appreciate any help.

 Thanks,
 Sergey.

-- 
Alex Bennée



Re: [Qemu-devel] 答复: Expansion Ratio Issue

2014-06-05 Thread Sergey Fedorov
On 05.06.2014 12:02, Alex Bennée wrote:
 Chaos Shu writes:

 Hi

 I'm running SPEC CPU2006 on three kinds of situation, native aarch64 binary 
 and emulator x86_64 system running SPEC CPU2006 and linux user mode level 
 running x86_64 SPEC CPU2006 binary.

 To find where the performance lose, translator ? or execution of instruction 
 after TCG? Or something else

 I guess most of  time, up to 90% should be spent on exec the
 instruction of TCG, does that mean the quality of translating lead to
 the performance lost directly ?
 It really depends on the type of code you are executing but yes most of
 the time should be spent in TCG generated code. However if you are
 running a lot of FP heavy code you'll find it spends a lot of time in
 helper routines calling the internal softfloat code.

 I posted some patches a few months ago that enabled output to help the
 Linux perf tool track this. I haven't got time to re-work at the
 moment but it might give you a head start to instrumentation:

 https://patches.linaro.org/27229/

Thanks for replying! I used to think about Drystone, gzim, gcc in user
mode. In system mode, Linux boot up and, again, Drystone, gzim, gcc.
Regarding SPEC test, that is not available for free, isn't it?

Thanks,
Sergey


 Thanks
 Chaos

 On 29.05.2014 13:04, Peter Maydell wrote:
 No, we don't in general have any benchmarking of TCG codegen. I think 
 if we did do benchmarking we'd be interested in performance 
 benchmarking -- code expansion ratio doesn't seem like a very 
 interesting thing to measure to me.
 Hi,

 I have a plan to play with TCG performance benchmarking. And then try to 
 implement some optimizations. So maybe there would be some suggestions on 
 how to perform such benchmarking? What tests seems to be appropriate for 
 this task? I think the benchmarking should reflect real TCG use cases. So 
 what the most typical use cases for TCG are there? Seems that system and 
 user modes may be different from this point.

 Appreciate any help.

 Thanks,
 Sergey.




Re: [Qemu-devel] 答复: Expansion Ratio Issue

2014-06-05 Thread Peter Maydell
On 5 June 2014 14:00, Sergey Fedorov serge.f...@gmail.com wrote:
 Thanks for replying! I used to think about Drystone, gzim, gcc in user
 mode. In system mode, Linux boot up and, again, Drystone, gzim, gcc.

Probably worth making sure you also test workloads that do different
things in multiple processes (to catch performance issues from over
frequent TB/TLB flushes, and so on).

thanks
-- PMM



Re: [Qemu-devel] 答复: Expansion Ratio Issue

2014-06-05 Thread Sergey Fedorov
On 05.06.2014 17:07, Peter Maydell wrote:
 Probably worth making sure you also test workloads that do different
 things in multiple processes (to catch performance issues from over
 frequent TB/TLB flushes, and so on).

Maybe make -jN?

Thanks,
Sergey



[Qemu-devel] 答复: Expansion Ratio Issue

2014-06-04 Thread Chaos Shu
Hi

I'm running SPEC CPU2006 on three kinds of situation, native aarch64 binary and 
emulator x86_64 system running SPEC CPU2006 and linux user mode level running 
x86_64 SPEC CPU2006 binary.

To find where the performance lose, translator ? or execution of instruction 
after TCG? Or something else

I guess most of  time, up to 90% should be spent on exec the instruction of 
TCG, does that mean the quality of translating lead to the performance lost 
directly ? 

Thanks
Chaos

On 29.05.2014 13:04, Peter Maydell wrote:
 No, we don't in general have any benchmarking of TCG codegen. I think 
 if we did do benchmarking we'd be interested in performance 
 benchmarking -- code expansion ratio doesn't seem like a very 
 interesting thing to measure to me.

Hi,

I have a plan to play with TCG performance benchmarking. And then try to 
implement some optimizations. So maybe there would be some suggestions on how 
to perform such benchmarking? What tests seems to be appropriate for this task? 
I think the benchmarking should reflect real TCG use cases. So what the most 
typical use cases for TCG are there? Seems that system and user modes may be 
different from this point.

Appreciate any help.

Thanks,
Sergey.





[Qemu-devel] 答复: Expansion Ratio Issue

2014-05-29 Thread Chaos Shu
On 29 May 2014 08:58, Chaos Shu chaos.s...@live.com wrote:
 1.   Any benchmarks paying attention to TCG code generate quality
 measured by code expansion ratio? Of course I’ve got some news said 
 that the ratio maybe 4 or 5 in X86 to MIPS, that is to say 1 x86 insn 
 to 4 or 5 mips insns, Does it mean the industry level or average 
 level? Any official report given?

No, we don't in general have any benchmarking of TCG codegen. I think if we did 
do benchmarking we'd be interested in performance benchmarking -- code 
expansion ratio doesn't seem like a very interesting thing to measure to me.

[Chaos] Assuming that we just care about running x86 application on arm, in 
general way we translate x86 insn to operations then to arm insns, but that 
means we need more cycles in arm to finish issuing the insns if the code 
expansion ratio is high.

I've investigated some industry methods such as registers map(x86 registers 
maps to arm registers directly but not op on register memory table), actually 
the improvement is limited

And another idea such insn-insn directly, according the runtime statistics, 
mostly we use 20% insn such as br move load, I mean is that possible to map 
those 20% insns directly from x86 to arm and make the left 80% right, but this 
is just original and more important maybe impossible to practice idea, is there 
any research about this on the way? How do you think about this?

 2.   I’ve noticed that once Apple merge from PowerPC to X86, they
 developed the software named Rosetta which is described by apple to be 
 successful, is it the same to Qemu? Any internal infos covered?

It's a similar concept, though as I understand it it focused on doing 
translation for a single application (like QEMU's linux-user mode, not like our 
system emulation mode). I have no idea about its internal design.

[Chaos] I think ARM should provide a runtime library help the customers to 
merge from x86 more smoothly even performance loss, So does arm really get 
that? 

 3.   Assume that we just wanna x86 to arm, so may we can strip out the
 little operations and work on insn to insn such as move in x86 to move 
 in arm, insn level translate but not insn-op-insn, I think there must 
 be someone have ever made this try, anyone got their news?

Certainly if you started from scratch with the intention of doing a more 
specifically targeted design (and in particular if you wanted to do 
single-application translation as your core focus rather than as a bolt-on 
extension to system emulation) you could probably get better performance than 
QEMU. QEMU generally aims to be a general-purpose project, though.

Personally I would (even if doing only x86-to-ARM) still include an 
intermediate representation of some form: the history of compiler design shows 
that it has a lot of utility.

[Chaos] Case be compiled, all syntax information has been stripped out, all we 
get is op_reg_reg different from JVM, we only dance with registers and insn 
without anything, that's the problem.

 4.   Why Qemu use only one TCG runtime, I found a project named PQEMU
 once try to make TCG running on multicore but it’s out of date and got 
 some commercial issues, is there any project trying to make it go?

Not that I currently know of. Truly parallel TCG execution of multiple guest 
cores is a hard problem, especially if you want to produce maintainable solid 
code that can be included upstream, rather than just enough of a prototype to 
demonstrate proof of concept and run some simple benchmarks for an academic 
paper.

[Chaos] After all, what's current status of industry product making x86 
application running on arm? Is still dark night in middle age and I have to 
make big effort to make Qemu to be so? 

Anyway, any infos about that issue is welcome, thanks very much.

Thanks
Chaos