Re: [Qemu-devel] TCG flow vs dyngen
> That said, QEMU's currently working fairly well on this front too, so studying either should work pretty well...

Mr. Richard Henderson's patch on elfload.c says I was right.. at least the version I am working on (qemu-0.13.0) had some bugs and weaknesses, though it worked smoothly for most cases..

> And to be honest, the best way to get up to speed on this is to read this: http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html

Hmm, seems a good piece.. maybe one of the last I still didn't have :) Thank you!!

Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
> > Again wow!! Is that really possible? Some sort of callback triggered at every instruction execution?
> Yes, this mechanism works. I have written code to count different kinds of instructions.

Great! That opens a lot of possibilities!

> It exists in file qemu/target-i386/translate.c

Oops, right! I checked target-ppc/translate.c, as I need PowerPC as target.. I wonder what function replaces it there..

> You are also talking about qemu source code provided here http://wiki.qemu.org/Download, right?

Yes, I am using this: http://wiki.qemu.org/download/qemu-0.13.0.tar.gz

> If you need, I can give the source code of counting implementation with some documentation. Hope this helps.

Wow, that would be awesome! I'd really appreciate it very much! Thank you! :) Feel free to send it to my address! :)

Best regards!!
Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
On 01/25/2011 10:05 AM, Edgar E. Iglesias wrote:
> On Tue, Jan 25, 2011 at 10:04:39AM +0100, Stefano Bonifazi wrote:
> > [...]
> > > If you need, I can give the source code of counting implementation with some documentation. Hope this helps.
> > Wow that would be awesome! I'd really appreciate it very much! Thank you! :)
>
> Hi, If you are interested in instruction counting maybe you should take a look at the -icount option as well. Cheers

Thank you! I already tried it long ago; it doesn't work with qemu-user.. If I remember correctly, its core was in files not used in qemu-user :(

Regards,
Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
On Tue, Jan 25, 2011 at 10:04:39AM +0100, Stefano Bonifazi wrote:
> [...]
> > If you need, I can give the source code of counting implementation with some documentation. Hope this helps.
> Wow that would be awesome! I'd really appreciate it very much! Thank you! :)

Hi,

If you are interested in instruction counting, maybe you should take a look at the -icount option as well.

Cheers
Re: [Qemu-devel] TCG flow vs dyngen
On 01/24/2011 12:40 AM, Rob Landley wrote:
> On 01/23/2011 04:25 PM, Stefano Bonifazi wrote:
> > I am trying to shift in memory the target executable .. now the code is supposed to be loaded by the elfloader at the exact start address set at link time ..
>
> Ah, elf loading. That's a whole 'nother bag of worms. Oddly enough, I was dealing with this last year trying to debug the uClibc dynamic linker. I blogged a bit about it at the time: http://landley.net/notes-2010.html#12-07-2010 (And the next few days. Sigh, I never did go back and fill in the holes, did I?)
>
> > Inside elfloader there is even a check for verifying whether that address range is busy.. but no action is taken in that case o.O Maybe I'll post a new thread about this problem (bug?) .. anyway if you think you can help me anyway I'll give you further details..
>
> Tired right now, but if you post a clearer question (what are you trying to _do_) and cc: me on it I'll try to respond. Maybe I can find some decent documentation to point you at, or maybe I'll write some...
>
> Rob

Thank you! I read your post, and yup, you also noticed the weirdness of load_bias.. and wondered how it can work on x86.. But I think your work was on qemu-system.. I am working on qemu-user.. Yup, better to post a new thread; I'll cc: you there! Thank you very much!

Stefano B
Re: [Qemu-devel] TCG flow vs dyngen
2011/1/23 Rob Landley r...@landley.net:
> Keep in mind I'm a bit rusty and not an expert, but I'll give a stab at answering:

...here's a couple of clarifications:

> > 2. how can I check the number of target cpu cycles or target instructions executed inside qemu-user (i.e. qemu-ppc)?
>
> You can't, because QEMU doesn't work that way. QEMU isn't an instruction level emulator, it's closer to a Java JIT.

Being a JIT doesn't prohibit counting target instructions executed. It just means that counting them generally requires generating code to do the counting at runtime, so it's a more complicated change to make than it would be in a non-JIT emulator.

The major reason for not counting cycles is that for an emulation of a modern CPU this is pretty nearly impossible: the number of cycles an instruction takes can depend on whether it causes a cache miss, which CPU internal pipeline it uses, whether it needs to stall waiting for a result from an earlier insn, whether the CPU correctly predicted the branch leading up to it or not, and on and on. You would need to precisely model all the internals of each variant of each CPU, which would be a mammoth undertaking requiring probably unpublished internal data, and if you ever managed to finish it then it would run incredibly slowly and would probably contain enough bugs you couldn't trust the data it gave you anyway.

> This means that QEMU can no longer run on a type of host it can't execute target code for

This isn't correct; for instance there's hppa support in TCG for hppa hosts but no hppa target support, and there's sh4 target support but no TCG backend for it. The two ends are cleanly separated in qemu and don't generally depend on each other.

-- PMM
Re: [Qemu-devel] TCG flow vs dyngen
On 01/24/2011 03:32 PM, Peter Maydell wrote:
> Being a JIT doesn't prohibit counting target instructions executed. It just means that counting them generally requires generating code to do the counting at runtime, so it's a more complicated change to make than it would be in a non-JIT emulator.

What do you mean? Should I change the code of qemu-user for counting the instructions, or should I add code into the target binaries?

> The major reason for not counting cycles is that for an emulation of a modern CPU this is pretty nearly impossible: the number of cycles an instruction takes can depend on whether it causes a cache miss, which CPU internal pipeline it uses, whether it needs to stall waiting for a result from an earlier insn, whether the CPU correctly predicted the branch leading up to it or not, and on and on. You would need to precisely model all the internals of each variant of each CPU, which would be a mammoth undertaking requiring probably unpublished internal data, and if you ever managed to finish it then it would run incredibly slowly and would probably contain enough bugs you couldn't trust the data it gave you anyway.

Yup, I think it was just a silly mistake of mine when I wrote "cycles" in the first post.. that was because for me anything that can estimate how long it takes to do the work would be fine.. I can't simply check the time because that is host machine dependent... Number of executed instructions would be fine..

> > This means that QEMU can no longer run on a type of host it can't execute target code for
> This isn't correct; for instance there's hppa support in TCG for hppa hosts but no hppa target support, and there's sh4 target support but no TCG backend for it. The two ends are cleanly separated in qemu and don't generally depend on each other.

Well, I experienced a strange behavior some time ago that initially made me think Mr. Rob was right on that, though I knew host support and target support were separated in qemu: I tried to run make for qemu-ppc directly from inside the ppc-linux-user folder on an x86_64 machine (I can do that fine on an x86 machine) and it failed because there was no tcg/x86_64/tcg_target.h, whereas running make from within the main folder worked. So I do not understand very well.. are some required headers fixed up when using the main makefile?

Best regards!
Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
Stefano Bonifazi writes:
> On 01/24/2011 03:32 PM, Peter Maydell wrote:
> > Being a JIT doesn't prohibit counting target instructions executed. It just means that counting them generally requires generating code to do the counting at runtime, so it's a more complicated change to make than it would be in a non-JIT emulator.
> What do you mean? Should I change the code of qemu-user for counting the instructions, or should I add code into the target binaries?

If I recall this correctly, target-i386 has a generic function (whose name I don't remember) called whenever the rdtsc instruction is executed. This function rebuilds the counter that contains the number of executed instructions (more or less, this number can be tuned from a variety of sources).

Lluis

--
And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer.
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Qemu-devel] TCG flow vs dyngen
On Monday 24 January 2011 08:26 PM, Stefano Bonifazi wrote:
> On 01/24/2011 03:32 PM, Peter Maydell wrote:
> > Being a JIT doesn't prohibit counting target instructions executed. It just means that counting them generally requires generating code to do the counting at runtime, so it's a more complicated change to make than it would be in a non-JIT emulator.
> What do you mean? Should I change the code of qemu-user for counting the instructions, or should I add code into the target binaries?

You should see this pdf (www.ecs.syr.edu/faculty/yin/Teaching/TC2010/Proj4.pdf). It talks about tracing the instructions.

-- Dushyant
Re: [Qemu-devel] TCG flow vs dyngen
On 01/24/2011 04:17 AM, Stefano Bonifazi wrote:
> I read your post, and yup you also noticed the weird of load_bias.. and wondered how it can work on x86.. But I think your work was on qemu-system.. I am working on qemu-user..

My post wasn't on qemu-anything, it was while I was trying to debug the uClibc dynamic loader on a new platform (the Qualcomm Hexagon) that Linux support still hasn't gone upstream for yet.

The thing is, the kernel currently _does_ work, so studying the relevant kernel code (and possibly the dynamic loader code) is one way to learn how it currently works.

Rob
Re: [Qemu-devel] TCG flow vs dyngen
On 01/24/2011 07:02 PM, Dushyant Bansal wrote:
> On Monday 24 January 2011 08:26 PM, Stefano Bonifazi wrote:
> > On 01/24/2011 03:32 PM, Peter Maydell wrote:
> > > Being a JIT doesn't prohibit counting target instructions executed. It just means that counting them generally requires generating code to do the counting at runtime, so it's a more complicated change to make than it would be in a non-JIT emulator.
> > What do you mean? Should I change the code of qemu-user for counting the instructions, or should I add code into the target binaries?
>
> You should see this pdf (www.ecs.syr.edu/faculty/yin/Teaching/TC2010/Proj4.pdf). It talks about tracing the instructions.
> -- Dushyant

Wow, thank you! It sounds incredibly interesting!! The pdf says:

"What we really need is to insert a function call into the translated code, so when each instruction is executed at runtime, our inserted function will be executed."

Again wow!! Is that really possible? Some sort of callback triggered at every instruction execution? Do you have any other document explaining that? This pdf just gives instructions on how to do it on an old version of qemu (disas_insn doesn't exist at all in my code now), and does not explain what it is, or what's behind that suggested code..

Also, the code for single step would be of great help to me! I really needed that.. but when I tried it on qemu-user it didn't work at all..

Thank you very much!

Best regards,
Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
Hi! Thanks for replying!

> The thing is, the kernel currently _does_ work, so studying the relevant kernel code (and possibly the dynamic loader code) is one way to learn how it currently works.

Sorry, what kernel? QEMU's? Linux's?
Re: [Qemu-devel] TCG flow vs dyngen
On 01/24/2011 03:16 PM, Stefano Bonifazi wrote:
> Hi! Thanks for replying me!
> > The thing is, the kernel currently _does_ work, so studying the relevant kernel code (and possibly the dynamic loader code) is one way to learn how it currently works.
> Sorry what kernel? Qemu's? Linux's?

QEMU isn't a kernel, it's an emulator. Linux is a kernel. I meant Linux loads and runs Linux ELF executables. That's pretty much the definition of how to do it. So if there's ever a conflict between how qemu does it and how the Linux kernel does it, the Linux kernel is going to win. (And yes, this has come up before, for me it was http://www.mail-archive.com/qemu-devel@nongnu.org/msg25336.html )

That said, QEMU's currently working fairly well on this front too, so studying either should work pretty well... One advantage of the kernel is

    cat /proc/$PID/maps

which lets you know what the mappings are, and then you can look up the appropriate chunks of the executable and read the elf spec: http://refspecs.freestandards.org/elf/elf.pdf

And to be honest, the best way to get up to speed on this is to read this: http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html Where some guy asked "ok, what do we actually NEED" and then set out to prove it.

This book is pretty good too, although so dry it's almost unreadable. You might have better luck getting a paper copy out of the library: http://www.iecc.com/linker/

Rob
Re: [Qemu-devel] TCG flow vs dyngen
> > You should see this pdf (www.ecs.syr.edu/faculty/yin/Teaching/TC2010/Proj4.pdf). It talks about tracing the instructions.
> > -- Dushyant
>
> Wow thank you! It sounds incredibly interesting!! What we really need is to insert a function call into the translated code, so when each instruction is executed at runtime, our inserted function will be executed. Again wow!! Is that really possible? Some sort of callback triggered at every instruction execution?

Yes, this mechanism works. I have written code to count different kinds of instructions.

> Do you have any another document explaining that?

No. But maybe you can try to understand this through the qemu source code. Here are some resources for that: http://stackoverflow.com/questions/4501173/a-call-to-those-who-have-worked-with-qemu

> This pdf just gives instructions on how to do it on an old version of qemu (disas_insn doesn't exist at all on my code now), and does not explain what it is, what's behind that suggested code .. Also the code for single step would be of great help to me! I really needed that.. but when I tried it on qemu-user didn't work at all..

It exists in file qemu/target-i386/translate.c

You are also talking about the qemu source code provided here http://wiki.qemu.org/Download, right?

If you need, I can give the source code of the counting implementation with some documentation.

Hope this helps.

-- Dushyant
Re: [Qemu-devel] TCG flow vs dyngen
On 01/16/2011 10:01 AM, Raphaël Lefèvre wrote:
> On Sun, Jan 16, 2011 at 11:21 PM, Stefano Bonifazi stefboombas...@gmail.com wrote:
> 2. how can I check the number of target cpu cycles or target instructions executed inside qemu-user (i.e. qemu-ppc)? Is there any variable I can inspect for such informations? at Dec, 2010

Keep in mind I'm a bit rusty and not an expert, but I'll give a stab at answering:

You can't, because QEMU doesn't work that way. QEMU isn't an instruction level emulator, it's closer to a Java JIT. It doesn't translate one instruction at a time but instead translates large blocks of code all at once, and keeps a cache of translated blocks around. Execution jumps into each block and either waits for it to exit again (meaning it jumped out of that page and QEMU's main execution loop has to look up what page to execute next, possibly translating it first if it's not in the cache yet), or else QEMU interrupts it after a while to fake an IRQ of some kind (such as a timer interrupt).

You may want to read Fabrice Bellard's original paper on the QEMU design: http://www.usenix.org/event/usenix05/tech/freenix/full_papers/bellard/bellard.pdf

Since that was written, dyngen was replaced with tcg, but that does the same thing in a slightly different way.

Building a QEMU with dyngen support used to use the host compiler to compile chunks of code corresponding to the target operations it would see at runtime, and then strip the machine language out of the resulting .o files and save them in a table. Then at runtime dyngen could generate translated pages by gluing together the resulting saved machine language snippets the host compiler had produced when qemu was built. The problem was, beating the right kind of machine language snippets out of the .o files the compiler produced from the example code turned out to be VERY COMPILER DEPENDENT.
This is why you couldn't build qemu with gcc 4.x for the longest time, gcc's code generator and the layout of the .o files changed in a bunch of subtle ways which broke dyngen's ability to extract usable machine code snippets to put 'em into the table so it could translate pages at runtime.

TCG stands for Tiny Code Generator. It just hardwires a code generator into QEMU. They wrote a mini-compiler in C, which knows what instructions to output for each host qemu supports. If QEMU understands target instructions well enough to _read_ them, it's not a big stretch to be able to _write_ them when running on that kind of host. (It's more or less the same operation in reverse.) This means that QEMU can no longer run on a type of host it can't execute target code for, but the solution is to just add support for all the interesting machines out there, on both sides.

So, when QEMU executes code, the virtual MMU faults a new page into the virtual TLB, and goes "I can't execute this, fix it up!" And the fixup handler looks for a translation of the page in the cache of translated pages, and if it can't find it it calls the translator to convert the target code into a page of corresponding host code. Which may involve discarding an existing entry out of the cache, but this is how instruction caches work on real hardware anyway so the delays in QEMU are where they'd be on real hardware anyway, and optimizing for one is pretty close to optimizing for the other, so life is good.

The chunk you found earlier is a function pointer typecast:

    #define tcg_qemu_tb_exec(tb_ptr) \
        ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

Which looks like it's calling code_gen_prologue() with tb_ptr as its argument (typecast to a void *), and it returns a long. That calls a translated page, and when the function returns that means the page of code needs to jump to code somewhere outside of that page, and we go back to the main loop to figure out where to go next.
The reason QEMU is as fast as it is is because once it has a page of translated code, actually _running_ it is entirely native. It jumps into the page, and executes natively until it leaves the page. Control only goes back to QEMU to switch pages or to handle I/O and interrupts and such.

So when you ask "how many clock cycles did that instruction take", the answer is "it doesn't work that way". QEMU emulates at memory page level (generally 4k of target code), not at individual instruction level.

(Oh, and the worst thing you can do to QEMU from a performance perspective is self-modifying code. Because the virtual MMU has to strip the executable bit off the TLB entry and re-translate the entire page next time something tries to execute it. It _works_, it's just slow. But again, real hardware can hiccup a bit on this too.)

Does that answer your question?

Rob
Re: [Qemu-devel] TCG flow vs dyngen
On 01/23/2011 10:50 PM, Rob Landley wrote:
> [...]
> Does that answer your question? Rob

Wow! Thank you! That's an ANSWER! Gold for who's studying all of that! Though at the stage of my work I had to understand almost all of it, your perfect summary make everything much clearer.. About counting
Re: [Qemu-devel] TCG flow vs dyngen
On 01/23/2011 04:25 PM, Stefano Bonifazi wrote:
> I am trying to shift in memory the target executable .. now the code is supposed to be loaded by the elfloader at the exact start address set at link time ..

Ah, elf loading. That's a whole 'nother bag of worms. Oddly enough, I was dealing with this last year trying to debug the uClibc dynamic linker. I blogged a bit about it at the time: http://landley.net/notes-2010.html#12-07-2010 (And the next few days. Sigh, I never did go back and fill in the holes, did I?)

> Inside elfloader there is even a check for verifying whether that address range is busy.. but no action is taken in that case o.O Maybe I'll post a new thread about this problem (bug?) .. anyway if you think you can help me anyway I'll give you further details..

Tired right now, but if you post a clearer question (what are you trying to _do_) and cc: me on it I'll try to respond. Maybe I can find some decent documentation to point you at, or maybe I'll write some...

Rob
Re: [Qemu-devel] TCG flow vs dyngen
Stefano Bonifazi writes: Hi! In case you are interested in helping me, I'll give you a big piece of news I've just got (even my teacher is not informed yet! :) ) I still don't understand what is your high-level objective... Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Qemu-devel] TCG flow vs dyngen
On Wed, Dec 15, 2010 at 4:17 AM, Stefano Bonifazi stefboombas...@gmail.com wrote:
> On 12/11/2010 03:44 PM, Blue Swirl wrote:
>
> Hi! Thank you very much! Knowing exactly where I should check, in a so big project helped me very much!! Anyway after having spent more than 2 days on that code I still can't understand how it works the real execution:
>
> in cpu-exec.c : cpu_exec_nocache i find:
>
>     /* execute the generated code */
>     next_tb = tcg_qemu_tb_exec(tb->tc_ptr);
>
> and in cpu-exec.c : cpu_exec
>
>     /* execute the generated code */
>     next_tb = tcg_qemu_tb_exec(tc_ptr);
>
> so I thought tcg_qemu_tb_exec function should do the work of executing the translated binary in the host. But then I found out it is just a define in tcg.h:
>
>     #define tcg_qemu_tb_exec(tb_ptr) ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)
>
> and again in exec.c
>
>     uint8_t code_gen_prologue[1024] code_gen_section;
>
> Maybe I have some problems with that C syntax, but I really don't understand what happens there.. how the execution happens!
>
> Here instead with QEMU/TCG I understood that at runtime the target binary is translated into host binary (somehow) .. but then.. how can this new host binary be run? Shall the host code at runtime do some sort of (assembly speaking) branch jump to an area of memory with new host binary instructions .. and then jump back to the old process binary code?

1. As far as I know, the host code translated from the target instructions exists in object-file format, which is why it can be executed directly.

2. I think you have the right concept from a certain point of view; one part of QEMU's internals certainly does this kind of jumping back.

> If so, can you explain me how this happens in those lines of code?

I can only give a rough outline. The code you listed does a simple thing: it changes where the host is executing so that it points at the next address the host processor should continue from.

> I am just a student.. unluckily at university they just tell you that a cpu follows some sort of fetch -decode-execute flow .. but then you open QEMU.. and wow there is a huge gap for understanding it, and no books where to study it! ;)

QEMU is not meant to simulate every detail of how the processor behaves; it just tries to approximate the operations a machine needs to perform! The "fetch-decode-execute" flow only needs to concern you when you get involved in hardware design.

Raphaël Lefèvre
Re: [Qemu-devel] TCG flow vs dyngen
On 01/16/2011 03:46 PM, Raphael Lefevre wrote:
> [...]
> The QEMU is not used to simulate the every details of the processor should behave, it just try to approximate the necessary operations what a machine should be! “fetch-decode-execute” flow only need to be concerned when you involve into the hardware design. Raphaël Lefèvre

Thank you very much! I've already solved this problem.. Right now I am fighting with the possibility of changing qemu-user code for making it run several binaries in succession .. But it seems to remember the first translated code.. Nobody answered to my post about it, do you have any idea?
Re: [Qemu-devel] TCG flow vs dyngen
On Sun, Jan 16, 2011 at 11:21 PM, Stefano Bonifazi stefboombas...@gmail.com wrote: Thank you very much! I've already solved this problem.. Right now I am fighting with the possibility of changing qemu-user code for making it run several binaries in succession .. But it seems to remember the first translated code.. Nobody answered to my post about it, do you have any idea? Sorry for my belated on this discussion, after I searched for the topics you posted, it seems two main problems are unsolved? (Am I right?? I'm not sure...) 1. I edited QEMU user, more exactly qemu-ppc launching the main function (inside main.c) from another c function I created, passing it the appropriate parameters. ...balabala at Jan, 2011 2. how can I check the number of target cpu cycles or target instructions executed inside qemu-user (i.e. qemu-ppc)? Is there any variable I can inspect for such informations? at Dec, 2010 If I'm not correct, please let me know where the problem is. Raphaël Lefèvre
Re: [Qemu-devel] TCG flow vs dyngen
Sorry for my belated on this discussion, after I searched for the topics you posted, it seems two main problems are unsolved? (Am I right?? I'm not sure...) 1. I edited QEMU user, more exactly qemu-ppc launching the main function (inside main.c) from another c function I created, passing it the appropriate parameters. ...balabala at Jan, 2011 2. how can I check the number of target cpu cycles or target instructions executed inside qemu-user (i.e. qemu-ppc)? Is there any variable I can inspect for such informations? at Dec, 2010 If I'm not correct, please let me know where the problem is. Raphaël Lefèvre Hi! Thank you very much for Your concern! Honestly I had lost hope in any help, I even contacted directly some developers in this mailing list without luck! I am a student who needs to use qemu for a project where it will be used for its capabilities of running PowerPC code. As you can imagine qemu goes far beyond the knowledge in electronics and computer science of a student. Nevertheless I have to do that! I have been studying all the possible technical documents available in the internet, but it is really not much at all , not sufficient for getting the code and being able of understanding it .. It is in C, even not modular C++ Anyway with some help from this mailing list, and a lot of studying about assembly, loaders, compilers.. I am going on, though there are still big problems due of the nature of the QEMU code.. First of all, I am starting from qemu-user, more specifically, qemu-ppc as I don't need the full system capabilities, and it is easier for me to control the binary target memory with qemu-user. Originally I started with a lot of work on libqemu .. until some developer here told me it was deprecated (though still in the source) and not working fine. I edited the code of qemu-ppc so that another function of mine calls qemu-user main, with the appropriate parameters.. 
The pursued goal was to launch it several times with different target binaries in succession.. For some reason, I still can't find out, qemu code remembers the old code, running it instead of the new loaded binary.. and if I flush the cache of translated code before loading a new binary it stops and can't go on! My workaround to this problem was compiling qemu-ppc as a dynamic library and load it at runtime.. I also managed to load multiple copies of it (with dlmopen each at a different address space) ..in fact I need to run more than one qemu-ppc at the same time but a new big problem popped up now: the target binary is loaded always at a fixed address.. no matter if another qemu-ppc already loaded code there.. it is like the internal elf loader can't understand those addresses are not available, and then relocate them .. I tried to link (ld) the binary target elf as position independent code, but then qemu-ppc complains it can't find /usr/lib/libc.so.1 and /usr/lib/ld.so.1 To sum up the problems are (in order of importance): - making the elf loader relocate the target code into other addresses when the default ones (I guess those embedded into the target binary when it is not compiled as position independent code) are taken - making qemu-user able of running more than one target binary in succession - counting qemu-user executed instructions My university is a public one, so my project will be open to the community, I will also upload the documentation I am writing about qemu coming from the knowledge I am acquiring working on it, so that, I hope, other people will find less frustrating the first steps into developing qemu! Any help will be more than welcome! Thank you in advance! Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
2011/1/16 Stefano Bonifazi stefboombas...@gmail.com: My workaround to this problem was compiling qemu-ppc as a dynamic library and load it at runtime.. I also managed to load multiple copies of it (with dlmopen each at a different address space) ..in fact I need to run more than one qemu-ppc at the same time This approach seems very unlikely to work -- in general qemu in both system and user mode assumes that there is only one instance running in the host process address space, and things are bound to clash. (Linux doesn't seem to have dlmopen but google suggests that it puts the library in its own namespace but not its own address space.) Running each qemu as its own process and using interprocess communication for whatever coordination you need between the various instances seems more likely to be workable to me. This will also fix your can't run more than one binary in succession problem, because you can just have the first qemu run and exit as normal and launch a second qemu to run the second binary. -- PMM
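A minimal sketch of the process-per-binary approach suggested above, assuming a POSIX host. The helper name run_in_child is hypothetical and not part of QEMU; it only shows that each run gets a fresh address space, so nothing from a previous translation can survive into the next one.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one command in its own child process and return its exit status,
 * or -1 if it could not be started or did not exit normally.
 * run_in_child is a hypothetical helper name, not a QEMU function. */
static int run_in_child(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;                 /* fork failed */
    }
    if (pid == 0) {
        execvp(argv[0], argv);     /* only returns on error */
        _exit(127);                /* conventional "command not found" code */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;             /* real waitpid failure */
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Running two target binaries in succession would then be two calls, for example with argument vectors such as {"qemu-ppc", "./first", NULL} and {"qemu-ppc", "./second", NULL} (the binary names are placeholders). Data exchange between instances would have to go through pipes, sockets or shared memory, which this sketch does not cover.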
Re: [Qemu-devel] TCG flow vs dyngen
Thank you very much for Your fast reply!

On 01/16/2011 07:29 PM, Peter Maydell wrote:

Linux doesn't seem to have dlmopen

http://www.unix.com/man-page/All/3c/dlmopen/

    #define _GNU_SOURCE
    #include <dlfcn.h>
    lib_handle1 = dlmopen(LM_ID_NEWLM, "./libqemu-ppc.so", RTLD_NOW);

I am developing that on a clean ubuntu 10.10

but google suggests that it puts the library in its own namespace but not its own address space.

I need to make the different instances of qemu-user exchange data .. obviously keeping all of them in the same address space would be the easiest way (unless I have to change all qemu code ;) )

Running each qemu as its own process and using interprocess communication for whatever coordination you need between the various instances seems more likely to be workable to me. This will also fix your can't run more than one binary in succession problem, because you can just have the first qemu run and exit as normal and launch a second qemu to run the second binary. -- PMM

Exactly, it was the easiest way also for me.. and I've already done it, works smoothly .. the only big problem is that it is not good for my teacher.. he says it should work the dynamic library way o.O

Working with libraries even solved the problem of consecutive runs, though in my opinion software is not good when you must restart it to make it run fine again.. sounds more Windows style :D Clearly it leaves memory dirty and does not clean up after the target process completes its execution.. leaving the OS to take care of it. I tried zeroing all global variables before starting a new execution without results (other than making it stall) ..

After a very long time spent trying to find a solution, I think the problem should be with the mmap'ing stuff in the loader .. the same reason why, in my opinion, 2 different libraries with their own namespaces clash.. the elf loaders work globally within the unique address space .. I think for a guru of loaders and linkers it should not be so difficult to patch it.. but not for a student who has almost just heard about them for the first time ;)

Any help is very appreciated :) Thank you again! Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
2011/1/17 Stefano Bonifazi stefboombas...@gmail.com: Hi! Thank you very much for Your concern! Honestly I had lost hope in any help, I even contacted directly some developers in this mailing list without luck! I guess many good developers in mailing list are still try their best to solve your problems, such as Blue Swirl, Paolo Bonzini, Stefan Weil, Peter Maydell, Mulyadi Santosa, Andreas Färber and Alexander Graf (hope I won't lost anyone that had helped you, and the order of name list without any meaning) ...etc., every developer has his expertises, and it is hard to recognize all of the activities of qemu. Please trust one thing: you are not alone:). I am a student who needs to use qemu for a project where it will be used for its capabilities of running PowerPC code. As you can imagine qemu goes far beyond the knowledge in electronics and computer science of a student. Nevertheless I have to do that! I have been studying all the possible technical documents available in the internet, but it is really not much at all , not sufficient for getting the code and being able of understanding it .. It is in C, even not modular C++ Due to the lack of tehnical document of qemu and you are a student (maybe study for master/phd degree?), some literatures that published on IEEE/ACM may give you some inspiration and help (suppose that your university have bought the authority for download). As I know, though the issue of qemu is relative new for the academia, there still are literatures have been discussed. Maybe you can find which research domain categorized that is most approximative to your works. If any literature has inspired you or related to your research, don't hasitate to discuss. Anyway with some help from this mailing list, and a lot of studying about assembly, loaders, compilers.. I am going on, though there are still big problems due of the nature of the QEMU code.. 
First of all, I am starting from qemu-user, more specifically, qemu-ppc as I don't need the full system capabilities, and it is easier for me to control the binary target memory with qemu-user. Is there any reason why should you use the user mode of qemu, not the system mode? Sometime, the system mode of qemu will release you from the nightmare for managing the memory hierarchy. Maybe you can start from talking about what is the original goal of the project instead of falling into the hell of code tracing. Originally I started with a lot of work on libqemu .. until some developer here told me it was deprecated (though still in the source) and not working fine. I edited the code of qemu-ppc so that another function of mine calls qemu-user main, with the appropriate parameters.. The pursued goal was to launch it several times with different target binaries in succession.. For some reason, I still can't find out, qemu code remembers the old code, running it instead of the new loaded binary.. and if I flush the cache of translated code before loading a new binary it stops and can't go on! My workaround to this problem was compiling qemu-ppc as a dynamic library and load it at runtime.. I also managed to load multiple copies of it (with dlmopen each at a different address space) ..in fact I need to run more than one qemu-ppc at the same time but a new big problem popped up now: the I need to thanks the Peter Maydell explained the principle that I'm not familiar with. And from your description, would you want to invoke multi-cores? Because I cannot imagine which application need to run multiple qemu-ppc at the same time. target binary is loaded always at a fixed address.. no matter if another qemu-ppc already loaded code there.. it is like the internal elf loader can't understand those addresses are not available, and then relocate them .. 
I tried to link (ld) the binary target elf as position independent code, but then qemu-ppc complains it can't find /usr/lib/libc.so.1 and /usr/lib/ld.so.1 The above description seems to be out of my scope to answer, because I only studied on system mode of qemu. To sum up the problems are (in order of importance): - making the elf loader relocate the target code into other addresses when the default ones (I guess those embedded into the target binary when it is not compiled as position independent code) are taken Maybe the problem only can be solved by re-write the loader if you insist to use user mode. (just as your response to Peter) - making qemu-user able of running more than one target binary in succession Will more than one target binary in succession (assume A then B then C) be achieved by compile ABC into one binary in sequence? - counting qemu-user executed instructions I guess all the works before this are for the goal: counting qemu-user executed instructions, am I right? If so, the paper published in IEEE 2010 maybe give some help (I guess) http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5475901 (Make sure that your university
Re: [Qemu-devel] TCG flow vs dyngen
2011/1/16 Stefano Bonifazi stefboombas...@gmail.com: I need to make the different instances of qemu-user exchange data .. obviously keeping all of them in the same address space would be the easiest way (unless I have to change all qemu code ;) ) The problem is that you're trying to break a fundamental assumption made by a lot of qemu code. That's a large job which involves understanding, checking and possibly changing lots of already written code. In contrast, the code you need to exchange data between the instances is going to be fairly small and self contained and you'll already understand it because you've written it/will write it. I think it's pretty clear which one is going to be easier. Running each qemu as its own process and using interprocess communication for whatever coordination you need between the various instances seems more likely to be workable to me. Exactly, it was the easiest way also for me.. and I've already done it, works smoothly .. the only big problem is that it is not good for my teacher.. he says it should work the dynamic library way o.O I think he's wrong. (You might like to think about what happens if the program being emulated in qemu user-mode does a fork()). Basically you're trying to do things the hard way; maybe you can get something that sort of works in the subset of cases you care about, but why on earth put in that much time and effort on something irrelevant to the actual problem you're trying to work on? -- PMM
Re: [Qemu-devel] TCG flow vs dyngen
Hi! In case you are interested in helping me, I'll give you a big piece of news I've just got (even my teacher is not informed yet! :) ) I've just managed to make more than one instance of qemu-user run at the same time by linking the target code with a specified address for the code section (the -Ttext address option of ld). It works fine and this supports my idea that the problem is within the elf loader.. Making it relocate the target code properly would fix the problem ;) Now let's work on it :)

Regards, Stefano B.

On 01/16/2011 08:02 PM, Stefano Bonifazi wrote:

Thank you very much for Your fast reply!

On 01/16/2011 07:29 PM, Peter Maydell wrote:

Linux doesn't seem to have dlmopen

http://www.unix.com/man-page/All/3c/dlmopen/

    #define _GNU_SOURCE
    #include <dlfcn.h>
    lib_handle1 = dlmopen(LM_ID_NEWLM, "./libqemu-ppc.so", RTLD_NOW);

I am developing that on a clean ubuntu 10.10

but google suggests that it puts the library in its own namespace but not its own address space.

I need to make the different instances of qemu-user exchange data .. obviously keeping all of them in the same address space would be the easiest way (unless I have to change all qemu code ;) )

Running each qemu as its own process and using interprocess communication for whatever coordination you need between the various instances seems more likely to be workable to me. This will also fix your can't run more than one binary in succession problem, because you can just have the first qemu run and exit as normal and launch a second qemu to run the second binary. -- PMM

Exactly, it was the easiest way also for me.. and I've already done it, works smoothly .. the only big problem is that it is not good for my teacher.. he says it should work the dynamic library way o.O

Working with libraries even solved the problem of consecutive runs, though in my opinion software is not good when you must restart it to make it run fine again.. sounds more Windows style :D Clearly it leaves memory dirty and does not clean up after the target process completes its execution.. leaving the OS to take care of it. I tried zeroing all global variables before starting a new execution without results (other than making it stall) ..

After a very long time spent trying to find a solution, I think the problem should be with the mmap'ing stuff in the loader .. the same reason why, in my opinion, 2 different libraries with their own namespaces clash.. the elf loaders work globally within the unique address space .. I think for a guru of loaders and linkers it should not be so difficult to patch it.. but not for a student who has almost just heard about them for the first time ;)

Any help is very appreciated :) Thank you again! Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
2011/1/17 Stefano Bonifazi stefboombas...@gmail.com: Hi! In case you are interested in helping me, I'll give you a big piece of news I've just got (even my teacher is not informed yet! :) ) I've just managed to make more than one instance of qemu-user run at the same time linking the target code with a specified address for the code section (-Ttext address of ld). It works fine and this proves my idea that the problem is within the elf loader.. Making it relocate the target code properly would fix the problem ;) Now let's work on it :) Regards, Stefano B. Congratulation~ just keep going on~! Raphaël Lefèvre
Re: [Qemu-devel] TCG flow vs dyngen
On 12/11/2010 03:44 PM, Blue Swirl wrote:

On Sat, Dec 11, 2010 at 2:32 PM, Stefano Bonifazi stefboombas...@gmail.com wrote:

Where does the execution of host binary take place in the previous list of events? Between point 5) and 6) ? After 6) ? In what QEMU source code file/function does the final execution of host binary take place? In the previous list of events, when does the translator try to chain the current TB with previous ones? Before TCG generates the binary in order to feed it with linked micro code?

All of this happens in cpu-exec.c:581 to 618.

Hi! Thank you very much! Knowing exactly where I should check in such a big project helped me very much!! Anyway, after having spent more than 2 days on that code I still can't understand how the real execution works. In cpu-exec.c : cpu_exec_nocache I find:

    /* execute the generated code */
    next_tb = tcg_qemu_tb_exec(tb->tc_ptr);

and in cpu-exec.c : cpu_exec:

    /* execute the generated code */
    next_tb = tcg_qemu_tb_exec(tc_ptr);

so I thought the tcg_qemu_tb_exec function should do the work of executing the translated binary in the host. But then I found out it is just a define in tcg.h:

    #define tcg_qemu_tb_exec(tb_ptr) ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

and again in exec.c:

    uint8_t code_gen_prologue[1024] code_gen_section;

Maybe I have some problems with that C syntax, but I really don't understand what happens there.. how the execution happens!

Maybe I am too stuck on my idea of a common fetch-decode-execute emulator, where an addition would be implemented simply as

    env->regC = env->regA + env->regB

... where this C statement would be compiled offline into host machine binary by the host compiler.. so the emulator would be a monolithic block of host code, just with branches for the different opcodes that come from the target binary loaded at runtime..

Here instead with QEMU/TCG I understood that at runtime the target binary is translated into host binary (somehow) .. but then.. how can this new host binary be run? Shall the host code at runtime do some sort of (assembly speaking) branch jump to an area of memory with new host binary instructions .. and then jump back to the old process binary code? If so, can you explain to me how this happens in those lines of code?

I am just a student.. unluckily at university they just tell you that a cpu follows some sort of fetch-decode-execute flow .. but then you open QEMU.. and wow there is a huge gap for understanding it, and no books where to study it! ;)

Please help me understand it :) Thank you very very much in advance! Stefano B.
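The C syntax asked about above can be tried in isolation. In the quoted macro, code_gen_prologue is a byte array that QEMU fills with host machine code; the cast turns its address into a pointer to a function taking void * and returning long, and the trailing (tb_ptr) is an ordinary indirect call, so control jumps into the buffer and comes back when that code returns. The sketch below keeps the same cast-then-call shape but, as an assumption made for portability, uses a normal C function in place of run-time generated code; fake_prologue, code_entry, toy_tb_exec and run_block are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>

/* Generic function-pointer type used to hold "some code" without
 * committing to its real signature (stand-in for a raw code address). */
typedef void (*generic_fn)(void);

/* Stand-in for the generated prologue. In QEMU this would be host
 * machine code written into a buffer at run time; here it is a plain
 * C function so the example stays portable. */
static long fake_prologue(void *tb_ptr)
{
    /* Pretend to "run" the block: read a value from it and add one. */
    return *(long *)tb_ptr + 1;
}

static generic_fn code_entry = (generic_fn)fake_prologue;

/* Same shape as the tcg.h macro: cast the entry point to
 * "function taking void * and returning long", then call it. */
#define toy_tb_exec(tb_ptr) (((long (*)(void *))code_entry)(tb_ptr))

static long run_block(long *block)
{
    assert(block != NULL);
    return toy_tb_exec(block);   /* indirect call, returns like any call */
}
```

With a long holding 41, run_block returns 42. The real macro differs in that the callee is data turned into code, which also requires the buffer to live in executable memory.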
Re: [Qemu-devel] TCG flow vs dyngen
On Fri, Dec 10, 2010 at 9:26 PM, Stefano Bonifazi stefboombas...@gmail.com wrote: Hi all! From the technical documentation (http://www.usenix.org/publications/library/proceedings/usenix05/tech/freenix/bellard.html) I read: The first step is to split each target CPU instruction into fewer simpler instructions called micro operations. Each micro operation is implemented by a small piece of C code. This small C source code is compiled by GCC to an object file. The micro operations are chosen so that their number is much smaller (typically a few hundreds) than all the combinations of instructions and operands of the target CPU. The translation from target CPU instructions to micro operations is done entirely with hand coded code. A compile time tool called dyngen uses the object file containing the micro operations as input to generate a dynamic code generator. This dynamic code generator is invoked at runtime to generate a complete host function which concatenates several micro operations. instead from wikipedia(http://en.wikipedia.org/wiki/QEMU) and other sources I read: The Tiny Code Generator (TCG) aims to remove the shortcoming of relying on a particular version of GCC or any compiler, instead incorporating the compiler (code generator) into other tasks performed by QEMU in run-time. The whole translation task thus consists of two parts: blocks of target code (TBs) being rewritten in TCG ops - a kind of machine-independent intermediate notation, and subsequently this notation being compiled for the host's architecture by TCG. Optional optimisation passes are performed between them. - So, I think that the technical documentation is now obsolete, isn't it? At least we shouldn't link to that paper anymore. There's also documentation generated from qemu-tech.texi that should be up to date. 
- The old way used much offline (compile time) work compiling the micro operations into host machine code, while if I understand well, TCG does everything in run-time (please correct me if I am wrong!).. so I wonder, how can it be as fast as the previous method (or even faster)?

The dyngen way was to extract machine instructions for each micro-op from an object file (op.o) compiled by GCC during QEMU build. TCG instead generates the instructions directly. Since the whole host register set is available for the micro-ops (in contrast to fixed T0/T1/T2 used by dyngen), TCG should outperform dyngen in some cases. In other cases, GCC may have used some optimization when generating the op which would be too complex to implement by the TCG generator so the dyngen op may have been more optimal. The old way was not portable to GCC 4.x series. Now it might be even possible to replace GCC extensions with something else and use other compilers.

- If I understand well, TCG runtime flow is the following:
- TCG takes the target binary, and splits it into target blocks
- if the TB is not cached, TCG translates it (or better the target instructions it is composed by) into TCG micro ops,

The above is not the job of TCG (which is host specific), but the target specific translators (target-*/translate.c).

- TCG compiles TCG uops into host object code,

OK.

- TCG caches the TB,
- TCG tries to chain the block with others,

The above is part of the CPU execution loop (cpu-exec.c), TCG is not involved anymore.

- TCG copies the TB into the execution buffer

There is no copying.

- TCG runs it

Am I right? Please correct me if I am wrong, as I want to use that flow scheme for trying to understand the code..

Otherwise right.
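The split described above, a target-specific front end that produces intermediate ops and a host-specific back end that consumes them, can be pictured with a toy. Everything here is an assumption made for illustration: the guest instruction, the three opcodes and all names are invented, and the "back end" simply walks the ops instead of emitting host machine code as TCG does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy guest instruction: regs[rd] = regs[rs] + imm */
typedef struct {
    uint8_t rd, rs;
    int32_t imm;
} GuestAddi;

/* Toy intermediate ops, loosely playing the role of TCG ops. */
typedef enum { OP_LOAD_REG, OP_ADD_IMM, OP_STORE_REG } ToyOpcode;

typedef struct {
    ToyOpcode opc;
    int32_t arg;
} ToyOp;

/* Front end: the role played by the target-specific translate.c.
 * Lowers one guest instruction into three simpler ops. */
static size_t toy_translate_addi(const GuestAddi *insn, ToyOp *out)
{
    assert(insn != NULL && out != NULL);
    out[0] = (ToyOp){ OP_LOAD_REG,  insn->rs };
    out[1] = (ToyOp){ OP_ADD_IMM,   insn->imm };
    out[2] = (ToyOp){ OP_STORE_REG, insn->rd };
    return 3;
}

/* Back-end stand-in: real TCG would turn the ops into host code once
 * and run that; this toy interprets the ops directly instead. */
static void toy_run_ops(const ToyOp *ops, size_t n, int32_t *regs)
{
    assert(ops != NULL && regs != NULL);
    int32_t tmp = 0;
    for (size_t i = 0; i < n; i++) {
        switch (ops[i].opc) {
        case OP_LOAD_REG:  tmp = regs[ops[i].arg];  break;
        case OP_ADD_IMM:   tmp += ops[i].arg;       break;
        case OP_STORE_REG: regs[ops[i].arg] = tmp;  break;
        }
    }
}
```

The point of the separation is that only toy_translate_addi knows about the guest instruction set, so adding a new target would not touch the back end, and adding a new host would not touch the front end.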
Re: [Qemu-devel] TCG flow vs dyngen
Thank you very very much! It would take me months to understand everything myself from the source code! :)

On 12/11/2010 12:02 PM, Blue Swirl wrote:

On Fri, Dec 10, 2010 at 9:26 PM, Stefano Bonifazi stefboombas...@gmail.com wrote: [..]

- So, I think that the technical documentation is now obsolete, isn't it?

At least we shouldn't link to that paper anymore. There's also documentation generated from qemu-tech.texi that should be up to date.

Do you mean this: http://www.weilnetz.de/qemu-tech.html ?

- If I understand well, TCG runtime flow is the following:
- TCG takes the target binary, and splits it into target blocks
- if the TB is not cached, TCG translates it (or better the target instructions it is composed by) into TCG micro ops,

The above is not the job of TCG (which is host specific), but the target specific translators (target-*/translate.c).

Ok, then considering QEMU flow instead of simply TCG, do those steps take place in the order I considered?

- TCG caches the TB,
- TCG tries to chain the block with others,

The above is part of the CPU execution loop (cpu-exec.c), TCG is not involved anymore.

Ok! Thank you, now I have a clearer idea of where different operations are implemented.. but again considering the whole QEMU flow, are the steps I reported executed in the order I put them?

- TCG copies the TB into the execution buffer

There is no copying.

Does that mean TCG produces the host object code directly into the emulator's memory for it to fetch? Or does TCG make the emulator even execute that object code as soon as it is produced? But, if the object code is consumed on the fly, it means there is no caching of it, is there? What is actually cached? Only target blocks? Their translation into TCG uops? Host binary code generated by TCG?

Again many many thanks!!! Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
On Sat, Dec 11, 2010 at 12:29 PM, Stefano Bonifazi stefboombas...@gmail.com wrote:

Thank you very very much! It would take me months to understand everything myself from the source code! :)

On 12/11/2010 12:02 PM, Blue Swirl wrote:

On Fri, Dec 10, 2010 at 9:26 PM, Stefano Bonifazi stefboombas...@gmail.com wrote: [..]

- So, I think that the technical documentation is now obsolete, isn't it?

At least we shouldn't link to that paper anymore. There's also documentation generated from qemu-tech.texi that should be up to date.

Do you mean this: http://www.weilnetz.de/qemu-tech.html ?

Yes.

- If I understand well, TCG runtime flow is the following:
- TCG takes the target binary, and splits it into target blocks
- if the TB is not cached, TCG translates it (or better the target instructions it is composed by) into TCG micro ops,

The above is not the job of TCG (which is host specific), but the target specific translators (target-*/translate.c).

Ok, then considering QEMU flow instead of simply TCG, do those steps take place in the order I considered?

Yes, that's about it.

- TCG caches the TB,
- TCG tries to chain the block with others,

The above is part of the CPU execution loop (cpu-exec.c), TCG is not involved anymore.

Ok! Thank you, now I have a clearer idea of where different operations are implemented.. but again considering the whole QEMU flow, are the steps I reported executed in the order I put them?

- TCG copies the TB into the execution buffer

There is no copying.

Does that mean TCG produces the host object code directly into the emulator's memory for it to fetch? Or does TCG make the emulator even execute that object code as soon as it is produced? But, if the object code is consumed on the fly, it means there is no caching of it, is there? What is actually cached? Only target blocks? Their translation into TCG uops? Host binary code generated by TCG?

There's a large buffer for generated code, allocated in exec.c. This is filled with host code by TCG; when it is full, it is flushed. The CPU execution loop generates new TBs when needed, otherwise the old code can be executed. TCG also uses intermediate ops but those are used only once during translation.
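The "fill until full, then flush everything" policy described above can be sketched as a bump allocator over one fixed buffer. This is only a sketch under stated assumptions: the structure, the names and the deliberately tiny 64-byte size are invented for illustration and do not mirror the actual layout in exec.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CODE_BUF_SIZE 64   /* tiny on purpose, so a flush is easy to trigger */

typedef struct {
    uint8_t  buf[CODE_BUF_SIZE]; /* where generated host code would be written */
    size_t   used;               /* bytes handed out so far */
    unsigned flush_count;        /* how many times everything was discarded */
} CodeBuffer;

/* Throw away all previously generated code in one go. */
static void code_buffer_flush(CodeBuffer *cb)
{
    cb->used = 0;
    cb->flush_count++;
}

/* Reserve len bytes for a new block. If the buffer cannot hold it,
 * flush first. Returns NULL only if len can never fit at all. */
static uint8_t *code_buffer_alloc(CodeBuffer *cb, size_t len)
{
    assert(cb != NULL);
    if (len > CODE_BUF_SIZE) {
        return NULL;
    }
    if (cb->used + len > CODE_BUF_SIZE) {
        code_buffer_flush(cb);
    }
    uint8_t *p = cb->buf + cb->used;
    cb->used += len;
    return p;
}
```

Flushing everything at once is crude, but it avoids tracking which blocks still point at each other, which is one plausible reason for such a policy. After a flush, any lookup table of blocks has to be emptied too, since all earlier pointers into the buffer are stale; that bookkeeping is left out here.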
RE: [Qemu-devel] TCG flow vs dyngen
-----Original Message-----
From: Blue Swirl [mailto:blauwir...@gmail.com]
Sent: Saturday, 11 December 2010 14:12
To: Stefano Bonifazi
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] TCG flow vs dyngen

There's a large buffer for generated code, allocated in exec.c. This is filled with host code by TCG, when full it is flushed. The CPU execution loop generates new TBs when needed, otherwise the old code can be executed. TCG also uses intermediate ops but those are used only once during translation.

So if I understand well the flow is the following:
1) the CPU execution loop at runtime takes a new TB from the target code
2) I guess some hash function is computed on this TB for getting a key for searching into the buffer of generated code, which probably stores the binary as a key-to-binary map
3) if the search is successful the binary is given to the translator (how? You said no copy involved) and we return to point 1), otherwise:
4) the target specific translator generates TCG uops from the TB
5) TCG uses uops for generating host binary code
6) this new binary code is cached by TCG if there is enough storage space
Is that all correct?

Where does the execution of host binary take place in the previous list of events? Between point 5) and 6) ? After 6) ? In what QEMU source code file/function does the final execution of host binary take place? In the previous list of events, when does the translator try to chain the current TB with previous ones? Before TCG generates the binary in order to feed it with linked micro code?

Thank you very very much! :) Stefano B.
Re: [Qemu-devel] TCG flow vs dyngen
On Sat, Dec 11, 2010 at 2:32 PM, Stefano Bonifazi stefboombas...@gmail.com wrote:

-----Original Message-----
From: Blue Swirl [mailto:blauwir...@gmail.com]
Sent: Saturday, 11 December 2010 14:12
To: Stefano Bonifazi
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] TCG flow vs dyngen

There's a large buffer for generated code, allocated in exec.c. This is filled with host code by TCG, when full it is flushed. The CPU execution loop generates new TBs when needed, otherwise the old code can be executed. TCG also uses intermediate ops but those are used only once during translation.

So if I understand well the flow is the following:
1) the CPU execution loop at runtime takes a new TB from the target code
2) I guess some hash function is computed on this TB for getting a key for searching into the buffer of generated code, which probably stores the binary as a key-to-binary map
3) if the search is successful the binary is given to the translator (how? You said no copy involved) and we return to point 1), otherwise:

1-3) Please see tb_find_fast() and its caller in cpu-exec.c. Only pointer passing is involved.

4) the target specific translator generates TCG uops from the TB
5) TCG uses uops for generating host binary code
6) this new binary code is cached by TCG if there is enough storage space
Is that all correct?

4-5) OK. 6) If there is no space, all previously generated code is thrown away.

Where does the execution of host binary take place in the previous list of events? Between point 5) and 6) ? After 6) ? In what QEMU source code file/function does the final execution of host binary take place? In the previous list of events, when does the translator try to chain the current TB with previous ones? Before TCG generates the binary in order to feed it with linked micro code?

All of this happens in cpu-exec.c:581 to 618.
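The "only pointer passing is involved" answer above can be illustrated with a small lookup keyed by the guest program counter. This is a simplified sketch, not the code of tb_find_fast(): the direct-mapped table, the hash, the fixed pool and every name below are assumptions, and a real lookup has to compare more state than the PC alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TB_CACHE_BITS 4
#define TB_CACHE_SIZE (1u << TB_CACHE_BITS)
#define TB_POOL_SIZE  8

/* Toy translation block: which guest address it covers and where its
 * host code lives. Only pointers to these are ever handed around. */
typedef struct {
    uint32_t    pc;
    const void *tc_ptr;
} ToyTB;

typedef struct {
    ToyTB   *slots[TB_CACHE_SIZE]; /* direct-mapped: one candidate per hash */
    unsigned translations;         /* counts cache misses */
} ToyTBCache;

static ToyTB    toy_pool[TB_POOL_SIZE];
static unsigned toy_pool_used;

/* Stand-in for the translator: hands out a fresh block for pc. */
static ToyTB *toy_translate(uint32_t pc)
{
    assert(toy_pool_used < TB_POOL_SIZE);
    ToyTB *tb = &toy_pool[toy_pool_used++];
    tb->pc = pc;
    tb->tc_ptr = NULL;   /* a real block would point into the code buffer */
    return tb;
}

static unsigned toy_hash(uint32_t pc)
{
    return (pc >> 2) & (TB_CACHE_SIZE - 1);
}

/* Return the block for pc, translating only when the cache misses. */
static ToyTB *toy_tb_find(ToyTBCache *c, uint32_t pc)
{
    assert(c != NULL);
    unsigned h = toy_hash(pc);
    ToyTB *tb = c->slots[h];
    if (tb == NULL || tb->pc != pc) {
        tb = toy_translate(pc);
        c->translations++;
        c->slots[h] = tb;
    }
    return tb;
}
```

Looking up the same PC twice returns the same pointer and translates once, which is the sense in which a hit costs no copying: the execution loop just passes the block's code pointer to the call that runs it.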
[Qemu-devel] TCG flow vs dyngen
Hi all!

From the technical documentation (http://www.usenix.org/publications/library/proceedings/usenix05/tech/freenix/bellard.html) I read:

The first step is to split each target CPU instruction into fewer simpler instructions called /micro operations/. Each micro operation is implemented by a small piece of C code. This small C source code is compiled by GCC to an object file. The micro operations are chosen so that their number is much smaller (typically a few hundreds) than all the combinations of instructions and operands of the target CPU. The translation from target CPU instructions to micro operations is done entirely with hand coded code. A compile time tool called dyngen uses the object file containing the micro operations as input to generate a dynamic code generator. This dynamic code generator is invoked at runtime to generate a complete host function which concatenates several micro operations.

Instead, from wikipedia (http://en.wikipedia.org/wiki/QEMU) and other sources I read:

The Tiny Code Generator (TCG) aims to remove the shortcoming of relying on a particular version of GCC or any compiler, instead incorporating the compiler (code generator) into other tasks performed by QEMU in run-time. The whole translation task thus consists of two parts: blocks of target code (/TBs/) being rewritten in *TCG ops* - a kind of machine-independent intermediate notation, and subsequently this notation being compiled for the host's architecture by TCG. Optional optimisation passes are performed between them.

- So, I think that the technical documentation is now obsolete, isn't it?

- The old way used much offline (compile time) work compiling the micro operations into host machine code, while if I understand well, TCG does everything in run-time (please correct me if I am wrong!).. so I wonder, how can it be as fast as the previous method (or even faster)?

- If I understand well, TCG runtime flow is the following:
- TCG takes the target binary, and splits it into target blocks
- if the TB is not cached, TCG translates it (or better the target instructions it is composed by) into TCG micro ops,
- TCG compiles TCG uops into host object code,
- TCG caches the TB,
- TCG tries to chain the block with others,
- TCG copies the TB into the execution buffer
- TCG runs it

Am I right? Please correct me if I am wrong, as I want to use that flow scheme for trying to understand the code..

Thank you very much in advance! Stefano B.