Re: [Qemu-devel] TCG flow vs dyngen

2011-01-25 Thread Stefano Bonifazi



That said, QEMU's currently working fairly well on this front too, so
studying either should work pretty well...

Mr Richard Henderson's patch to elfload.c shows I was right.. at least 
the version I am working on (qemu-0.13.0) had some bugs and weaknesses, 
though it worked smoothly for most cases..



And to be honest, the best way to get up to speed on this is to read this:

   http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html

Hmm, seems like a good piece.. maybe one of the last ones I was still missing :)
Thank you!!
Stefano B.



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-25 Thread Stefano Bonifazi
Again wow!! Is that really possible? Some sort of callback triggered at 
every instruction execution?
Yes, this mechanism works. I have written code to count different 
kinds of instructions. 

Great! That opens a lot of possibilities!
It exists in the file qemu/target-i386/translate.c 
Oops, right! I checked target-ppc/translate.c as I need PowerPC as the 
target.. I wonder what function replaces it there..
You are also talking about the qemu source code provided here 
http://wiki.qemu.org/Download, right?

Yes, I am using this: http://wiki.qemu.org/download/qemu-0.13.0.tar.gz
If you need it, I can give you the source code of the counting implementation 
with some documentation.

Hope this helps.


Wow, that would be awesome! I'd really appreciate it very much! Thank you! :)
Feel free to send it to my address! :)

Best regards!!
Stefano B.




Re: [Qemu-devel] TCG flow vs dyngen

2011-01-25 Thread Stefano Bonifazi

On 01/25/2011 10:05 AM, Edgar E. Iglesias wrote:

On Tue, Jan 25, 2011 at 10:04:39AM +0100, Stefano Bonifazi wrote:

Again wow!! Is that really possible? Some sort of callback triggered at
every instruction execution?

Yes, this mechanism works. I have written code to count different
kinds of instructions.

Great! That opens a lot of possibilities!

It exists in the file qemu/target-i386/translate.c

Oops, right! I checked target-ppc/translate.c as I need PowerPC as the
target.. I wonder what function replaces it there..

You are also talking about the qemu source code provided here
http://wiki.qemu.org/Download, right?

Yes, I am using this: http://wiki.qemu.org/download/qemu-0.13.0.tar.gz

If you need it, I can give you the source code of the counting implementation
with some documentation.
Hope this helps.


Wow, that would be awesome! I'd really appreciate it very much! Thank you! :)
Feel free to send it to my address! :)

Hi,

If you are interested in instruction counting maybe you should take
a look at the -icount option as well.

Cheers

Thank you!
I already tried it long ago; it doesn't work with qemu-user.. If I remember 
correctly, its core was in files not used by qemu-user :(

Regards,
Stefano B.



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-25 Thread Edgar E. Iglesias
On Tue, Jan 25, 2011 at 10:04:39AM +0100, Stefano Bonifazi wrote:
 Again wow!! Is that really possible? Some sort of callback triggered at 
 every instruction execution?
  Yes, this mechanism works. I have written code to count different 
  kinds of instructions. 
 Great! That opens a lot of possibilities!
  It exists in the file qemu/target-i386/translate.c 
 Oops, right! I checked target-ppc/translate.c as I need PowerPC as the 
 target.. I wonder what function replaces it there..
  You are also talking about the qemu source code provided here 
  http://wiki.qemu.org/Download, right?
 Yes, I am using this: http://wiki.qemu.org/download/qemu-0.13.0.tar.gz
  If you need it, I can give you the source code of the counting implementation 
  with some documentation.
  Hope this helps.
 
 Wow, that would be awesome! I'd really appreciate it very much! Thank you! :)
 Feel free to send it to my address! :)

Hi,

If you are interested in instruction counting maybe you should take
a look at the -icount option as well.

Cheers



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Stefano Bonifazi

On 01/24/2011 12:40 AM, Rob Landley wrote:

On 01/23/2011 04:25 PM, Stefano Bonifazi wrote:

I am trying to shift in memory the target executable .. now the code is
supposed to be loaded by the elfloader at the exact start address set
at link time ..

Ah, elf loading.  That's a whole 'nother bag of worms.

Oddly enough, I was dealing with this last year trying to debug the
uClibc dynamic linker.  I blogged a bit about it at the time:

   http://landley.net/notes-2010.html#12-07-2010

(And the next few days.  Sigh, I never did go back and fill in the
holes, did I?)


Inside the ELF loader there is even a check verifying whether that
address range is busy.. but no action is taken in that case o.O
Maybe I'll post a new thread about this problem (bug?).. in any case, if you
think you can help me I'll give you further details..

Tired right now, but if you post a clearer question (what are you trying
to _do_) and cc: me on it I'll try to respond.

Maybe I can find some decent documentation to point you at, or maybe
I'll write some...

Rob

Thank you!
 I read your post, and yup you also noticed the weirdness of load_bias.. 
and wondered how it can work on x86..

But I think your work was on qemu-system.. I am working on qemu-user..
Yup better to post a new thread, I'll cc: you there!
Thank you very much!
Stefano B



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Peter Maydell
2011/1/23 Rob Landley r...@landley.net:
 Keep in mind I'm a bit rusty and not an expert, but I'll give a stab at
 answering:

...here's a couple of clarifications:

 2. how can I check the number of target cpu cycles or target
 instructions executed inside qemu-user (i.e. qemu-ppc)?

 You can't, because QEMU doesn't work that way. QEMU isn't an
 instruction level emulator, it's closer to a Java JIT.

Being a JIT doesn't prohibit counting target instructions executed.
It just means that counting them generally requires generating
code to do the counting at runtime, so it's a more complicated
change to make than it would be in a non-JIT emulator.
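As a concrete illustration of "generating code to do the counting", here is a
rough sketch of the kind of change that could be made in a qemu-0.13-era tree.
The helper name, the counter variable and the exact place the call is emitted
are assumptions for illustration, not existing QEMU symbols:

    /* In the per-target helper.h list: declare a new helper for TCG. */
    DEF_HELPER_0(insn_count, void)

    /* In op_helper.c (or a new file): the code that runs at execution time. */
    uint64_t insn_counter;              /* hypothetical global counter */

    void helper_insn_count(void)
    {
        insn_counter++;
    }

    /* In target-ppc/translate.c, inside the per-instruction translation loop,
     * emit a call to the helper before translating each guest instruction: */
    gen_helper_insn_count();

Because the call is emitted into every translation block, the counter is bumped
each time an instruction is executed, not just each time one is translated.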

The major reason for not counting cycles is that for an emulation
of a modern CPU this is pretty nearly impossible: the number
of cycles an instruction takes can depend on whether it causes
a cache miss, which CPU internal pipeline it uses, whether it
needs to stall waiting for a result from an earlier insn, whether
the CPU correctly predicted the branch leading up to it or not,
and on and on. You would need to precisely model all the
internals of each variant of each CPU, which would be a
mammoth undertaking requiring probably unpublished internal
data, and if you ever managed to finish it then it would run
incredibly slowly and would probably contain enough bugs you
couldn't trust the data it gave you anyway.

 This means that QEMU can
 no longer run on a type of host it can't execute target code for

This isn't correct; for instance there's hppa support in TCG for hppa
hosts but no hppa target support, and there's sh4 target support
but no TCG backend for it. The two ends are cleanly separated in
qemu and don't generally depend on each other.

-- PMM



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Stefano Bonifazi

On 01/24/2011 03:32 PM, Peter Maydell wrote:


Being a JIT doesn't prohibit counting target instructions executed.
It just means that counting them generally requires generating
code to do the counting at runtime, so it's a more complicated
change to make than it would be in a non-JIT emulator.

What do you mean? Should I change the code of qemu-user for counting the 
instructions, or should I add code into the target binaries?

The major reason for not counting cycles is that for an emulation
of a modern CPU this is pretty nearly impossible: the number
of cycles an instruction takes can depend on whether it causes
a cache miss, which CPU internal pipeline it uses, whether it
needs to stall waiting for a result from an earlier insn, whether
the CPU correctly predicted the branch leading up to it or not,
and on and on. You would need to precisely model all the
internals of each variant of each CPU, which would be a
mammoth undertaking requiring probably unpublished internal
data, and if you ever managed to finish it then it would run
incredibly slowly and would probably contain enough bugs you
couldn't trust the data it gave you anyway.

Yup, I think it was just a silly mistake of mine to write "cycles" in the 
first post.. that was because anything that can estimate how 
long the work takes would be fine for me.. I can't simply check the 
time because that is host-machine dependent... The number of executed 
instructions would be fine..

This means that QEMU can
no longer run on a type of host it can't execute target code for

This isn't correct; for instance there's hppa support in TCG for hppa
hosts but no hppa target support, and there's sh4 target support
but no TCG backend for it. The two ends are cleanly separated in
qemu and don't generally depend on each other.

Well, I experienced a strange behaviour some time ago that initially made 
me think Mr Rob was right on that, though I knew host support and target 
support were separated in qemu: I tried to run make for qemu-ppc directly on an 
x86_64 machine from inside the ppc-linux-user folder (I can do that fine on an 
x86 machine) and it failed because there was no tcg/x86_64/tcg_target.h, 
whereas doing the make from within the main folder worked.
So I do not understand very well.. is there some fixup of the required 
headers when using the main makefile?

 Best regards!
Stefano B.



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Lluís
Stefano Bonifazi writes:

 On 01/24/2011 03:32 PM, Peter Maydell wrote:

 Being a JIT doesn't prohibit counting target instructions executed.
 It just means that counting them generally requires generating
 code to do the counting at runtime, so it's a more complicated
 change to make than it would be in a non-JIT emulator.

 What do you mean? Should I change the code of qemu-user for counting the
 instructions, or should I add code into the target binaries?


If I recall this correctly, target-i386 has a generic function (whose
name I don't remember) called whenever the rdtsc instruction is
executed.

This function rebuilds the counter that contains the number of executed
instructions (more or less, this number can be tuned from a variety of
sources).


Lluis

--
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Dushyant Bansal

On Monday 24 January 2011 08:26 PM, Stefano Bonifazi wrote:

On 01/24/2011 03:32 PM, Peter Maydell wrote:


Being a JIT doesn't prohibit counting target instructions executed.
It just means that counting them generally requires generating
code to do the counting at runtime, so it's a more complicated
change to make than it would be in a non-JIT emulator.

What do you mean? Should I change the code of qemu-user for counting 
the instructions, or should I add code into the target binaries?
You should see this pdf 
(www.ecs.syr.edu/faculty/yin/Teaching/TC2010/Proj4.pdf). It talks about 
tracing the instructions.


--
Dushyant


Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Rob Landley
On 01/24/2011 04:17 AM, Stefano Bonifazi wrote:
  I read your post, and yup you also noticed the weirdness of load_bias.. and
 wondered how it can work on x86..
 But I think your work was on qemu-system.. I am working on qemu-user..

My post wasn't about qemu-anything; it was while I was trying to debug the
uClibc dynamic loader on a new platform (the Qualcomm Hexagon) for which
Linux support still hasn't gone upstream yet.

The thing is, the kernel currently _does_ work, so studying the relevant
kernel code (and possibly the dynamic loader code) is one way to learn
how it currently works.

Rob



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Stefano Bonifazi

On 01/24/2011 07:02 PM, Dushyant Bansal wrote:

On Monday 24 January 2011 08:26 PM, Stefano Bonifazi wrote:

On 01/24/2011 03:32 PM, Peter Maydell wrote:


Being a JIT doesn't prohibit counting target instructions executed.
It just means that counting them generally requires generating
code to do the counting at runtime, so it's a more complicated
change to make than it would be in a non-JIT emulator.

What do you mean? Should I change the code of qemu-user for counting 
the instructions, or should I add code into the target binaries?
You should see this pdf 
(www.ecs.syr.edu/faculty/yin/Teaching/TC2010/Proj4.pdf). It talks 
about tracing the instructions.


--
Dushyant

Wow thank you! It sounds incredibly interesting!!

What we really need is to insert a function call into the translated code, 
so when each instruction is executed at runtime, our inserted function will 
be executed.
Again wow!! Is that really possible? Some sort of callback triggered at 
every instruction execution?

Do you have any other document explaining that?
This PDF just gives instructions on how to do it on an old version of 
qemu (disas_insn doesn't exist at all in my code now), and does not 
explain what it is or what's behind that suggested code..
Also the code for single-stepping would be of great help to me! I really 
needed that.. but when I tried it on qemu-user it didn't work at all..

Thank you very much!
Best regards,
Stefano B.





Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Stefano Bonifazi

Hi! Thanks for replying to me!

The thing is, the kernel currently _does_ work, so studying the relevant
kernel code (and possibly the dynamic loader code) is one way to learn
how it currently works.

Sorry, what kernel? QEMU's? Linux's?




Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Rob Landley
On 01/24/2011 03:16 PM, Stefano Bonifazi wrote:
 Hi! Thanks for replying me!
 The thing is, the kernel currently _does_ work, so studying the relevant
 kernel code (and possibly the dynamic loader code) is one way to learn
 how it currently works.
 Sorry what kernel? Qemu's? Linux's?

QEMU isn't a kernel, it's an emulator.  Linux is a kernel.

I meant Linux loads and runs Linux ELF executables.  That's pretty much
the definition of how to do it.  So if there's ever a conflict between
how qemu does it and how the Linux kernel does it, the Linux kernel
is going to win.  (And yes, this has come up before, for me it was
http://www.mail-archive.com/qemu-devel@nongnu.org/msg25336.html )

That said, QEMU's currently working fairly well on this front too, so
studying either should work pretty well...

One advantage of the kernel is cat /proc/$PID/maps which lets you know
what the mappings are, and then you can look up the appropriate chunks
of the executable and read the elf spec:

  http://refspecs.freestandards.org/elf/elf.pdf
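On the /proc/$PID/maps suggestion, here is a tiny standalone C program (not
from the thread) that dumps its own mappings, the same information the cat
command shows:

    #include <stdio.h>

    int main(void)
    {
        /* Each line: start-end perms offset dev inode path */
        FILE *f = fopen("/proc/self/maps", "r");
        char line[512];

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }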

And to be honest, the best way to get up to speed on this is to read this:

  http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html

Where some guy asked "ok, what do we actually NEED?" and then set out to
prove it.

This book is pretty good too, although so dry it's almost unreadable.
You might have better luck getting a paper copy out of the library:

  http://www.iecc.com/linker/

Rob



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-24 Thread Dushyant Bansal


You should see this pdf 
(www.ecs.syr.edu/faculty/yin/Teaching/TC2010/Proj4.pdf). It talks 
about tracing the instructions.


--
Dushyant

Wow thank you! It sounds incredibly interesting!!

What we really need is to insert a function call into the translated code, 
so when each instruction is executed at runtime, our inserted function will 
be executed.
Again wow!! Is that really possible? Some sort of callback triggered 
at every instruction execution?
Yes, this mechanism works. I have written code to count different 
kinds of instructions.

Do you have any other document explaining that?
No. But maybe you can try to understand this through the qemu source code. 
Here are some resources for that: 
http://stackoverflow.com/questions/4501173/a-call-to-those-who-have-worked-with-qemu
This PDF just gives instructions on how to do it on an old version of 
qemu (disas_insn doesn't exist at all in my code now), and does not 
explain what it is or what's behind that suggested code..
Also the code for single-stepping would be of great help to me! I really 
needed that.. but when I tried it on qemu-user it didn't work at all..
It exists in the file qemu/target-i386/translate.c. You are also talking 
about the qemu source code provided here http://wiki.qemu.org/Download, right?
If you need it, I can give you the source code of the counting implementation with 
some documentation.

Hope this helps.

--
Dushyant


Re: [Qemu-devel] TCG flow vs dyngen

2011-01-23 Thread Rob Landley
On 01/16/2011 10:01 AM, Raphaël Lefèvre wrote:
 On Sun, Jan 16, 2011 at 11:21 PM, Stefano Bonifazi
 stefboombas...@gmail.com wrote:
 2. how can I check the number of target cpu cycles or target
 instructions executed inside qemu-user (i.e. qemu-ppc)?
 Is there any variable I can inspect for such information? at Dec, 2010

Keep in mind I'm a bit rusty and not an expert, but I'll give a stab at
answering:

You can't, because QEMU doesn't work that way.  QEMU isn't an
instruction level emulator, it's closer to a Java JIT.  It doesn't
translate one instruction at a time but instead translates large blocks
of code all at once, and keeps a cache of translated blocks around.
Execution jumps into each block and either waits for it to exit again
(meaning it jumped out of that page and QEMU's main execution loop has
to look up what page to execute next, possibly translating it first if
it's not in the cache yet), or else QEMU interrupts it after a while to
fake an IRQ of some kind (such as a timer interrupt).

You may want to read Fabrice Bellard's original paper on the QEMU design:

http://www.usenix.org/event/usenix05/tech/freenix/full_papers/bellard/bellard.pdf

Since that was written, dyngen was replaced with tcg, but that does the
same thing in a slightly different way.

Building a QEMU with dyngen support used to use the host compiler to
compile chunks of code corresponding to the target operations it would
see at runtime, and then strip the machine language out of the resulting
.o files and save them in a table.  Then at runtime dyngen could
generate translated pages by gluing together the resulting saved machine
language snippets the host compiler had produced when qemu was built.
The problem was, beating the right kind of machine language snippets out
of the .o files the compiler produced from the example code turned out
to be VERY COMPILER DEPENDENT.  This is why you couldn't build qemu with
gcc 4.x for the longest time, gcc's code generator and the layout of the
.o files changed in a bunch of subtle ways which broke dyngen's ability
to extract usable machine code snippets to put 'em into the table so it
could translate pages at runtime.

TCG stands for Tiny Code Generator.  It just hardwires a code
generator into QEMU.  They wrote a mini-compiler in C, which knows what
instructions to output for each host qemu supports.  If QEMU understands
target instructions well enough to _read_ them, it's not a big stretch
to be able to _write_ them when running on that kind of host.  (It's
more or less the same operation in reverse.)  This means that QEMU can
no longer run on a type of host it can't execute target code for, but
the solution is to just add support for all the interesting machines out
there, on both sides.
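For a feel of the two halves Rob describes, this is roughly what a target
translator hands to TCG for a guest add instruction; the line is a hedged,
from-memory fragment in the style of target-ppc/translate.c, shown for
illustration rather than as the exact code:

    /* guest "add rD, rA, rB": emit a TCG add op on the guest register
     * variables; TCG later turns this op into host machine code. */
    tcg_gen_add_tl(cpu_gpr[rD], cpu_gpr[rA], cpu_gpr[rB]);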

So, when QEMU executes code, the virtual MMU faults a new page into the
virtual TLB, and goes "I can't execute this, fix it up!"  And the fixup
handler looks for a translation of the page in the cache of translated
pages, and if it can't find it it calls the translator to convert the
target code into a page of corresponding host code.  Which may involve
discarding an existing entry out of the cache, but this is how
instruction caches work on real hardware anyway so the delays in QEMU
are where they'd be on real hardware anyway, and optimizing for one is
pretty close to optimizing for the other, so life is good.

The chunk you found earlier is a function pointer typecast:

#define tcg_qemu_tb_exec(tb_ptr) \
  ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

Which looks like it's calling code_gen_prologue() with tb_ptr as its
argument (typecast to a void *), and it returns a long.  That calls a
translated page, and when the function returns that means the page of
code needs to jump to code somewhere outside of that page, and we go
back to the main loop to figure out where to go next.
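The same cast-a-buffer-and-call-it idiom can be shown in a tiny standalone
program (x86-64 Linux only, not QEMU code); the byte string is the machine
code for "mov eax, 42; ret", standing in for a translated block, and the
final cast and call mirror what the tcg_qemu_tb_exec macro does with
code_gen_prologue:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00,  /* mov eax, 42 */
                                 0xc3 };                        /* ret */

        /* Get a writable and executable page and copy the code into it
         * (some hardened kernels refuse W+X mappings). */
        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memcpy(buf, code, sizeof(code));

        /* Cast the buffer to a function pointer and call it. */
        int (*fn)(void) = (int (*)(void))buf;
        printf("generated code returned %d\n", fn());
        return 0;
    }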

The reason QEMU is as fast as it is is because once it has a page of
translated code, actually _running_ it is entirely native.  It jumps
into the page, and executes natively until it leaves the page.   Control
only goes back to QEMU to switch pages or to handle I/O and interrupts
and such.  So when you ask "how many clock cycles did that instruction
take?", the answer is "it doesn't work that way".  QEMU emulates at
memory page level (generally 4k of target code), not at individual
instruction level.

(Oh, and the worst thing you can do to QEMU from a performance
perspective is self-modifying code.  Because the virtual MMU has to
strip the executable bit off the TLB entry and re-translate the entire
page next time something tries to execute it.  It _works_, it's just
slow.  But again, real hardware can hiccup a bit on this too.)

Does that answer your question?

Rob



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-23 Thread Stefano Bonifazi

On 01/23/2011 10:50 PM, Rob Landley wrote:

On 01/16/2011 10:01 AM, Raphaël Lefèvre wrote:

On Sun, Jan 16, 2011 at 11:21 PM, Stefano Bonifazi
stefboombas...@gmail.com  wrote:
2. how can I check the number of target cpu cycles or target
instructions executed inside qemu-user (i.e. qemu-ppc)?
Is there any variable I can inspect for such informations? at Dec, 2010

Keep in mind I'm a bit rusty and not an expert, but I'll give a stab at
answering:

You can't, because QEMU doesn't work that way.  QEMU isn't an
instruction level emulator, it's closer to a Java JIT.  It doesn't
translate one instruction at a time but instead translates large blocks
of code all at once, and keeps a cache of translated blocks around.
Execution jumps into each block and either waits for it to exit again
(meaning it jumped out of that page and QEMU's main execution loop has
to look up what page to execute next, possibly translating it first if
it's not in the cache yet), or else QEMU interrupts it after a while to
fake an IRQ of some kind (such as a timer interrupt).

You may want to read Fabrice Bellard's original paper on the QEMU design:

http://www.usenix.org/event/usenix05/tech/freenix/full_papers/bellard/bellard.pdf

Since that was written, dyngen was replaced with tcg, but that does the
same thing in a slightly different way.

Building a QEMU with dyngen support used to use the host compiler to
compile chunks of code corresponding to the target operations it would
see at runtime, and then strip the machine language out of the resulting
.o files and save them in a table.  Then at runtime dyngen could
generate translated pages by gluing together the resulting saved machine
language snippets the host compiler had produced when qemu was built.
The problem was, beating the right kind of machine language snippets out
of the .o files the compiler produced from the example code turned out
to be VERY COMPILER DEPENDENT.  This is why you couldn't build qemu with
gcc 4.x for the longest time, gcc's code generator and the layout of the
.o files changed in a bunch of subtle ways which broke dyngen's ability
to extract usable machine code snippets to put 'em into the table so it
could translate pages at runtime.

TCG stands for Tiny Code Generator.  It just hardwires a code
generator into QEMU.  They wrote a mini-compiler in C, which knows what
instructions to output for each host qemu supports.  If QEMU understands
target instructions well enough to _read_ them, it's not a big stretch
to be able to _write_ them when running on that kind of host.  (It's
more or less the same operation in reverse.)  This means that QEMU can
no longer run on a type of host it can't execute target code for, but
the solution is to just add support for all the interesting machines out
there, on both sides.

So, when QEMU executes code, the virtual MMU faults a new page into the
virtual TLB, and goes "I can't execute this, fix it up!"  And the fixup
handler looks for a translation of the page in the cache of translated
pages, and if it can't find it it calls the translator to convert the
target code into a page of corresponding host code.  Which may involve
discarding an existing entry out of the cache, but this is how
instruction caches work on real hardware anyway so the delays in QEMU
are where they'd be on real hardware anyway, and optimizing for one is
pretty close to optimizing for the other, so life is good.

The chunk you found earlier is a function pointer typecast:

#define tcg_qemu_tb_exec(tb_ptr) \
   ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

Which looks like it's calling code_gen_prologue() with tb_ptr as its
argument (typecast to a void *), and it returns a long.  That calls a
translated page, and when the function returns that means the page of
code needs to jump to code somewhere outside of that page, and we go
back to the main loop to figure out where to go next.

The reason QEMU is as fast as it is is because once it has a page of
translated code, actually _running_ it is entirely native.  It jumps
into the page, and executes natively until it leaves the page.   Control
only goes back to QEMU to switch pages or to handle I/O and interrupts
and such.  So when you ask "how many clock cycles did that instruction
take?", the answer is "it doesn't work that way".  QEMU emulates at
memory page level (generally 4k of target code), not at individual
instruction level.

(Oh, and the worst thing you can do to QEMU from a performance
perspective is self-modifying code.  Because the virtual MMU has to
strip the executable bit off the TLB entry and re-translate the entire
page next time something tries to execute it.  It _works_, it's just
slow.  But again, real hardware can hiccup a bit on this too.)

Does that answer your question?

Rob

Wow! Thank you! That's an ANSWER!
Gold for anyone studying all of that! Though at this stage of my work I 
had already had to understand almost all of it, your perfect summary makes 
everything much clearer..
About counting 

Re: [Qemu-devel] TCG flow vs dyngen

2011-01-23 Thread Rob Landley
On 01/23/2011 04:25 PM, Stefano Bonifazi wrote:
 I am trying to shift in memory the target executable .. now the code is
 supposed to be loaded by the elfloader at the exact start address set
 at link time ..

Ah, elf loading.  That's a whole 'nother bag of worms.

Oddly enough, I was dealing with this last year trying to debug the
uClibc dynamic linker.  I blogged a bit about it at the time:

  http://landley.net/notes-2010.html#12-07-2010

(And the next few days.  Sigh, I never did go back and fill in the
holes, did I?)

 Inside the ELF loader there is even a check verifying whether that
 address range is busy.. but no action is taken in that case o.O
 Maybe I'll post a new thread about this problem (bug?).. in any case, if you
 think you can help me I'll give you further details..

Tired right now, but if you post a clearer question (what are you trying
to _do_) and cc: me on it I'll try to respond.

Maybe I can find some decent documentation to point you at, or maybe
I'll write some...

Rob



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-17 Thread Lluís
Stefano Bonifazi writes:

 Hi!
  In case you are interested in helping me, I'll give you a big piece of news
 I've just got (even my teacher is not informed yet! :) )

I still don't understand what your high-level objective is...


Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Raphael Lefevre
On Wed, Dec 15, 2010 at 4:17 AM, Stefano Bonifazi stefboombas...@gmail.com 
wrote:

 On 12/11/2010 03:44 PM, Blue Swirl wrote:

 

 Hi!

 Thank you very much! Knowing exactly where I should check, in a so big

 project helped me very much!!

 Anyway after having spent more than 2 days on that code I still can't

 understand how it works the real execution:

 

 in cpu-exec.c : cpu_exec_nocache i find:

 

 /* execute the generated code */

next_tb = tcg_qemu_tb_exec(tb->tc_ptr);

 

 and in cpu-exec.c : cpu_exec

 

 /* execute the generated code */

 

next_tb = tcg_qemu_tb_exec(tc_ptr);

 

 so I thought tcg_qemu_tb_exec function should do the work of executing the

 translated binary in the host.

 But then I found out it is just a define in tcg.h:

 

 #define tcg_qemu_tb_exec(tb_ptr) ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

 

 and again in exec.c

 

 uint8_t code_gen_prologue[1024] code_gen_section;

 

 Maybe I have some problems with that C syntax, but I really don't understand

 what happens there.. how the execution happens!

 

 Here instead  with QEMU/TCG I understood that at runtime the target binary

 is translated into host binary (somehow) .. but then.. how can this new host

 binary be run? Shall the host code at runtime do some sort of (assembly

 speaking) branch jump to an area of memory with new host binary instructions

 .. and then jump back to the old process binary code?

 

1. As far as I know, the host code translated from the target instructions exists 
in the form of an object file; that's why it can be executed directly.

2. I think you have caught the right concept from some point of view; one part of 
QEMU's internals certainly does such jump-and-return work.

 

 If so, can you explain me how this happens in those lines of code?

 

I can only give a rough outline; the code you listed does a simple thing:

It modifies the host code execution pointer to point at the next address that 
the host processor should continue executing from.

 

 I am just a student.. unluckily at university they just tell you that a cpu

 follows some sort of fetch -decode-execute flow .. but then you open

 QEMU.. and wow there is a huge gap for understanding it, and no books where

 to study it! ;)

 

QEMU is not meant to simulate every detail of how the processor should 
behave; it just tries to approximate the necessary operations of what a machine 
should do!

The “fetch-decode-execute” flow only needs to be considered when you are 
involved in hardware design.

 

Raphaël Lefèvre



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Stefano Bonifazi

On 01/16/2011 03:46 PM, Raphael Lefevre wrote:


On Wed, Dec 15, 2010 at 4:17 AM, Stefano Bonifazi 
stefboombas...@gmail.com wrote:


 On 12/11/2010 03:44 PM, Blue Swirl wrote:



 Hi!

 Thank you very much! Knowing exactly where I should check, in a so big

 project helped me very much!!

 Anyway after having spent more than 2 days on that code I still can't

 understand how it works the real execution:



 in cpu-exec.c : cpu_exec_nocache i find:



 /* execute the generated code */

next_tb = tcg_qemu_tb_exec(tb->tc_ptr);



 and in cpu-exec.c : cpu_exec



 /* execute the generated code */



next_tb = tcg_qemu_tb_exec(tc_ptr);



 so I thought tcg_qemu_tb_exec function should do the work of 
executing the


 translated binary in the host.

 But then I found out it is just a define in tcg.h:



 #define tcg_qemu_tb_exec(tb_ptr) ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)



 and again in exec.c



 uint8_t code_gen_prologue[1024] code_gen_section;



 Maybe I have some problems with that C syntax, but I really don't 
understand


 what happens there.. how the execution happens!



 Here instead  with QEMU/TCG I understood that at runtime the target 
binary


 is translated into host binary (somehow) .. but then.. how can this 
new host


 binary be run? Shall the host code at runtime do some sort of (assembly

 speaking) branch jump to an area of memory with new host binary 
instructions


 .. and then jump back to the old process binary code?

1. As far as I know, the host code translated from the target instructions 
exists in the form of an object file; that's why it can be executed 
directly.


2. I think you have caught the right concept from some point of view; one part 
of QEMU's internals certainly does such jump-and-return work.


 If so, can you explain me how this happens in those lines of code?

I can only give a rough outline; the code you listed does a simple thing:

It modifies the host code execution pointer to point at the next 
address that the host processor should continue executing from.


 I am just a student.. unluckily at university they just tell you that 
a cpu


 follows some sort of fetch -decode-execute flow .. but then you open

 QEMU.. and wow there is a huge gap for understanding it, and no books 
where


 to study it! ;)

QEMU is not meant to simulate every detail of how the processor 
behaves; it just tries to approximate the necessary operations of 
what a machine should do!


The “fetch-decode-execute” flow only needs to be considered when you 
are involved in hardware design.


Raphaël Lefèvre


Thank you very much!
I've already solved this problem.. Right now I am fighting with changing the 
qemu-user code to make it run several 
binaries in succession.. But it seems to remember the first translated 
code.. Nobody answered my post about it; do you have any idea?




Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Raphaël Lefèvre
On Sun, Jan 16, 2011 at 11:21 PM, Stefano Bonifazi
stefboombas...@gmail.com wrote:

 Thank you very much!
 I've already solved this problem.. Right now I am fighting with the 
 possibility of changing qemu-user code for making it run several binaries in 
 succession .. But it seems to remember the first translated code.. Nobody 
 answered to my post about it, do you have any idea?


Sorry for my belated reply to this discussion. After I searched for the
topics you posted, it seems two main problems are unsolved? (Am I
right?? I'm not sure...)

1. I edited QEMU user, more exactly qemu-ppc launching the main function
(inside main.c) from another c function I created, passing it the
appropriate parameters. ...balabala at Jan, 2011

2. how can I check the number of target cpu cycles or target
instructions executed inside qemu-user (i.e. qemu-ppc)?
Is there any variable I can inspect for such information? at Dec, 2010

If I'm not correct, please let me know where the problem is.

Raphaël Lefèvre



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Stefano Bonifazi



Sorry for my belated reply to this discussion. After I searched for the
topics you posted, it seems two main problems are unsolved? (Am I
right?? I'm not sure...)

1. I edited QEMU user, more exactly qemu-ppc launching the main function
(inside main.c) from another c function I created, passing it the
appropriate parameters. ...balabala at Jan, 2011

2. how can I check the number of target cpu cycles or target
instructions executed inside qemu-user (i.e. qemu-ppc)?
Is there any variable I can inspect for such informations? at Dec, 2010

If I'm not correct, please let me know where the problem is.

Raphaël Lefèvre

Hi!
Thank you very much for Your concern!
Honestly I had lost hope of getting any help; I even contacted some 
developers on this mailing list directly, without luck!
I am a student who needs to use qemu for a project where it will be used 
for its ability to run PowerPC code.
As you can imagine, qemu goes far beyond a student's knowledge of 
electronics and computer science. Nevertheless, I have to do it!
I have been studying all the technical documents available on the 
internet, but there is really not much at all, not enough for 
picking up the code and being able to understand it.. It is in C, not even 
modular C++.
Anyway, with some help from this mailing list, and a lot of studying 
about assembly, loaders, compilers.. I am getting on, though there are 
still big problems due to the nature of the QEMU code..
First of all, I am starting from qemu-user, more specifically qemu-ppc, 
as I don't need the full system capabilities, and it is easier for me to 
control the target binary's memory with qemu-user.
Originally I started with a lot of work on libqemu.. until some 
developer here told me it was deprecated (though still in the source) 
and not working well.
I edited the code of qemu-ppc so that another function of mine calls the 
qemu-user main, with the appropriate parameters.. The goal was 
to launch it several times with different target binaries in succession..
For some reason I still can't figure out, the qemu code remembers the old 
code, running it instead of the newly loaded binary.. and if I flush the 
cache of translated code before loading a new binary it stops and can't 
go on!
My workaround for this problem was compiling qemu-ppc as a dynamic 
library and loading it at runtime.. I also managed to load multiple copies 
of it (with dlmopen, each in a different address space)... in fact I need 
to run more than one qemu-ppc at the same time, but a new big problem 
popped up: the target binary is always loaded at a fixed address.. 
no matter whether another qemu-ppc already loaded code there.. it is as if 
the internal ELF loader can't tell that those addresses are not available 
and then relocate them..
I tried to link (ld) the target ELF binary as position-independent code, 
but then qemu-ppc complains it can't find /usr/lib/libc.so.1 and 
/usr/lib/ld.so.1


To sum up, the problems are (in order of importance):
 - making the ELF loader relocate the target code to other addresses 
when the default ones (I guess those embedded into the target binary 
when it is not compiled as position-independent code) are taken
 - making qemu-user able to run more than one target binary in 
succession

 - counting the instructions qemu-user executes

My university is a public one, so my project will be open to the 
community. I will also upload the documentation I am writing about qemu, 
based on the knowledge I am acquiring while working on it, so that, I hope, 
other people will find the first steps into developing qemu less 
frustrating!


Any help will be more than welcome!

Thank you in advance!
Stefano B.





Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Peter Maydell
2011/1/16 Stefano Bonifazi stefboombas...@gmail.com:
 My workaround to this problem was compiling qemu-ppc as a dynamic library
 and load it at runtime.. I also managed to load multiple copies of it (with
 dlmopen each at a different address space) ..in fact I need to run more than
 one qemu-ppc at the same time

This approach seems very unlikely to work -- in general qemu in
both system and user mode assumes that there is only one
instance running in the host process address space, and things
are bound to clash. (Linux doesn't seem to have dlmopen but
google suggests that it puts the library in its own namespace
but not its own address space.) Running each qemu as its own
process and using interprocess communication for whatever
coordination you need between the various instances seems
more likely to be workable to me. This will also fix your "can't run
more than one binary in succession" problem, because you can
just have the first qemu run and exit as normal and launch a
second qemu to run the second binary.
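A minimal standalone sketch of that process-per-instance approach (the guest
binary paths are placeholders, and error handling is reduced to the essentials):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int run_under_qemu(const char *guest_binary)
    {
        pid_t pid = fork();

        if (pid == 0) {
            /* Child: exec a fresh qemu-ppc for this guest binary. */
            execlp("qemu-ppc", "qemu-ppc", guest_binary, (char *)NULL);
            perror("execlp");
            _exit(127);
        }
        /* Parent: wait for it, so binaries run strictly in succession. */
        int status;
        waitpid(pid, &status, 0);
        return status;
    }

    int main(void)
    {
        run_under_qemu("./guest-a");   /* placeholder guest binaries */
        run_under_qemu("./guest-b");
        return 0;
    }

Coordination between the instances can then be layered on top with pipes,
sockets or shared memory, as Peter suggests.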

-- PMM



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Stefano Bonifazi

Thank you very much for Your fast reply!


On 01/16/2011 07:29 PM, Peter Maydell wrote:

Linux doesn't seem to have dlmopen

http://www.unix.com/man-page/All/3c/dlmopen/

#define __USE_GNU
#include <dlfcn.h>

lib_handle1 = dlmopen(LM_ID_NEWLM, "./libqemu-ppc.so", RTLD_NOW);

I am developing that on a clean ubuntu 10.10
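For reference, a standalone version of Stefano's dlmopen call, assuming a
locally built libqemu-ppc.so (the path and the idea of loading qemu as a
library are his; only the boilerplate around dlmopen(3) is added here);
build with -ldl:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* Load two copies of the library into separate link-map namespaces. */
        void *h1 = dlmopen(LM_ID_NEWLM, "./libqemu-ppc.so", RTLD_NOW);
        void *h2 = dlmopen(LM_ID_NEWLM, "./libqemu-ppc.so", RTLD_NOW);

        if (!h1 || !h2) {
            fprintf(stderr, "dlmopen failed: %s\n", dlerror());
            return 1;
        }
        printf("loaded two independent copies: %p %p\n", h1, h2);
        /* ... look up entry points with dlsym(h1, "..."), then call them ... */
        dlclose(h1);
        dlclose(h2);
        return 0;
    }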

but google suggests that it puts the library in its own namespace
but not its own address space.
I need to make the different instances of qemu-user exchange data .. 
obviously keeping all of them in the same address space would be the 
easiest way (unless I have to change all the qemu code ;) )

Running each qemu as its own

process and using interprocess communication for whatever
coordination you need between the various instances seems
more likely to be workable to me. This will also fix your "can't run
more than one binary in succession" problem, because you can
just have the first qemu run and exit as normal and launch a
second qemu to run the second binary.

-- PMM
Exactly, it was the easiest way for me too.. and I've already done it; 
it works smoothly.. the only big problem is that it is not good enough for my 
teacher.. he says it should work the dynamic-library way o.O
Working with libraries even solved the problem of consecutive runs, 
though in my opinion software is not good when you must reboot it 
to make it run fine again.. sounds more Windows style :D
Clearly it makes memory dirty and does not clean up after the target 
process completes its execution.. leaving the OS to take care of it.
I tried zeroing all global variables before starting a new execution, 
without results (other than making it stall).. After a very long time 
spent trying to find a solution, I think the problem is with the 
mmap'ings in the loader.. the same reason why 2 different 
libraries with their own namespaces clash, in my opinion.. the ELF 
loaders work globally within the single address space.. I think for a 
guru of loaders and linkers it should not be so difficult to patch, but not 
for a student who has practically heard of them for the first time ;)

Any help is very appreciated :)
Thank you again!
Stefano B.









Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Raphaël Lefèvre
2011/1/17 Stefano Bonifazi stefboombas...@gmail.com:

 Hi!
 Thank you very much for Your concern!
 Honestly I had lost hope in any help, I even contacted directly some
 developers in this mailing list without luck!

I guess many good developers on the mailing list are still trying their best
to solve your problems, such as Blue Swirl, Paolo Bonzini, Stefan
Weil, Peter Maydell, Mulyadi Santosa, Andreas Färber and Alexander
Graf (I hope I haven't left out anyone who has helped you, and the order of
the name list carries no meaning) ...etc. Every developer has his own
expertise, and it is hard to keep track of all the activity in qemu.
Please trust one thing: you are not alone :).

 I am a student who needs to use qemu for a project where it will be used for
 its capabilities of running PowerPC code.
 As you can imagine qemu goes far beyond the knowledge in electronics and
 computer science of a student. Nevertheless I have to do that!
 I have been studying all the possible technical documents available in the
 internet, but it is really not much at all , not sufficient for getting the
 code and being able of understanding it .. It is in C, even not modular C++

Due to the lack of technical documentation for qemu, and since you are a
student (maybe studying for a master's/PhD degree?), some papers published
by IEEE/ACM may give you some inspiration and help (assuming your
university has bought download access). As far as I know, though
the topic of qemu is relatively new for academia, there are still
papers that have discussed it. Maybe you can find which research
domain is closest to your work. If any paper has inspired you or
relates to your research, don't hesitate to discuss it.

 Anyway with some help from this mailing list, and a lot of studying about
 assembly, loaders, compilers.. I am going on, though there are still big
 problems due of the nature of the QEMU code..
 First of all, I am starting from qemu-user, more specifically, qemu-ppc as I
 don't need the full system capabilities, and it is easier for me to control
 the binary target memory with qemu-user.

Is there any reason why you should use the user mode of qemu rather than
the system mode? Sometimes the system mode of qemu will free you from
the nightmare of managing the memory hierarchy. Maybe you can start
by talking about the original goal of the project instead of
falling into the hell of code tracing.

 Originally I started with a lot of work on libqemu .. until some developer
 here told me it was deprecated (though still in the source) and not working
 fine.
 I edited the code of qemu-ppc so that another function of mine calls
 qemu-user main, with the appropriate parameters.. The pursued goal was to
 launch it several times with different target binaries in succession..
 For some reason, I still can't find out, qemu code remembers the old code,
 running it instead of the new loaded binary.. and if I flush the cache of
 translated code before loading a new binary it stops and can't go on!
 My workaround to this problem was compiling qemu-ppc as a dynamic library
 and load it at runtime.. I also managed to load multiple copies of it (with
 dlmopen each at a different address space) ..in fact I need to run more than
 one qemu-ppc at the same time but a new big problem popped up now: the

I need to thank Peter Maydell for explaining the principle that I'm
not familiar with. And from your description, do you want to use
multiple cores? Because I cannot imagine which application needs to run
multiple qemu-ppc instances at the same time.

 target binary is loaded always at a fixed address.. no matter if another
 qemu-ppc already loaded code there.. it is like the internal elf loader
 can't understand those addresses are not available, and then relocate them
 ..
 I tried to link (ld) the binary target elf as position independent code, but
 then qemu-ppc complains it can't find  /usr/lib/libc.so.1 and
  /usr/lib/ld.so.1


The above description seems to be out of scope for me to answer, because I
have only studied the system mode of qemu.

 To sum up the problems are (in order of importance):
  - making the elf loader relocate the target code into other addresses when
 the default ones (I guess those embedded into the target binary when it is
 not compiled as position independent code) are taken

Maybe the problem can only be solved by rewriting the loader, if you
insist on using user mode (just as in your response to Peter).

  - making qemu-user able of running more than one target binary in
 succession

Could running more than one target binary in succession (say A, then B, then
C) be achieved by compiling A, B and C into one binary in sequence?

  - counting qemu-user executed instructions

I guess all the work before this is for the goal of counting
qemu-user executed instructions, am I right? If so, a paper
published at IEEE in 2010 may give some help (I guess):
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5475901 (Make
sure that your university 

Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Peter Maydell
2011/1/16 Stefano Bonifazi stefboombas...@gmail.com:
 I need to make the different instances of qemu-user exchange data ..
 obviously keeping all of them in the same address space would be the easiest
 way (unless I have to change all qemu code ;) )

The problem is that you're trying to break a fundamental
assumption made by a lot of qemu code. That's a large
job which involves understanding, checking and possibly
changing lots of already written code. In contrast, the
code you need to exchange data between the instances is
going to be fairly small and self contained and you'll already
understand it because you've written it/will write it. I think
it's pretty clear which one is going to be easier.

 Running each qemu as its own
 process and using interprocess communication for whatever
 coordination you need between the various instances seems
 more likely to be workable to me.

 Exactly, it was the easiest way also for me.. and I've already done it,
 works smoothly .. the only big problem is that it is not good for my
 teacher.. he says it should work the dynamic library way o.O

I think he's wrong. (You might like to think about what happens
if the program being emulated in qemu user-mode does a fork()).

Basically you're trying to do things the hard way; maybe
you can get something that sort of works in the subset of
cases you care about, but why on earth put in that much
time and effort on something irrelevant to the actual problem
you're trying to work on?

-- PMM



Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Stefano Bonifazi

Hi!
 In case you are interested in helping me, I'll give you a big piece of 
news I've just got (even my teacher is not informed yet! :) )
I've just managed to make more than one instance of qemu-user run at the 
same time by linking the target code with a specified address for the code 
section (the -Ttext address option of ld).
It works fine, and this proves my idea that the problem is within the ELF 
loader..

Making it relocate the target code properly would fix the problem ;)
Now let's work on it :)
Regards,
Stefano B.

On 01/16/2011 08:02 PM, Stefano Bonifazi wrote:

Thank you very much for Your fast reply!


On 01/16/2011 07:29 PM, Peter Maydell wrote:

Linux doesn't seem to have dlmopen

http://www.unix.com/man-page/All/3c/dlmopen/

#define __USE_GNU
#include <dlfcn.h>

lib_handle1 = dlmopen(LM_ID_NEWLM, "./libqemu-ppc.so", RTLD_NOW);

I am developing that on a clean ubuntu 10.10

but google suggests that it puts the library in its own namespace
but not its own address space.
I need to make the different instances of qemu-user exchange data .. 
obviously keeping all of them in the same address space would be the 
easiest way (unless I have to change all qemu code ;) ) Running each 
qemu as its own

process and using interprocess communication for whatever
coordination you need between the various instances seems
more likely to be workable to me. This will also fix your "can't run
more than one binary in succession" problem, because you can
just have the first qemu run and exit as normal and launch a
second qemu to run the second binary.

-- PMM
Exactly, it was the easiest way also for me.. and I've already done 
it, works smoothly .. the only big problem is that it is not good for 
my teacher.. he says it should work the dynamic library way o.O
Working with libraries even solved the problem of consecutive runs, 
though according to me it is not good a software when you must reboot 
it for making it run again fine.. sounds more Windows style :D
Clearly it makes memory dirty and do not clean after the target 
process completes its execution.. leaving the OS care about it.
I tried zeroing all global variables before starting a new execution 
without results (other than making it stall) .. After very long time 
spent trying to find a solution I think the problem should be with the 
mmap' ings stuff in the loader .. the same reason why 2 different 
libraries with their own namespaces clash according to me.. the elf 
loaders work globally within the unique address space .. I think for a 
guru of loaders-linkers should not be so difficult to patch it.. but 
not for a student who almost heard about them for the first time  ;)

Any help is very appreciated :)
Thank you again!
Stefano B.








Re: [Qemu-devel] TCG flow vs dyngen

2011-01-16 Thread Raphaël Lefèvre
2011/1/17 Stefano Bonifazi stefboombas...@gmail.com:
 Hi!
  In case you are interested in helping me, I'll give you a big piece of news
 I've just got (even my teacher is not informed yet! :) )
 I've just managed to make more than one instance of qemu-user run at the
 same time linking the target code with a specified address for the code
 section (-Ttext address of ld).
 It works fine and this proves my idea that the problem is within the elf
 loader..
 Making it relocate the target code properly would fix the problem ;)
 Now let's work on it :)
 Regards,
 Stefano B.


Congratulation~ just keep going on~!

Raphaël Lefèvre



Re: [Qemu-devel] TCG flow vs dyngen

2010-12-14 Thread Stefano Bonifazi

On 12/11/2010 03:44 PM, Blue Swirl wrote:

On Sat, Dec 11, 2010 at 2:32 PM, Stefano Bonifazi
stefboombas...@gmail.com  wrote:

Where does the execution of host binary take place in the previous list of 
events?  Between point 5) and 6) ?
After 6) ? In what QEMU source code file/function does the final execution of 
host binary take place?

In the previous list of events, when does the translator try to chain the current TB with 
previous ones?  Before TCG generates the binary in order to feed it with linked 
micro code?

All of this happens in cpu-exec.c:581 to 618.

Hi!
Thank you very much! Knowing exactly where I should look, in such a big 
project, helped me very much!!
Anyway, after having spent more than 2 days on that code, I still can't 
understand how the real execution works:


in cpu-exec.c : cpu_exec_nocache i find:


/* execute the generated code */
next_tb = tcg_qemu_tb_exec(tb->tc_ptr);

and in cpu-exec.c : cpu_exec


/* execute the generated code */

next_tb = tcg_qemu_tb_exec(tc_ptr);
so I thought the tcg_qemu_tb_exec function should do the work of executing 
the translated binary on the host.

But then I found out it is just a define in tcg.h:

#define tcg_qemu_tb_exec(tb_ptr) ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

and again in exec.c


uint8_t code_gen_prologue[1024] code_gen_section;
Maybe I have some problems with that C syntax, but I really don't 
understand what happens there.. how the execution happens!


Maybe I am too stuck on my idea of a classic fetch-decode-execute emulator, 
where an addition would be implemented simply as env->regC = 
env->regA + env->regB ... where this C statement would be compiled 
offline into host machine binary by the host compiler.. so the emulator 
would be a monolithic block of host code, just with branches for the 
different opcodes coming from the target binary loaded at runtime..
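That classic interpreter style can be pictured with a tiny standalone toy
(an invented 3-register machine, nothing to do with any real target):

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ADD, OP_HALT };

    typedef struct {
        uint32_t regA, regB, regC;
    } CPUState;

    int main(void)
    {
        uint8_t program[] = { OP_ADD, OP_HALT };   /* stand-in target binary */
        CPUState env = { .regA = 2, .regB = 40 };

        for (size_t pc = 0; ; pc++) {              /* fetch */
            switch (program[pc]) {                 /* decode */
            case OP_ADD:                           /* execute */
                env.regC = env.regA + env.regB;
                break;
            case OP_HALT:
                printf("regC = %u\n", env.regC);
                return 0;
            }
        }
    }

QEMU/TCG replaces that inner switch with translation to host code, which is
exactly the gap the rest of this message asks about.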
Here instead  with QEMU/TCG I understood that at runtime the target 
binary is translated into host binary (somehow) .. but then.. how can 
this new host binary be run? Shall the host code at runtime do some sort 
of (assembly speaking) branch jump to an area of memory with new host 
binary instructions .. and then jump back to the old process binary code?

If so, can you explain me how this happens in those lines of code?
I am just a student.. unluckily at university they just tell you that a 
cpu follows some sort of fetch -decode-execute flow .. but then you 
open QEMU.. and wow there is a huge gap for understanding it, and no 
books where to study it! ;)

Please help me understanding it :)
Thank you very very much in advance!
Stefano B.











Re: [Qemu-devel] TCG flow vs dyngen

2010-12-11 Thread Blue Swirl
On Fri, Dec 10, 2010 at 9:26 PM, Stefano Bonifazi
stefboombas...@gmail.com wrote:
 Hi all!
  From the technical documentation
 (http://www.usenix.org/publications/library/proceedings/usenix05/tech/freenix/bellard.html)
 I read:

 The first step is to split each target CPU instruction into fewer simpler
 instructions called micro operations. Each micro operation is implemented by
 a small piece of C code. This small C source code is compiled by GCC to an
 object file. The micro operations are chosen so that their number is much
 smaller (typically a few hundreds) than all the combinations of instructions
 and operands of the target CPU. The translation from target CPU instructions
 to micro operations is done entirely with hand coded code.

 A compile time tool called dyngen uses the object file containing the micro
 operations as input to generate a dynamic code generator. This dynamic code
 generator is invoked at runtime to generate a complete host function which
 concatenates several micro operations.

 instead from wikipedia(http://en.wikipedia.org/wiki/QEMU) and other sources
 I read:

 The Tiny Code Generator (TCG) aims to remove the shortcoming of relying on a
 particular version of GCC or any compiler, instead incorporating the
 compiler (code generator) into other tasks performed by QEMU in run-time.
 The whole translation task thus consists of two parts: blocks of target code
 (TBs) being rewritten in TCG ops - a kind of machine-independent
 intermediate notation, and subsequently this notation being compiled for the
 host's architecture by TCG. Optional optimisation passes are performed
 between them.

 - So, I think that the technical documentation is now obsolete, isn't it?

At least we shouldn't link to that paper anymore. There's also
documentation generated from qemu-tech.texi that should be up to date.

 - The old way used much offline (compile time) work compiling the micro
 operations into host machine code, while if I understand well, TCG does
 everything in run-time(please correct me if I am wrong!).. so I wonder, how
 can it be as fast as the previous method (or even faster)?

The dyngen way was to extract machine instructions for each micro-op
from an object file (op.o) compiled by GCC during QEMU build. TCG
instead generates the instructions directly. Since the whole host
register set is available for the micro-ops (in contrast to fixed
T0/T1/T2 used by dyngen), TCG should outperform dyngen in some cases.
In other cases, GCC may have applied some optimization when generating
the op that would be too complex to implement in the TCG generator, so
the dyngen op may have been more optimal.

The old way was not portable to GCC 4.x series. Now it might be even
possible to replace GCC extensions with something else and use other
compilers.
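
To make the contrast concrete, a dyngen-era micro-op was essentially a
tiny C function along these lines (an illustrative sketch in the spirit of
the old target-i386/op.c, not an exact copy; the stubs for OPPROTO and the
fixed temporaries are only there so the sketch compiles):

/* Sketch of a dyngen-style micro-op, not literal QEMU source. Each op was a
 * small C function compiled by GCC into op.o at build time; dyngen extracted
 * its machine code from the object file, and at run time copies of these
 * fragments were concatenated to form the translated block. Note the fixed
 * temporaries T0/T1 the ops were limited to. */
#define OPPROTO
static long T0, T1;

void OPPROTO op_addl_T0_T1(void)
{
    T0 = T0 + T1;
}

With TCG the target front end instead emits an intermediate op directly
(roughly tcg_gen_add_tl(dst, src1, src2)), and the host back end turns those
ops into machine code at run time, free to pick whatever host registers
happen to be available.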

 - If I understand well, TCG runtime flow is the following:
     - TCG takes the target binary, and splits it into target blocks
     - if the TB is not cached, TCG translates it (or better the target
 instructions it is composed by) into TCG micro ops,

The above is not the job of TCG (which is host specific), but the
target specific translators (target-*/translate.c).

     - TCG compiles TCG uops into host object code,

OK.

     - TCG caches the TB,
     - TCG tries to chain the block with others,

The above is part of the CPU execution loop (cpu-exec.c), TCG is not
involved anymore.

     - TCG copies the TB into the execution buffer

There is no copying.

     - TCG runs it
 Am I right? Please correct me, whether I am wrong, as I wanna use that flow
 scheme for trying to understand the code..

Otherwise right.



Re: [Qemu-devel] TCG flow vs dyngen

2010-12-11 Thread Stefano Bonifazi
Thank you very very much! It would take me months to understand everything 
myself from the source code! :)


On 12/11/2010 12:02 PM, Blue Swirl wrote:

On Fri, Dec 10, 2010 at 9:26 PM, Stefano Bonifazi
stefboombas...@gmail.com  wrote:

[..]

- So, I think that the technical documentation is now obsolete, isn't it?

At least we shouldn't link to that paper anymore. There's also
documentation generated from qemu-tech.texi that should be up to date.

Do you mean this:
http://www.weilnetz.de/qemu-tech.html
?


- If I understand well, TCG runtime flow is the following:
 - TCG takes the target binary, and splits it into target blocks
 - if the TB is not cached, TCG translates it (or better the target
instructions it is composed by) into TCG micro ops,

The above is not the job of TCG (which is host specific), but the
target specific translators (target-*/translate.c).
Ok, then considering the whole QEMU flow instead of just TCG, do those steps 
take place in the order I listed?



 - TCG caches the TB,
 - TCG tries to chain the block with others,

The above is part of the CPU execution loop (cpu-exec.c), TCG is not
involved anymore.
Ok! Thank you, now I have a clearer idea of where the different operations 
are implemented.. but again, considering the whole QEMU flow, are the 
steps I reported executed in the order I put them?

 - TCG copies the TB into the execution buffer

There is no copying.
Does that mean TCG produces the host object code directly into the 
emulator's memory for it to fetch? Or does TCG make the emulator 
execute that object code as soon as it is produced?
But if the object code is consumed on the fly, that means there is no 
caching of it, is there?
What is actually cached? Only target blocks? Their translation into TCG 
uops? The host binary code generated by TCG?


Again many many thanks!!!
Stefano B.



Re: [Qemu-devel] TCG flow vs dyngen

2010-12-11 Thread Blue Swirl
On Sat, Dec 11, 2010 at 12:29 PM, Stefano Bonifazi
stefboombas...@gmail.com wrote:
 Thank you very very much! I'd take months for understanding everything
 myself from the source code! :)

 On 12/11/2010 12:02 PM, Blue Swirl wrote:

 On Fri, Dec 10, 2010 at 9:26 PM, Stefano Bonifazi
 stefboombas...@gmail.com  wrote:

 [..]

 - So, I think that the technical documentation is now obsolete, isn't it?

 At least we shouldn't link to that paper anymore. There's also
 documentation generated from qemu-tech.texi that should be up to date.

 Do you mean this:
 http://www.weilnetz.de/qemu-tech.html
 ?

Yes.

 - If I understand well, TCG runtime flow is the following:
     - TCG takes the target binary, and splits it into target blocks
     - if the TB is not cached, TCG translates it (or better the target
 instructions it is composed by) into TCG micro ops,

 The above is not the job of TCG (which is host specific), but the
 target specific translators (target-*/translate.c).

 Ok, then considering QEMU flow instead of simply TCG, do those steps take
 place in the order I considered?

Yes, that's about it.

     - TCG caches the TB,
     - TCG tries to chain the block with others,

 The above is part of the CPU execution loop (cpu-exec.c), TCG is not
 involved anymore.

 Ok! Thank you, now I have a clearer idea of where different operations are
 implemented.. but again considering the whole QEMU flow, are the steps I
 reported executed in the order I put them?

     - TCG copies the TB into the execution buffer

 There is no copying.

 Does that mean TCG produces the host object code directly into the
 emulator's memory for it to fetch? Or does TCG make the emulator even
 execute that object code as soon as it is produced?
 But, if the object code is consumed on the fly, it means there is no caching
 of it, is there?
 What is actually cached? Only target blocks? Their translation into TCG
 uops? Host binary code generated by TCG?

There's a large buffer for generated code, allocated in exec.c. This
is filled with host code by TCG; when it is full, it is flushed. The CPU
execution loop generates new TBs when needed; otherwise the old code
can be executed.

TCG also uses intermediate ops but those are used only once during translation.
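
In pseudo-C the buffer management amounts to something like the following
(a simplified sketch; the real buffer is code_gen_buffer in exec.c and a
real flush also clears the TB hash tables):

#include <stddef.h>
#include <stdint.h>

/* Simplified sketch of the translation cache, not the real exec.c code.
 * Generated host code is appended bump-pointer style into one big buffer;
 * when there is no room left for a new block, everything is discarded and
 * translation starts again from an empty cache. */
static uint8_t *code_buf;   /* start of the buffer (code_gen_buffer in QEMU) */
static uint8_t *code_ptr;   /* next free byte */
static size_t   code_size;  /* total buffer size */

static uint8_t *alloc_code(size_t need)
{
    if (code_ptr + need > code_buf + code_size) {
        code_ptr = code_buf;   /* "flush": every cached translation is dropped */
        /* the real code also invalidates all TranslationBlock bookkeeping here */
    }
    uint8_t *p = code_ptr;
    code_ptr += need;
    return p;
}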



RE: [Qemu-devel] TCG flow vs dyngen

2010-12-11 Thread Stefano Bonifazi
-Original Message-
From: Blue Swirl [mailto:blauwir...@gmail.com] 
Sent: sabato 11 dicembre 2010 14:12
To: Stefano Bonifazi
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] TCG flow vs dyngen


There's a large buffer for generated code, allocated in exec.c. This is filled 
with host code by TCG, when full it is flushed. The CPU execution loop 
generates new TBs when needed, otherwise the old code can be executed.

TCG also uses intermediate ops but those are used only once during translation.

So if I understand correctly the flow is the following:

1) the CPU execution loop at runtime takes a new TB from the target code
2) I guess some hash function is computed on this TB to obtain a key for 
searching the buffer of generated code, which presumably stores the 
binary as a key-binary map
3) if the search is successful, the binary is given to the translator (how? You 
said no copying is involved) and we return to point 1); otherwise: 
4) the target specific translator generates TCG uops from the TB
5) TCG uses the uops to generate host binary code
6) this new binary code is cached by TCG if there is enough storage space 

Is that all correct?
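
(Just to check my understanding of steps 4 and 5: I imagine that for a guest
add the front end emits something like the following, with the actual host
instructions only produced afterwards by TCG. The names are my guess; the
real emitters live in target-<arch>/translate.c and the tcg_gen_* helpers in
tcg/tcg-op.h.)

/* My guess at what step 4 produces for a guest add (hypothetical wrapper and
 * register names). Only the intermediate TCG op is emitted here; turning it
 * into host binary is step 5. */
static void gen_guest_add(TCGv regC, TCGv regA, TCGv regB)
{
    tcg_gen_add_tl(regC, regA, regB);   /* regC = regA + regB, as a TCG op */
}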

Where does the execution of the host binary take place in the previous list of 
events?  Between points 5) and 6)?
After 6)? In which QEMU source code file/function does the final execution of 
the host binary take place?

In the previous list of events, when does the translator try to chain the 
current TB with previous ones?  Before TCG generates the binary, in order to 
feed it with linked micro code?

Thank you very very much! :)
Stefano B.




Re: [Qemu-devel] TCG flow vs dyngen

2010-12-11 Thread Blue Swirl
On Sat, Dec 11, 2010 at 2:32 PM, Stefano Bonifazi
stefboombas...@gmail.com wrote:
 -Original Message-
 From: Blue Swirl [mailto:blauwir...@gmail.com]
 Sent: sabato 11 dicembre 2010 14:12
 To: Stefano Bonifazi
 Cc: qemu-devel@nongnu.org
 Subject: Re: [Qemu-devel] TCG flow vs dyngen


There's a large buffer for generated code, allocated in exec.c. This is 
filled with host code by TCG, when full it is flushed. The CPU execution loop 
generates new TBs when needed, otherwise the old code can be executed.

TCG also uses intermediate ops but those are used only once during 
translation.

 So if I understand well the flow is the following:

 1) the CPU execution loop at runtime takes a new TB from the target code
 2) I guess some hash function is computed on this TB for getting a key for 
 searching into the buffer of generated code that probably should store the 
 binary as a map key-binary
 3) if the search is successful the binary is given to the translator(how? You 
 said no copy involved) and we return to point 1) otherwise:

1-3) Please see tb_find_fast() and its caller in cpu-exec.c. Only
pointer passing is involved.
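
Roughly, the lookup amounts to this (a paraphrased sketch, not the literal code):

/* Paraphrased sketch of the TB lookup; the real functions are tb_find_fast()
 * and tb_find_slow() in cpu-exec.c. The guest PC is hashed into a small
 * direct-mapped array of TranslationBlock pointers; on a miss a larger hash
 * table is searched, and only if that also misses is a new TB translated.
 * In every case only a pointer to the TB is handed around, the generated
 * host code itself is never copied. */
static TranslationBlock *find_tb(CPUState *env, target_ulong pc,
                                 target_ulong cs_base, uint64_t flags)
{
    TranslationBlock *tb = env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
    if (!tb || tb->pc != pc || tb->cs_base != cs_base || tb->flags != flags) {
        tb = tb_find_slow(pc, cs_base, flags);   /* may translate a new TB */
        env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
    }
    return tb;
}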

 4) the target specific translator generates TCG uops from the TB
 5) TCG uses uops for generating  host binary code
 6) this new binary code is cached by TCG if there is enough storage place

 Is that all correct?

4-5) OK.
6) If there is no space, all previously generated code is thrown away.


 Where does the execution of host binary take place in the previous list of 
 events?  Between point 5) and 6) ?
 After 6) ? In what QEMU source code file/function does the final execution of 
 host binary take place?

 In the previous list of events, when does the translator try to chain the 
 current TB with previous ones?  Before TCG generates the binary in order to 
 feed it with linked micro code?

All of this happens in cpu-exec.c:581 to 618.
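
In outline, that stretch of cpu_exec() does something like this (a paraphrased
sketch of the loop, with details and exception handling left out):

/* Paraphrased sketch of the inner loop of cpu_exec() in cpu-exec.c.
 * next_tb holds a pointer to the previously executed TB with the exit
 * index encoded in its low bits, exactly so the blocks can be chained. */
for (;;) {
    TranslationBlock *tb = tb_find_fast();          /* look up or translate */
    if (next_tb != 0 && tb->page_addr[1] == -1) {
        /* patch a direct jump from the previous TB to this one, so next
         * time execution flows from block to block without returning here */
        tb_add_jump((TranslationBlock *)(next_tb & ~3), next_tb & 3, tb);
    }
    /* execution: jump into the generated host code through the prologue */
    next_tb = tcg_qemu_tb_exec(tb->tc_ptr);
}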



[Qemu-devel] TCG flow vs dyngen

2010-12-10 Thread Stefano Bonifazi

Hi all!
 From the technical documentation 
(http://www.usenix.org/publications/library/proceedings/usenix05/tech/freenix/bellard.html) 
I read:


The first step is to split each target CPU instruction into fewer 
simpler instructions called /micro operations/. Each micro operation 
is implemented by a small piece of C code. This small C source code is 
compiled by GCC to an object file. The micro operations are chosen so 
that their number is much smaller (typically a few hundreds) than all 
the combinations of instructions and operands of the target CPU. The 
translation from target CPU instructions to micro operations is done 
entirely with hand coded code. 
A compile time tool called dyngen uses the object file containing the 
micro operations as input to generate a dynamic code generator. This 
dynamic code generator is invoked at runtime to generate a complete 
host function which concatenates several micro operations. 
instead from Wikipedia (http://en.wikipedia.org/wiki/QEMU) and other 
sources I read:


The Tiny Code Generator (TCG) aims to remove the shortcoming of 
relying on a particular version of GCC 
(http://en.wikipedia.org/wiki/GNU_Compiler_Collection) or any 
compiler, instead incorporating the compiler (code generator) into 
other tasks performed by QEMU in run-time. The whole translation task 
thus consists of two parts: blocks of target code (/TBs/) being 
rewritten in *TCG ops* - a kind of machine-independent intermediate 
notation, and subsequently this notation being compiled for the host's 
architecture by TCG. Optional optimisation passes are performed 
between them.

- So, I think that the technical documentation is now obsolete, isn't it?

- The old way did a lot of offline (compile time) work compiling the 
micro operations into host machine code, while, if I understand correctly, TCG 
does everything at run-time (please correct me if I am wrong!).. so I 
wonder, how can it be as fast as the previous method (or even faster)?


- If I understand well, the TCG runtime flow is the following:
- TCG takes the target binary, and splits it into target blocks
- if the TB is not cached, TCG translates it (or better, the target 
instructions it is composed of) into TCG micro ops,

- TCG compiles the TCG uops into host object code,
- TCG caches the TB,
- TCG tries to chain the block with others,
- TCG copies the TB into the execution buffer
- TCG runs it
Am I right? Please correct me where I am wrong, as I want to use this 
flow scheme to try to understand the code..

Thank you very much in advance!
Stefano B.