Re: [PATCH RFC 0/3] Static calls
On Mon, Nov 12, 2018 at 12:03:34PM -0500, Steven Rostedt wrote:
> On Sun, 11 Nov 2018 23:30:55 -0600
> Josh Poimboeuf wrote:
>
> > > How much of that slowdown is reversed?
> >
> > In theory, it should reverse all of the slowdown, and actually may even
> > speed it up a little. Steve is working on measuring that now.
>
> When I'm able to get it to work! Hopefully that last patch snippet you
> posted will help. If not, I'm assuming you'll be in Vancouver this
> week, and we could sit down and work it out.

Sure, I'm already in Vancouver. Just grab me if you see me, or ping me
on IRC/email. Or feel free to send me your patches if it's still giving
you trouble.

> That said, I don't expect a 100% improvement. Because the retpoline
> causes slow down in other areas than just tracing, which is not being
> fixed by this. I'm expecting a substantial improvement (which I see
> good improvement with the unoptimized static calls), and hoping for
> much more with the optimized one (when I get it working). But not 100%,
> as stated above.

Ah, ok. Makes sense.

--
Josh
Re: [PATCH RFC 0/3] Static calls
On Mon, Nov 12, 2018 at 10:39:52AM +0100, Ard Biesheuvel wrote:
> On Mon, 12 Nov 2018 at 06:31, Josh Poimboeuf wrote:
> >
> > On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
> > >
> > > * Josh Poimboeuf wrote:
> > >
> > > > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > > > everywhere else. I may experiment with a GCC plugin instead.
> > > > >
> > > > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > > > approach while GCC plugin would have to be replicated for Clang and any
> > > > > other compilers, etc.
> > > >
> > > > The benefit of a plugin is that we'd only need two of them: GCC and
> > > > Clang. And presumably, they'd share a lot of code.
> > > >
>
> Having looked into this, I don't think they will share any code at
> all, to be honest. Perhaps some macros and string templates, that's
> all.

Oh well. That should still be easier to maintain than objtool across
all arches at this point.

> > > > The prospect of porting objtool to all architectures is going to be much
> > > > more of a daunting task (though we are at least already considering it
> > > > for some arches).
> > >
> > > Which architectures would benefit from ORC support the most?
> >
> > According to my (limited and potentially flawed) knowledge, I think
> > arm64 would benefit the most performance-wise, whereas powerpc and s390
> > gains would be quite a bit less.
> >
>
> What would arm64 gain from ORC and/or objtool?

Other than live patching, the biggest benefit would be an
across-the-board performance improvement from disabling frame pointers.
It would be interesting to see some arm64 performance numbers there,
for a kernel compiled with -fomit-frame-pointer.

For more details (and benefits of) ORC see
Documentation/x86/orc-unwinder.txt.

Objtool has also come in handy for other cases, like ensuring retpolines
are used everywhere.

Over time, I would like to move some objtool functionality to compiler
plugins, such that it would be easier to port it to other arches.

> > We may have to port objtool to arm64 anyway, for live patching.
>
> Is this about the reliable stack traces, i.e., the ability to detect
> non-leaf functions that don't create stack frames? I think we should
> be able to manage this without objtool on arm64 tbh.

Hm? How else would you ensure all functions honor CONFIG_FRAME_POINTER,
and continue to do so indefinitely?

> > But
> > that will be a lot more work than it took for Ard to write a GCC plugin.
> >
> > > I really think that hard reliance on GCC plugins is foolish
> >
> > Funny, I feel the same way about hard reliance on objtool :-)
> >
>
> I tend to agree here. I think objtool is a necessary evil (as are
> compiler plugins, for that matter) which I hope does not spread to
> other architectures.

I agree that it's a necessary evil, but it may be necessary on arm64 for
live patching.

> But the main difference is that the GCC plugin is only ~50 lines (for
> this particular use case, and minus another 50 lines of boilerplate),
> whereas objtool (AIUI) duplicates lots and lots of functionality of
> the compiler, assembler and/or linker, to mangle relocations, create
> new sections etc etc. Porting this to other architectures is going to
> be a major maintenance effort, especially when I think of, e.g.,
> 32-bit ARM with its Thumb2 quirks and other idiosyncrasies that are
> currently hidden in the toolchain. Other architectures should be first
> class citizens if objtool gains support for them, which means that the
> x86 people that own it currently are on the hook for testing their
> changes against architectures they are not familiar with.

Sounds like we could use you as a co-maintainer then :-)

BTW, AFAIK, there are no plans to support live patching for 32-bit ARM.

--
Josh
Re: [PATCH RFC 0/3] Static calls
On Sun, 11 Nov 2018 23:30:55 -0600
Josh Poimboeuf wrote:

> > How much of that slowdown is reversed?
>
> In theory, it should reverse all of the slowdown, and actually may even
> speed it up a little. Steve is working on measuring that now.

When I'm able to get it to work! Hopefully that last patch snippet you
posted will help. If not, I'm assuming you'll be in Vancouver this
week, and we could sit down and work it out.

That said, I don't expect a 100% improvement. Because the retpoline
causes slow down in other areas than just tracing, which is not being
fixed by this. I'm expecting a substantial improvement (which I see
good improvement with the unoptimized static calls), and hoping for
much more with the optimized one (when I get it working). But not 100%,
as stated above.

-- Steve
Re: [PATCH RFC 0/3] Static calls
On Mon, 12 Nov 2018 at 06:31, Josh Poimboeuf wrote:
>
> On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
> >
> > * Josh Poimboeuf wrote:
> >
> > > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > > everywhere else. I may experiment with a GCC plugin instead.
> > > >
> > > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > > approach while GCC plugin would have to be replicated for Clang and any
> > > > other compilers, etc.
> > >
> > > The benefit of a plugin is that we'd only need two of them: GCC and
> > > Clang. And presumably, they'd share a lot of code.
> > >

Having looked into this, I don't think they will share any code at
all, to be honest. Perhaps some macros and string templates, that's
all.

> > > The prospect of porting objtool to all architectures is going to be much
> > > more of a daunting task (though we are at least already considering it
> > > for some arches).
> >
> > Which architectures would benefit from ORC support the most?
>
> According to my (limited and potentially flawed) knowledge, I think
> arm64 would benefit the most performance-wise, whereas powerpc and s390
> gains would be quite a bit less.
>

What would arm64 gain from ORC and/or objtool?

> We may have to port objtool to arm64 anyway, for live patching.

Is this about the reliable stack traces, i.e., the ability to detect
non-leaf functions that don't create stack frames? I think we should
be able to manage this without objtool on arm64 tbh.

> But
> that will be a lot more work than it took for Ard to write a GCC plugin.
>
> > I really think that hard reliance on GCC plugins is foolish
>
> Funny, I feel the same way about hard reliance on objtool :-)
>

I tend to agree here. I think objtool is a necessary evil (as are
compiler plugins, for that matter) which I hope does not spread to
other architectures. But the main difference is that the GCC plugin is
only ~50 lines (for this particular use case, and minus another 50
lines of boilerplate), whereas objtool (AIUI) duplicates lots and lots
of functionality of the compiler, assembler and/or linker, to mangle
relocations, create new sections etc etc. Porting this to other
architectures is going to be a major maintenance effort, especially
when I think of, e.g., 32-bit ARM with its Thumb2 quirks and other
idiosyncrasies that are currently hidden in the toolchain. Other
architectures should be first class citizens if objtool gains support
for them, which means that the x86 people that own it currently are on
the hook for testing their changes against architectures they are not
familiar with. This obviously applies equally to compiler plugins, but
those have a lot more focus.

> > - but maybe Clang's plugin infrastructure is a guarantee that it
> > remains a sane and usable interface.
>
> Hopefully so. If it breaks, we could always write another tool, as the
> work is straightforward. Or we could make it an objtool subcommand
> which works on all arches.
>
> > > > All other usecases are bonus, but it would certainly be interesting to
> > > > investigate the impact of using these APIs for tracing: that too is a
> > > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > > users - so literally every single cycle or instruction saved or hot-path
> > > > shortened is a major win.
> > >
> > > With retpolines, and with tracepoints enabled, it's definitely a major
> > > win. Steve measured an 8.9% general slowdown on hackbench caused by
> > > retpolines.
> >
> > How much of that slowdown is reversed?
>
> In theory, it should reverse all of the slowdown, and actually may even
> speed it up a little. Steve is working on measuring that now.
>
> --
> Josh
Re: [PATCH RFC 0/3] Static calls
On Sun, Nov 11, 2018 at 9:02 PM Ingo Molnar wrote:
>
>
> * Josh Poimboeuf wrote:
>
> > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > everywhere else. I may experiment with a GCC plugin instead.
> > >
> > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > approach while GCC plugin would have to be replicated for Clang and any
> > > other compilers, etc.
> >
> > The benefit of a plugin is that we'd only need two of them: GCC and
> > Clang. And presumably, they'd share a lot of code.
> >
> > The prospect of porting objtool to all architectures is going to be much
> > more of a daunting task (though we are at least already considering it
> > for some arches).
>
> Which architectures would benefit from ORC support the most?
>
> I really think that hard reliance on GCC plugins is foolish - but maybe
> Clang's plugin infrastructure is a guarantee that it remains a sane and
> usable interface.
>
> > > I'd be very happy with a demonstrated paravirt optimization already -
> > > i.e. seeing the before/after effect on the vmlinux with an x86 distro
> > > config.
> > >
> > > All major Linux distributions enable CONFIG_PARAVIRT=y and
> > > CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> > > as possible in the 99.999% cases where it's not used is a primary
> > > concern.
> >
> > For paravirt, I was thinking of it as more of a cleanup than an
> > optimization. The paravirt patching code already replaces indirect
> > branches with direct ones -- see paravirt_patch_default().
> >
> > Though it *would* reduce the instruction footprint a bit, as the 7-byte
> > indirect calls (later patched to 5-byte direct + 2-byte nop) would
> > instead be 5-byte direct calls to begin with.
>
> Yes.

It would be a huge cleanup IMO -- the existing PVOP call stuff is
really quite ugly IMO. Also, the existing stuff tries to emulate the
semantics of passing parameters of unknown types using asm
constraints, and I just don't believe that GCC does what we want it to
do. In general, passing the *value* of a pointer to asm doesn't seem
to convince gcc that the pointed-to value is used by the asm, and this
makes me nervous. See commit 715bd9d12f84d8f5cc8ad21d888f9bc304a8eb0b
as an example of this.

In a similar vein, the existing PVOP calls have a "memory" clobber,
and that's not free.
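
The constraint concern raised above can be illustrated with a small sketch. This is not the PVOP code and not the code from the cited commit; load_via_asm() is a hypothetical helper, the x86 branch assumes GCC or Clang extended asm in the default AT&T syntax, and other architectures simply fall back to a plain load.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only, not kernel code. */
static inline uint32_t load_via_asm(const uint32_t *p)
{
	uint32_t val;

#if defined(__x86_64__) || defined(__i386__)
	/*
	 * Fragile form, shown only as a comment:
	 *
	 *     asm("movl (%1), %0" : "=r"(val) : "r"(p));
	 *
	 * Only the pointer *value* is an input there, so the compiler is
	 * free to assume the asm does not read *p and may reorder or
	 * drop earlier stores to it.
	 *
	 * Passing the pointed-to object as an "m" input states the
	 * dependency explicitly. A "memory" clobber would also work, but
	 * it forces the compiler to assume all memory may be touched,
	 * which is the cost mentioned above.
	 */
	__asm__("movl %1, %0" : "=r"(val) : "m"(*p));
#else
	/* Non-x86 fallback so the sketch builds everywhere. */
	val = *p;
#endif
	return val;
}
```

The "m" operand keeps the dependency narrow: the compiler only has to make *p up to date before the asm, rather than flushing everything it holds in registers.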
Re: [PATCH RFC 0/3] Static calls
On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
>
> * Josh Poimboeuf wrote:
>
> > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > everywhere else. I may experiment with a GCC plugin instead.
> > >
> > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > approach while GCC plugin would have to be replicated for Clang and any
> > > other compilers, etc.
> >
> > The benefit of a plugin is that we'd only need two of them: GCC and
> > Clang. And presumably, they'd share a lot of code.
> >
> > The prospect of porting objtool to all architectures is going to be much
> > more of a daunting task (though we are at least already considering it
> > for some arches).
>
> Which architectures would benefit from ORC support the most?

According to my (limited and potentially flawed) knowledge, I think
arm64 would benefit the most performance-wise, whereas powerpc and s390
gains would be quite a bit less.

We may have to port objtool to arm64 anyway, for live patching. But
that will be a lot more work than it took for Ard to write a GCC plugin.

> I really think that hard reliance on GCC plugins is foolish

Funny, I feel the same way about hard reliance on objtool :-)

> - but maybe Clang's plugin infrastructure is a guarantee that it
> remains a sane and usable interface.

Hopefully so. If it breaks, we could always write another tool, as the
work is straightforward. Or we could make it an objtool subcommand
which works on all arches.

> > > All other usecases are bonus, but it would certainly be interesting to
> > > investigate the impact of using these APIs for tracing: that too is a
> > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > users - so literally every single cycle or instruction saved or hot-path
> > > shortened is a major win.
> >
> > With retpolines, and with tracepoints enabled, it's definitely a major
> > win. Steve measured an 8.9% general slowdown on hackbench caused by
> > retpolines.
>
> How much of that slowdown is reversed?

In theory, it should reverse all of the slowdown, and actually may even
speed it up a little. Steve is working on measuring that now.

--
Josh
Re: [PATCH RFC 0/3] Static calls
* Josh Poimboeuf wrote:

> On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > x86-64 only, which means we have to use the "unoptimized" version
> > > everywhere else. I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
>
> The benefit of a plugin is that we'd only need two of them: GCC and
> Clang. And presumably, they'd share a lot of code.
>
> The prospect of porting objtool to all architectures is going to be much
> more of a daunting task (though we are at least already considering it
> for some arches).

Which architectures would benefit from ORC support the most?

I really think that hard reliance on GCC plugins is foolish - but maybe
Clang's plugin infrastructure is a guarantee that it remains a sane and
usable interface.

> > I'd be very happy with a demonstrated paravirt optimization already -
> > i.e. seeing the before/after effect on the vmlinux with an x86 distro
> > config.
> >
> > All major Linux distributions enable CONFIG_PARAVIRT=y and
> > CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> > as possible in the 99.999% cases where it's not used is a primary
> > concern.
>
> For paravirt, I was thinking of it as more of a cleanup than an
> optimization. The paravirt patching code already replaces indirect
> branches with direct ones -- see paravirt_patch_default().
>
> Though it *would* reduce the instruction footprint a bit, as the 7-byte
> indirect calls (later patched to 5-byte direct + 2-byte nop) would
> instead be 5-byte direct calls to begin with.

Yes.

> > All other usecases are bonus, but it would certainly be interesting to
> > investigate the impact of using these APIs for tracing: that too is a
> > feature enabled everywhere but utilized only by a small fraction of Linux
> > users - so literally every single cycle or instruction saved or hot-path
> > shortened is a major win.
>
> With retpolines, and with tracepoints enabled, it's definitely a major
> win. Steve measured an 8.9% general slowdown on hackbench caused by
> retpolines.

How much of that slowdown is reversed?

> But with tracepoints disabled, I believe static jumps are used, which
> already minimizes the impact on hot paths.

Yeah.

Thanks,

	Ingo
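
The size argument quoted above (a 7-byte indirect call later patched to a 5-byte direct call plus a 2-byte nop) can be sketched in a few lines of C. This is an illustrative userspace sketch, not the kernel's paravirt or static-call patching code: patch_call_site() and its parameters are hypothetical, and a little-endian host is assumed. The byte values are the usual x86-64 encodings: `ff 14 25 <disp32>` for a call through an absolute memory operand, `e8 <rel32>` for a direct call, and `66 90` for a 2-byte nop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INDIRECT_CALL_LEN 7	/* ff 14 25 <disp32>: call *disp32 */
#define DIRECT_CALL_LEN   5	/* e8 <rel32>:        call rel32   */

/*
 * Hypothetical helper: rewrite a 7-byte call-site slot in place so that
 * it holds a 5-byte direct call to 'target' followed by a 2-byte nop.
 * 'site_addr' is the address the slot would have at run time.
 */
static void patch_call_site(uint8_t insn[INDIRECT_CALL_LEN],
			    uint64_t site_addr, uint64_t target)
{
	/* rel32 is relative to the end of the 5-byte call instruction. */
	int64_t rel = (int64_t)(target - (site_addr + DIRECT_CALL_LEN));
	int32_t rel32;

	/* A direct call can only reach targets within +/- 2 GiB. */
	assert(rel >= INT32_MIN && rel <= INT32_MAX);
	rel32 = (int32_t)rel;

	insn[0] = 0xe8;					/* call rel32 */
	memcpy(&insn[1], &rel32, sizeof(rel32));	/* little-endian host assumed */
	insn[5] = 0x66;					/* 2-byte nop: 66 90 */
	insn[6] = 0x90;
}
```

With a call site that is a direct call from the start, only the first five bytes would be emitted and the two padding bytes disappear, which is the footprint reduction described in the quoted text.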
Re: [PATCH RFC 0/3] Static calls
On Sun, Nov 11, 2018 at 02:42:55PM +0100, Ard Biesheuvel wrote:
> On 11 November 2018 at 00:20, Peter Zijlstra wrote:
> > On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> >> On 9 November 2018 at 08:28, Ingo Molnar wrote:
> >> >> - I'm not sure about the objtool approach. Objtool is (currently)
> >> >>   x86-64 only, which means we have to use the "unoptimized" version
> >> >>   everywhere else. I may experiment with a GCC plugin instead.
> >> >
> >> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> >> > approach while GCC plugin would have to be replicated for Clang and any
> >> > other compilers, etc.
> >> >
> >>
> >> I implemented the GCC plugin approach here for arm64
> >
> > I'm confused; I thought we only needed objtool for variable instruction
> > length architectures, because we can't reliably decode our instruction
> > stream. Otherwise we can fairly trivially use the DWARF relocation data,
> > no?
>
> How would that work? We could build vmlinux with --emit-relocs, filter
> out the static jump/call relocations and resolve the symbol names to
> filter the ones associated with calls to trampolines. But then, we
> have to build the static_call_sites section and reinject it back into
> the image in some way, which is essentially objtool, no?

It's a _much_ simpler tool than objtool, but yes, we need a tool that
reads the relocation stuff and (re)injects it in a new section -- we
don't need it on a vmlinux level, it can be done per TU.

Anyway, a GCC plugin (I still have to have a peek at your thing) sounds
like it should work just fine too.
Re: [PATCH RFC 0/3] Static calls
On 11 November 2018 at 00:20, Peter Zijlstra wrote:
> On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
>> On 9 November 2018 at 08:28, Ingo Molnar wrote:
>> >> - I'm not sure about the objtool approach. Objtool is (currently)
>> >>   x86-64 only, which means we have to use the "unoptimized" version
>> >>   everywhere else. I may experiment with a GCC plugin instead.
>> >
>> > I'd prefer the objtool approach. It's a pretty reliable first-principles
>> > approach while GCC plugin would have to be replicated for Clang and any
>> > other compilers, etc.
>> >
>>
>> I implemented the GCC plugin approach here for arm64
>
> I'm confused; I thought we only needed objtool for variable instruction
> length architectures, because we can't reliably decode our instruction
> stream. Otherwise we can fairly trivially use the DWARF relocation data,
> no?

How would that work? We could build vmlinux with --emit-relocs, filter
out the static jump/call relocations and resolve the symbol names to
filter the ones associated with calls to trampolines. But then, we
have to build the static_call_sites section and reinject it back into
the image in some way, which is essentially objtool, no?
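
The filtering step described above can be sketched roughly as follows. This
is only an illustration in user-space C, not code from the patch set: the
struct layouts, the function name collect_static_call_sites() and the
trampoline name prefix are all assumptions made for the example. A real tool
would read ELF relocation records (symbol index and type) rather than
ready-made symbol names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified relocation record. Real ELF relocations
 * carry a symbol index and a relocation type, not a name. */
struct reloc {
    unsigned long offset;   /* location the relocation applies to */
    const char *sym;        /* name of the symbol being referenced */
    int is_call;            /* nonzero for a call/jump relocation */
};

/* Hypothetical entry of the static_call_sites section. */
struct static_call_site {
    unsigned long addr;     /* call site that could be patched later */
    const char *tramp;      /* trampoline the site currently calls */
};

/* Assumed trampoline naming convention, for illustration only. */
#define TRAMP_PREFIX "__static_call_tramp_"

/* Record every call relocation that targets a trampoline.
 * Returns the number of matching sites; stores at most `max` of them. */
size_t collect_static_call_sites(const struct reloc *relocs, size_t n,
                                 struct static_call_site *sites, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (!relocs[i].is_call)
            continue;   /* data relocation, not a call site */
        if (strncmp(relocs[i].sym, TRAMP_PREFIX, strlen(TRAMP_PREFIX)) != 0)
            continue;   /* call to an ordinary function */
        if (found < max) {
            sites[found].addr = relocs[i].offset;
            sites[found].tramp = relocs[i].sym;
        }
        found++;
    }
    return found;
}
```

The resulting array is what would then have to be written back into the
object as a new section, which is the "reinject" step the message refers to.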
Re: [PATCH RFC 0/3] Static calls
On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 08:28, Ingo Molnar wrote:
> >> - I'm not sure about the objtool approach. Objtool is (currently)
> >>   x86-64 only, which means we have to use the "unoptimized" version
> >>   everywhere else. I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
> >
>
> I implemented the GCC plugin approach here for arm64

I'm confused; I thought we only needed objtool for variable instruction
length architectures, because we can't reliably decode our instruction
stream. Otherwise we can fairly trivially use the DWARF relocation data,
no?
Re: [PATCH RFC 0/3] Static calls
On Fri, 9 Nov 2018 11:05:51 -0800
Andy Lutomirski wrote:

> > On Nov 9, 2018, at 10:42 AM, Steven Rostedt wrote:
> >
> > On Fri, 9 Nov 2018 10:41:37 -0600
> > Josh Poimboeuf wrote:
> >
> >>> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> >>>> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> >>>>> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar wrote:
> >>>>>
> >>>>> All other usecases are bonus, but it would certainly be interesting to
> >>>>> investigate the impact of using these APIs for tracing: that too is a
> >>>>> feature enabled everywhere but utilized only by a small fraction of Linux
> >>>>> users - so literally every single cycle or instruction saved or hot-path
> >>>>> shortened is a major win.
> >>>>
> >>>> For tracing, we'd want static_call_set_to_nop() or something like that,
> >>>> right?
> >>>
> >>> Are we talking about tracepoints? Or ftrace?
> >>
> >> Since ftrace changes calls to nops, and vice versa, I assume you meant
> >> ftrace. I don't think ftrace is a good candidate for this, as it's
> >> inherently more flexible than this API would reasonably allow.
> >>
> >
> > Not sure what Andy was talking about, but I'm currently implementing
> > tracepoints to use this, as tracepoints use indirect calls, and are a
> > prime candidate for static calls, as I showed in my original RFC of
> > this feature.
> >
>
> Indeed.
>
> Although I had assumed that tracepoints already had appropriate jump label
> magic.

As far as I know, the jump label magic is for reducing the overhead when
the tracepoint is OFF (because it can skip function parameter preparation),
and this static call will be good when the tracepoint is ON (enabled)
because it can avoid the retpoline performance degradation.

Thank you,

--
Masami Hiramatsu
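
The split between the OFF and ON cases can be pictured with a small
user-space model. This is a sketch only: the names tp_enabled, tp_probe,
tp_set() and trace_event() are hypothetical, and both the flag and the call
target are ordinary variables here, whereas in the kernel the flag would be
a patched jump label and the call would be the site a static call replaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's unlikely(); a GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

typedef void (*probe_fn)(int value);

/* Hypothetical tracepoint state, modelled with plain variables. */
static bool tp_enabled;     /* models the jump label (static key) */
static probe_fn tp_probe;   /* models the call target */

static int probe_hits;      /* number of values delivered to the probe */
static int args_computed;   /* how often arguments were prepared */

static void counting_probe(int value) { probe_hits += value; }

/* Models argument preparation that has a cost. */
static int expensive_arg(void) { args_computed++; return 1; }

void trace_event(void)
{
    /* OFF: only the branch is executed; no argument is prepared. */
    if (unlikely(tp_enabled))
        /* ON: this indirect call is what becomes slow with
         * retpolines and what a static call would make direct. */
        tp_probe(expensive_arg());
}

/* Attach a probe (or detach with NULL) and flip the key to match. */
void tp_set(probe_fn fn)
{
    tp_probe = fn;
    tp_enabled = (fn != NULL);
}
```

With no probe attached, trace_event() never calls expensive_arg(); after
tp_set(counting_probe) each call prepares one argument and performs one
indirect call.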
Re: [PATCH RFC 0/3] Static calls
On 09/11/2018 16.16, Andy Lutomirski wrote:
> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar wrote:
>>
>> All other usecases are bonus, but it would certainly be interesting to
>> investigate the impact of using these APIs for tracing: that too is a
>> feature enabled everywhere but utilized only by a small fraction of Linux
>> users - so literally every single cycle or instruction saved or hot-path
>> shortened is a major win.
>
> For tracing, we'd want static_call_set_to_nop() or something like that, right?
>

Hm. IIUC, when gcc sees static_call(key)(...), it has to generate code
to put the right values in %rdi, %rsi etc. Even if the function is
void (*)(void), gcc would still need to shuffle things around (either
spill and reload, or move %rdi to some callee-saved register). So if the
static_call is nop'ed out most of the time, that seems like a net loss?
With an unlikely static_key, gcc can do all the parameter setup and
reloading in an out-of-line chunk of code.

Static calls seem like a quite useful concept, but only/mostly if _some_
function needs to be called at that spot.

Aside: there should be some compile-time check that
static_call_set_to_nop can only be used if the return type is void.

Rasmus
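
One possible shape for such a compile-time check, as a sketch only: the
macros RETURNS_VOID() and ASSERT_RETURNS_VOID() below are hypothetical and
not part of the posted patches. They rely on the GCC/Clang extensions
__typeof__ and __builtin_types_compatible_p, and they need sample arguments
so that the call expression can be type-checked (it is never evaluated).

```c
/* Hypothetical helper: evaluates to 1 if calling `fn` with the given
 * arguments has type void, 0 otherwise. The call sits inside
 * __typeof__, so it is only type-checked, never executed. */
#define RETURNS_VOID(fn, ...) \
    __builtin_types_compatible_p(__typeof__((fn)(__VA_ARGS__)), void)

/* Hypothetical build-time guard a set-to-nop operation could use. */
#define ASSERT_RETURNS_VOID(fn, ...)                        \
    _Static_assert(RETURNS_VOID(fn, __VA_ARGS__),           \
                   "nop'ing out a call requires a void return type")

/* Two example targets, for illustration only. */
void probe_void(int x) { (void)x; }
int probe_int(int x) { return x; }

ASSERT_RETURNS_VOID(probe_void, 0);     /* accepted */
/* ASSERT_RETURNS_VOID(probe_int, 0);      would break the build */
```

The point of the guard is that a call whose result is consumed cannot
simply be replaced by a nop, since the caller would then read a stale
return register.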
Re: [PATCH RFC 0/3] Static calls
On Fri, Nov 09, 2018 at 02:59:18PM -0500, Steven Rostedt wrote:
> On Fri, 9 Nov 2018 13:44:09 -0600
> Josh Poimboeuf wrote:
>
> > On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> > > On Fri, 9 Nov 2018 11:05:51 -0800
> > > Andy Lutomirski wrote:
> > >
> > > > > Not sure what Andy was talking about, but I'm currently implementing
> > > > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > > > prime candidate for static calls, as I showed in my original RFC of
> > > > > this feature.
> > > > >
> > > >
> > > > Indeed.
> > > >
> > > > Although I had assumed that tracepoints already had appropriate jump
> > > > label magic.
> > >
> > > It does. But that's not the problem I was trying to solve. It's that
> > > tracing took an 8% nose dive with retpolines when enabled (hackbench
> > > slowed down by 8% with all the trace events enabled compared to all
> > > trace events enabled without retpoline). That is, normal users (those
> > > not tracing) are not affected by trace events slowing down by
> > > retpoline. Those that care about performance when they are tracing, are
> > > affected by retpoline, quite drastically.
> > >
> > > I'm doing another test run and measurements, to see how the unoptimized
> > > trampolines help, followed by the trampoline case.
> >
> > Are you sure you're using unoptimized? Optimized is the default on
> > x86-64 (with my third patch).
> >
>
> Yes, because I haven't applied that third patch yet ;-)
>
> Then I'll apply it and see how much that improves things.

Ah, good. That will be interesting to see the difference between
optimized/unoptimized.

--
Josh
Re: [PATCH RFC 0/3] Static calls
On Fri, 9 Nov 2018 13:44:09 -0600
Josh Poimboeuf wrote:

> On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> > On Fri, 9 Nov 2018 11:05:51 -0800
> > Andy Lutomirski wrote:
> >
> > > > Not sure what Andy was talking about, but I'm currently implementing
> > > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > > prime candidate for static calls, as I showed in my original RFC of
> > > > this feature.
> > > >
> > >
> > > Indeed.
> > >
> > > Although I had assumed that tracepoints already had appropriate jump
> > > label magic.
> >
> > It does. But that's not the problem I was trying to solve. It's that
> > tracing took an 8% nose dive with retpolines when enabled (hackbench
> > slowed down by 8% with all the trace events enabled compared to all
> > trace events enabled without retpoline). That is, normal users (those
> > not tracing) are not affected by trace events slowing down by
> > retpoline. Those that care about performance when they are tracing, are
> > affected by retpoline, quite drastically.
> >
> > I'm doing another test run and measurements, to see how the unoptimized
> > trampolines help, followed by the trampoline case.
>
> Are you sure you're using unoptimized? Optimized is the default on
> x86-64 (with my third patch).
>

Yes, because I haven't applied that third patch yet ;-)

Then I'll apply it and see how much that improves things.

-- Steve
Re: [PATCH RFC 0/3] Static calls
On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> On Fri, 9 Nov 2018 11:05:51 -0800
> Andy Lutomirski wrote:
>
> > > Not sure what Andy was talking about, but I'm currently implementing
> > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > prime candidate for static calls, as I showed in my original RFC of
> > > this feature.
> > >
> >
> > Indeed.
> >
> > Although I had assumed that tracepoints already had appropriate jump label
> > magic.
>
> It does. But that's not the problem I was trying to solve. It's that
> tracing took an 8% nose dive with retpolines when enabled (hackbench
> slowed down by 8% with all the trace events enabled compared to all
> trace events enabled without retpoline). That is, normal users (those
> not tracing) are not affected by trace events slowing down by
> retpoline. Those that care about performance when they are tracing, are
> affected by retpoline, quite drastically.
>
> I'm doing another test run and measurements, to see how the unoptimized
> trampolines help, followed by the trampoline case.

Are you sure you're using unoptimized? Optimized is the default on
x86-64 (with my third patch).

--
Josh
Re: [PATCH RFC 0/3] Static calls
On Fri, 9 Nov 2018 11:05:51 -0800
Andy Lutomirski wrote:

> > Not sure what Andy was talking about, but I'm currently implementing
> > tracepoints to use this, as tracepoints use indirect calls, and are a
> > prime candidate for static calls, as I showed in my original RFC of
> > this feature.
> >
>
> Indeed.
>
> Although I had assumed that tracepoints already had appropriate jump label
> magic.

It does. But that's not the problem I was trying to solve. It's that
tracing took an 8% nose dive with retpolines when enabled (hackbench
slowed down by 8% with all the trace events enabled compared to all
trace events enabled without retpoline). That is, normal users (those
not tracing) are not affected by trace events slowing down by
retpoline. Those that care about performance when they are tracing, are
affected by retpoline, quite drastically.

I'm doing another test run and measurements, to see how the unoptimized
trampolines help, followed by the trampoline case.

-- Steve
Re: [PATCH RFC 0/3] Static calls
> On Nov 9, 2018, at 10:42 AM, Steven Rostedt wrote:
>
> On Fri, 9 Nov 2018 10:41:37 -0600
> Josh Poimboeuf wrote:
>
>>> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
>>>> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
>>>>> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar wrote:
>>>>>
>>>>> All other usecases are bonus, but it would certainly be interesting to
>>>>> investigate the impact of using these APIs for tracing: that too is a
>>>>> feature enabled everywhere but utilized only by a small fraction of Linux
>>>>> users - so literally every single cycle or instruction saved or hot-path
>>>>> shortened is a major win.
>>>>
>>>> For tracing, we'd want static_call_set_to_nop() or something like that, right?
>>>
>>> Are we talking about tracepoints? Or ftrace?
>>
>> Since ftrace changes calls to nops, and vice versa, I assume you meant
>> ftrace. I don't think ftrace is a good candidate for this, as it's
>> inherently more flexible than this API would reasonably allow.
>>
>
> Not sure what Andy was talking about, but I'm currently implementing
> tracepoints to use this, as tracepoints use indirect calls, and are a
> prime candidate for static calls, as I showed in my original RFC of
> this feature.
>

Indeed.

Although I had assumed that tracepoints already had appropriate jump label
magic.
Re: [PATCH RFC 0/3] Static calls
On Fri, 9 Nov 2018 10:41:37 -0600
Josh Poimboeuf wrote:

> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> > On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> > > On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar wrote:
> > > >
> > > > All other usecases are bonus, but it would certainly be interesting to
> > > > investigate the impact of using these APIs for tracing: that too is a
> > > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > > users - so literally every single cycle or instruction saved or hot-path
> > > > shortened is a major win.
> > >
> > > For tracing, we'd want static_call_set_to_nop() or something like that,
> > > right?
> >
> > Are we talking about tracepoints? Or ftrace?
>
> Since ftrace changes calls to nops, and vice versa, I assume you meant
> ftrace. I don't think ftrace is a good candidate for this, as it's
> inherently more flexible than this API would reasonably allow.
>

Not sure what Andy was talking about, but I'm currently implementing
tracepoints to use this, as tracepoints use indirect calls, and are a
prime candidate for static calls, as I showed in my original RFC of
this feature.

-- Steve
Re: [PATCH RFC 0/3] Static calls
On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> > On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar wrote:
> > >
> > > All other usecases are bonus, but it would certainly be interesting to
> > > investigate the impact of using these APIs for tracing: that too is a
> > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > users - so literally every single cycle or instruction saved or hot-path
> > > shortened is a major win.
> >
> > For tracing, we'd want static_call_set_to_nop() or something like that,
> > right?
>
> Are we talking about tracepoints? Or ftrace?

Since ftrace changes calls to nops, and vice versa, I assume you meant
ftrace. I don't think ftrace is a good candidate for this, as it's
inherently more flexible than this API would reasonably allow.

--
Josh
Re: [PATCH RFC 0/3] Static calls
On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar wrote:
> >
> >
> > All other usecases are bonus, but it would certainly be interesting to
> > investigate the impact of using these APIs for tracing: that too is a
> > feature enabled everywhere but utilized only by a small fraction of Linux
> > users - so literally every single cycle or instruction saved or hot-path
> > shortened is a major win.
>
> For tracing, we'd want static_call_set_to_nop() or something like that, right?

Are we talking about tracepoints? Or ftrace?

-- Josh
Re: [PATCH RFC 0/3] Static calls
On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 08:28, Ingo Molnar wrote:
> >
> > * Josh Poimboeuf wrote:
> >
> >> These patches are related to two similar patch sets from Ard and Steve:
> >>
> >> -
> >> https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
> >> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
> >>
> >> The code is also heavily inspired by the jump label code, as some of the
> >> concepts are very similar.
> >>
> >> There are three separate implementations, depending on what the arch
> >> supports:
> >>
> >> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> >>    objtool and a small amount of arch code
> >>
> >> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> >>    a small amount of arch code
> >>
> >> 3) If no arch support, fall back to regular function pointers
> >>
> >>
> >> TODO:
> >>
> >> - I'm not sure about the objtool approach. Objtool is (currently)
> >>   x86-64 only, which means we have to use the "unoptimized" version
> >>   everywhere else. I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
> >
>
> I implemented the GCC plugin approach here for arm64
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls
>
> That implements both the unoptimized and the optimized versions.

Nice! That was fast :-)

> I do take your point about GCC and other compilers, but on arm64 we
> don't have a lot of choice.
>
> As far as I can tell, the GCC plugin is generic (i.e., it does not
> rely on any ARM specific passes), but obviously, this requires a *lot*
> of testing and validation to be taken seriously.

Yeah. I haven't had a chance to try your plugin on x86 yet, but in
theory it should be arch-independent.

-- Josh
Re: [PATCH RFC 0/3] Static calls
On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar wrote:
>
>
> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.

For tracing, we'd want static_call_set_to_nop() or something like that, right?

--Andy
Re: [PATCH RFC 0/3] Static calls
On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > - I'm not sure about the objtool approach. Objtool is (currently)
> >   x86-64 only, which means we have to use the "unoptimized" version
> >   everywhere else. I may experiment with a GCC plugin instead.
>
> I'd prefer the objtool approach. It's a pretty reliable first-principles
> approach while GCC plugin would have to be replicated for Clang and any
> other compilers, etc.

The benefit of a plugin is that we'd only need two of them: GCC and
Clang. And presumably, they'd share a lot of code.

The prospect of porting objtool to all architectures is going to be much
more of a daunting task (though we are at least already considering it
for some arches).

> > - Does this feature have much value without retpolines? If not, should
> >   we make it depend on retpolines somehow?
>
> Paravirt patching, as you mention in your later reply?
>
> > - Find some actual users of the interfaces (tracepoints? crypto?)
>
> I'd be very happy with a demonstrated paravirt optimization already -
> i.e. seeing the before/after effect on the vmlinux with an x86 distro
> config.
>
> All major Linux distributions enable CONFIG_PARAVIRT=y and
> CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> as possible in the 99.999% cases where it's not used is a primary
> concern.

For paravirt, I was thinking of it as more of a cleanup than an
optimization. The paravirt patching code already replaces indirect
branches with direct ones -- see paravirt_patch_default().

Though it *would* reduce the instruction footprint a bit, as the 7-byte
indirect calls (later patched to 5-byte direct + 2-byte nop) would
instead be 5-byte direct calls to begin with.

> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.

With retpolines, and with tracepoints enabled, it's definitely a major
win. Steve measured an 8.9% general slowdown on hackbench caused by
retpolines.

But with tracepoints disabled, I believe static jumps are used, which
already minimizes the impact on hot paths.

-- Josh
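To make the "5-byte direct call" in the message above concrete, here is a minimal userspace sketch of how such an instruction is encoded on x86-64: opcode 0xe8 followed by a 32-bit little-endian displacement measured from the end of the instruction. The helper name encode_direct_call() and the two macros are local to this sketch and are not taken from the patch set or from the paravirt code.

```c
#include <stdint.h>

/* Sketch-local constants: 1 opcode byte + 4 displacement bytes. */
#define CALL_INSN_SIZE   5
#define CALL_INSN_OPCODE 0xe8

/*
 * Hypothetical helper (not from the patch set): fill insn[] with the
 * bytes of "call target" as they would sit at address site.
 * Returns 0 on success, -1 if target is out of rel32 range.
 */
static int encode_direct_call(uint8_t insn[CALL_INSN_SIZE],
                              uint64_t site, uint64_t target)
{
    /* The displacement is relative to the *next* instruction. */
    int64_t rel = (int64_t)(target - (site + CALL_INSN_SIZE));
    uint32_t rel32;
    int i;

    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;

    rel32 = (uint32_t)(int32_t)rel;
    insn[0] = CALL_INSN_OPCODE;
    /* Store the displacement byte by byte, least significant first. */
    for (i = 0; i < 4; i++)
        insn[1 + i] = (uint8_t)(rel32 >> (8 * i));
    return 0;
}
```

A call placed at 0x1000 that targets 0x2000 gets displacement 0x2000 - 0x1005 = 0xffb, so the bytes are e8 fb 0f 00 00. Retargeting the call means rewriting only those four displacement bytes, which is why a call site that is direct from the start needs no trailing nop padding.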
Re: [PATCH RFC 0/3] Static calls
On 9 November 2018 at 08:28, Ingo Molnar wrote:
>
> * Josh Poimboeuf wrote:
>
>> These patches are related to two similar patch sets from Ard and Steve:
>>
>> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
>> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
>>
>> The code is also heavily inspired by the jump label code, as some of the
>> concepts are very similar.
>>
>> There are three separate implementations, depending on what the arch
>> supports:
>>
>> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>>    objtool and a small amount of arch code
>>
>> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>>    a small amount of arch code
>>
>> 3) If no arch support, fall back to regular function pointers
>>
>>
>> TODO:
>>
>> - I'm not sure about the objtool approach. Objtool is (currently)
>>   x86-64 only, which means we have to use the "unoptimized" version
>>   everywhere else. I may experiment with a GCC plugin instead.
>
> I'd prefer the objtool approach. It's a pretty reliable first-principles
> approach while GCC plugin would have to be replicated for Clang and any
> other compilers, etc.
>

I implemented the GCC plugin approach here for arm64

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls

That implements both the unoptimized and the optimized versions.

I do take your point about GCC and other compilers, but on arm64 we
don't have a lot of choice.

As far as I can tell, the GCC plugin is generic (i.e., it does not
rely on any ARM specific passes), but obviously, this requires a *lot*
of testing and validation to be taken seriously.

>> - Does this feature have much value without retpolines? If not, should
>>   we make it depend on retpolines somehow?
>
> Paravirt patching, as you mention in your later reply?
>
>> - Find some actual users of the interfaces (tracepoints? crypto?)
>
> I'd be very happy with a demonstrated paravirt optimization already -
> i.e. seeing the before/after effect on the vmlinux with an x86 distro
> config.
>
> All major Linux distributions enable CONFIG_PARAVIRT=y and
> CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> as possible in the 99.999% cases where it's not used is a primary
> concern.
>
> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.
>
> Thanks,
>
> Ingo
Re: [PATCH RFC 0/3] Static calls
* Ingo Molnar wrote:

> > - Does this feature have much value without retpolines? If not, should
> >   we make it depend on retpolines somehow?
>
> Paravirt patching, as you mention in your later reply?

BTW., to look for candidates of this API, I'd suggest looking at the
function call frequency of my (almost-)distro kernel vmlinux:

  $ objdump -d vmlinux | grep -w callq | cut -f3- | sort | uniq -c | sort -n | tail -100

which gives:

    502 callq 8157d050
    522 callq 81aaf420
    536 callq 81547e60 <_copy_to_user>
    615 callq 81a97700
    624 callq *0x82648428
    624 callq 810cc810 <__might_sleep>
    625 callq 81a93b90
    649 callq 81547dd0 <_copy_from_user>
    651 callq 811ba930
    654 callq 8170b6f0 <_dev_warn>
    691 callq 81a93790
    693 callq 81a88dc0
    709 callq *0x82648438
    723 callq 811bdbd0
    735 callq 810feac0
    750 callq 8163e9f0
    768 callq *0x82648430
    814 callq 81ab2710 <_raw_spin_lock_irq>
    841 callq 81a9e680 <__memcpy>
    863 callq 812ae3d0 <__kmalloc>
    899 callq 8126ac80 <__might_fault>
    912 callq 81ab2970 <_raw_spin_unlock_irq>
    939 callq 81aaaf10 <_cond_resched>
    966 callq 811bda00
   1069 callq 81126f50
   1078 callq 81097760 <__warn_printk>
   1081 callq 8157b140 <__dynamic_dev_dbg>
   1351 callq 8170b630 <_dev_err>
   1365 callq 811050c0
   1373 callq 81a977f0
   1390 callq 8157b090 <__dynamic_pr_debug>
   1453 callq 8155c650 <__list_add_valid>
   1501 callq 812ad6f0
   1509 callq 8155c6c0 <__list_del_entry_valid>
   1513 callq 81310ce0
   1571 callq 81ab2780 <_raw_spin_lock_irqsave>
   1624 callq 81ab29b0 <_raw_spin_unlock_irqrestore>
   1661 callq 81126fd0
   1986 callq 81104940
   2050 callq 811c5110
   2133 callq 81102c70
   2507 callq 81ab2560 <_raw_spin_lock>
   2676 callq 81aadc40
   3056 callq 81ab2900 <_raw_spin_unlock>
   3294 callq 81aac610
   3628 callq 81129100
   4462 callq 812ac2c0
   6454 callq 8111a51e
   6676 callq 81101420
   7328 callq 81e014b0 <__x86_indirect_thunk_rax>
   7598 callq 81126f30
   9065 callq 810979f0 <__stack_chk_fail>

The most prominent callers which are already function call pointers today
are:

  $ objdump -d vmlinux | grep -w callq | grep \* | cut -f3- | sort | uniq -c | sort -n | tail -10

    109 callq *0x82648530
    134 callq *0x82648568
    154 callq *0x826483d0
    260 callq *0x826483d8
    297 callq *0x826483e0
    345 callq *0x82648440
    345 callq *0x82648558
    624 callq *0x82648428
    709 callq *0x82648438
    768 callq *0x82648430

That's all pv_ops->*() method calls:

  82648300 D pv_ops
  826485d0 D pv_info

Optimizing those thousands of function pointer calls would already be a
nice improvement.

But retpolines:

   7328 callq 81e014b0 <__x86_indirect_thunk_rax>

  81e014b0 <__x86_indirect_thunk_rax>:
  81e014b0: ff e0 jmpq *%rax

... are even more prominent, and turned on in every distro as well,
obviously.

Thanks,

Ingo
Re: [PATCH RFC 0/3] Static calls
* Josh Poimboeuf wrote:

> These patches are related to two similar patch sets from Ard and Steve:
>
> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
>
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>    objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>    a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers
>
>
> TODO:
>
> - I'm not sure about the objtool approach. Objtool is (currently)
>   x86-64 only, which means we have to use the "unoptimized" version
>   everywhere else. I may experiment with a GCC plugin instead.

I'd prefer the objtool approach. It's a pretty reliable first-principles
approach while GCC plugin would have to be replicated for Clang and any
other compilers, etc.

> - Does this feature have much value without retpolines? If not, should
>   we make it depend on retpolines somehow?

Paravirt patching, as you mention in your later reply?

> - Find some actual users of the interfaces (tracepoints? crypto?)

I'd be very happy with a demonstrated paravirt optimization already -
i.e. seeing the before/after effect on the vmlinux with an x86 distro
config.

All major Linux distributions enable CONFIG_PARAVIRT=y and
CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
as possible in the 99.999% cases where it's not used is a primary
concern.

All other usecases are bonus, but it would certainly be interesting to
investigate the impact of using these APIs for tracing: that too is a
feature enabled everywhere but utilized only by a small fraction of Linux
users - so literally every single cycle or instruction saved or hot-path
shortened is a major win.

Thanks,

Ingo
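For readers unfamiliar with the third implementation quoted above (no arch support, fall back to regular function pointers), the following userspace sketch shows roughly what such a fallback could look like. Every macro name here (DEFINE_SKETCH_CALL, sketch_call, sketch_call_update) is invented for illustration and should not be read as the interface in the patch set. It relies on the GNU __typeof__ extension, as kernel code commonly does.

```c
/*
 * Illustrative fallback only: each "static call" is an ordinary
 * function pointer. All macro names are hypothetical.
 */

/* Define the pointer with the same type as its initial target. */
#define DEFINE_SKETCH_CALL(name, initial_func) \
    static __typeof__(initial_func) *sketch_call_##name = initial_func

/* Call through the pointer: an indirect call in this fallback. */
#define sketch_call(name, ...) (sketch_call_##name(__VA_ARGS__))

/* Retarget the call; with arch support this would patch code instead. */
#define sketch_call_update(name, new_func) \
    (sketch_call_##name = (new_func))

/* Two example targets with identical signatures. */
static int add_one(int x) { return x + 1; }
static int add_two(int x) { return x + 2; }

DEFINE_SKETCH_CALL(adjust, add_one);

/* A caller that does not know which target is currently installed. */
static int run_adjust(int x)
{
    return sketch_call(adjust, x);
}
```

With this fallback every call through sketch_call() remains an indirect branch, so it gains nothing under retpolines. The point of the two arch-supported variants is to keep the same caller-side shape while turning that indirect branch into a direct call or a patched trampoline.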
Re: [PATCH RFC 0/3] Static calls
On Thu, Nov 08, 2018 at 03:15:50PM -0600, Josh Poimboeuf wrote:
> - Does this feature have much value without retpolines? If not, should
>   we make it depend on retpolines somehow?

I forgot Andy mentioned that we might be able to use this to clean up
paravirt patching, in which case it would have a lot of value,
retpolines or not...

-- Josh