Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Nuno Diegueswrites: > Hello everyone, > > gently pinging to bring this back to life given the last patch I emailed. The patch is fine for me, but I cannot approve it. -Andi
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Hello everyone, gently pinging to bring this back to life given the last patch I emailed. Best regards, -- Nuno Diegues On Mon, Aug 24, 2015 at 12:51 PM, Nuno Diegueswrote: > Hello everyone, > > after a summer internship and some backlog catching up in the past > weeks, I have finally got around to review the patch according to the > latest discussion. > > The changes have not caused regression, and the latest speedup results > are coherent with what we had before. > > > In the following I add some comments inline to the discussion, and > after that paste the updated patch. > > >> > >> > So let's iterate on this in parallel with the other changes that we need >> > to get in place. I'd prefer to have some more confidence that measuring >> > txn throughput in epochs is the best way forward. >> > >> > Here are a few thoughts: >> > >> > Why do we actually need to measure succeeded transactions? If a HW txn >> > is just getting committed without interfering with anything else, is >> > this really different from, say, two HW txns getting committed? Aren't >> > we really just interested in wasted work, or delayed txns? That may >> > help taking care of the nontxnal vs. txnal problem. >> > >> > Measuring committed txns during a time that might otherwise be spent by >> > a serial txns could be useful to figure out how much other useful work a >> > serial txn prevents. But we'd see that as well if we'd just go serial >> > during the auto-tuning because then concurrent txns will have to wait; >> > and in this case we could measure it in the slow path of those >> > concurrent txns (ie, during waiting, where measurement overhead wouldn't >> > matter as much). >> > >> > If a serial txn is running, concurrent txns (that wait) are free to sync >> > and tell the serial how much cost it creates for the concurrent txns. >> > There, txn length could matter, but we won't find out for real until >> > after the concurrent ones have run (they could be pretty short, so we >> > can't simply assume that the concurrent ones are as long as the serial >> > one, so that simply the number of concurrent ones can be used to >> > calculate delayed work). > > > I understand you concern: measuring failure in a slow path rather than > success in > the fast/common path. > > My problem is that the approach taken in our work is tailored for accessing > one > given metric, in our case some form of throughput of successfully > executed atomic > blocks. > However, to measure failure, we seem to need two things: > 1) the rate of failed hardware transactions > 2) the time spent waiting for the serial lock > > In short, the performance will be bad if there is a mix of those two > issues that is bad > enough. The problem is how to define a metric that considers those two > together? > It is not obvious to me that there is one. Furthermore, that deviates > significantly from > our approach, and in the end --- even if there is a solution that > works in that way --- > that is a different approach, not one that can be applied to our > self-tuning framework. > > What we are doing right now is incrementing a counter in a (most > likely cached line) > that is single-writer and multiple-reader. Most often, it will be > owned by the single-writer. > As such, the cost should be quite negligible compared to the execution > of the atomic > block, more so in the case that a Hardware transaction was used, as > the commit itself > should be a hundred cycles or more. > > A good way to measure this impact is to use the new "htm_notune" flag > and compare > its performance with the non-changed libitm that I use as a baseline above. > The average results are similar, without any trend noticeable outside > the standard > deviation for one side or the other. > > > > >> >> > >> >> >> >> > Also, note that the mitigation strategy for rdtsc >> >> > short-comings that you mention in the paper is not applicable in >> >> > general, specifically not in libitm. >> >> >> >> >> >> I suppose you mean the preemption of threads inflating the cycles >> >> measured? >> > >> > Yes, preemption and migration of threads (in case there's no global >> > sync'ed TSC or similar) -- you mentioned in the paper that you pin >> > threads to cores... >> > >> >> This would be similarly a problem to any time source that tries to >> >> measure the >> >> amount of work performed; not sure how we can avoid it in general. Any >> >> thoughts? >> > >> > Not really as long as we keep depending on measuring time in a >> > light-weight way. Measuring smaller time intervals could make it less >> > likely that preemption happens during such an interval, though. >> > > > > Note that in this patch we use no such pinning. In fact, I've measured > that it does not > have a noticeable impact in performance. Maybe that could be the case with > heavy > over-subscription of the cores with lots of threads. > Still, it is not a correctness issue, so I believe that the fact this >
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Hello everyone, after a summer internship and some backlog catching up in the past weeks, I have finally got around to review the patch according to the latest discussion. The changes have not caused regression, and the latest speedup results are coherent with what we had before. In the following I add some comments inline to the discussion, and after that paste the updated patch. So let's iterate on this in parallel with the other changes that we need to get in place. I'd prefer to have some more confidence that measuring txn throughput in epochs is the best way forward. Here are a few thoughts: Why do we actually need to measure succeeded transactions? If a HW txn is just getting committed without interfering with anything else, is this really different from, say, two HW txns getting committed? Aren't we really just interested in wasted work, or delayed txns? That may help taking care of the nontxnal vs. txnal problem. Measuring committed txns during a time that might otherwise be spent by a serial txns could be useful to figure out how much other useful work a serial txn prevents. But we'd see that as well if we'd just go serial during the auto-tuning because then concurrent txns will have to wait; and in this case we could measure it in the slow path of those concurrent txns (ie, during waiting, where measurement overhead wouldn't matter as much). If a serial txn is running, concurrent txns (that wait) are free to sync and tell the serial how much cost it creates for the concurrent txns. There, txn length could matter, but we won't find out for real until after the concurrent ones have run (they could be pretty short, so we can't simply assume that the concurrent ones are as long as the serial one, so that simply the number of concurrent ones can be used to calculate delayed work). I understand you concern: measuring failure in a slow path rather than success in the fast/common path. My problem is that the approach taken in our work is tailored for accessing one given metric, in our case some form of throughput of successfully executed atomic blocks. However, to measure failure, we seem to need two things: 1) the rate of failed hardware transactions 2) the time spent waiting for the serial lock In short, the performance will be bad if there is a mix of those two issues that is bad enough. The problem is how to define a metric that considers those two together? It is not obvious to me that there is one. Furthermore, that deviates significantly from our approach, and in the end --- even if there is a solution that works in that way --- that is a different approach, not one that can be applied to our self-tuning framework. What we are doing right now is incrementing a counter in a (most likely cached line) that is single-writer and multiple-reader. Most often, it will be owned by the single-writer. As such, the cost should be quite negligible compared to the execution of the atomic block, more so in the case that a Hardware transaction was used, as the commit itself should be a hundred cycles or more. A good way to measure this impact is to use the new htm_notune flag and compare its performance with the non-changed libitm that I use as a baseline above. The average results are similar, without any trend noticeable outside the standard deviation for one side or the other. Also, note that the mitigation strategy for rdtsc short-comings that you mention in the paper is not applicable in general, specifically not in libitm. I suppose you mean the preemption of threads inflating the cycles measured? Yes, preemption and migration of threads (in case there's no global sync'ed TSC or similar) -- you mentioned in the paper that you pin threads to cores... This would be similarly a problem to any time source that tries to measure the amount of work performed; not sure how we can avoid it in general. Any thoughts? Not really as long as we keep depending on measuring time in a light-weight way. Measuring smaller time intervals could make it less likely that preemption happens during such an interval, though. Note that in this patch we use no such pinning. In fact, I've measured that it does not have a noticeable impact in performance. Maybe that could be the case with heavy over-subscription of the cores with lots of threads. Still, it is not a correctness issue, so I believe that the fact this performs great in the common case should make it prevail. Another issue is that we need to implement the tuning in a portable way. You currently make it depend on whether the assembler knows about RTM, whereas the rest of the code makes this more portable and defers to arch-specific files. I'd prefer if we could use the tuning on other archs too. But for that, we need to cleanly separate generic from arch-specific parts. That affects magic numbers as well as things
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Hello everyone, just wanted to ping back to say that I have been overwhelmed with work and will be back on this as soon as possible, most likely during July. Best regards, -- Nuno Diegues On Tue, May 19, 2015 at 3:17 PM, Torvald Riegel trie...@redhat.com wrote: On Mon, 2015-05-18 at 23:27 -0400, Nuno Diegues wrote: On Mon, May 18, 2015 at 5:29 PM, Torvald Riegel trie...@redhat.com wrote: Are there better options for the utility function, or can we tune it to be less affected by varying txn length and likelihood of txnal vs. nontxnal code? What are the things we want to avoid with the tuning? I can think of: * Not needing to wait for serial txns, or being aborted by a serial txn. * Not retrying using HTM too much so that the retry time is larger than the scalability we actually gain by having other txns commit concurrently. Yes, those are the key points we want to make sure that do not happen. Anything else? Did you consider other utility functions during your research? The txnal vs nontxnal is indeed a completely different story. To account for this we would need extra book-keeping to count only cycles spent inside txnal code. So this would require every thread (or a sample of threads) to perform a rdtsc (or equivalent) on every begin/end call rather than the current approach of a single rdtsc per optimization round. With this type of online optimization we found that the algorithm had to be very simple and cheap to execute. RDTSC was a good finding to fit this, and it enabled us to obtain gains. Other time sources failed to do so. I do not have, out of the box, a good alternative to offer. I suppose it would take some iterations of thinking/experimenting with, just like with any research problem :) So let's iterate on this in parallel with the other changes that we need to get in place. I'd prefer to have some more confidence that measuring txn throughput in epochs is the best way forward. Here are a few thoughts: Why do we actually need to measure succeeded transactions? If a HW txn is just getting committed without interfering with anything else, is this really different from, say, two HW txns getting committed? Aren't we really just interested in wasted work, or delayed txns? That may help taking care of the nontxnal vs. txnal problem. Measuring committed txns during a time that might otherwise be spent by a serial txns could be useful to figure out how much other useful work a serial txn prevents. But we'd see that as well if we'd just go serial during the auto-tuning because then concurrent txns will have to wait; and in this case we could measure it in the slow path of those concurrent txns (ie, during waiting, where measurement overhead wouldn't matter as much). If a serial txn is running, concurrent txns (that wait) are free to sync and tell the serial how much cost it creates for the concurrent txns. There, txn length could matter, but we won't find out for real until after the concurrent ones have run (they could be pretty short, so we can't simply assume that the concurrent ones are as long as the serial one, so that simply the number of concurrent ones can be used to calculate delayed work). Also, note that the mitigation strategy for rdtsc short-comings that you mention in the paper is not applicable in general, specifically not in libitm. I suppose you mean the preemption of threads inflating the cycles measured? Yes, preemption and migration of threads (in case there's no global sync'ed TSC or similar) -- you mentioned in the paper that you pin threads to cores... This would be similarly a problem to any time source that tries to measure the amount of work performed; not sure how we can avoid it in general. Any thoughts? Not really as long as we keep depending on measuring time in a light-weight way. Measuring smaller time intervals could make it less likely that preemption happens during such an interval, though. Another issue is that we need to implement the tuning in a portable way. You currently make it depend on whether the assembler knows about RTM, whereas the rest of the code makes this more portable and defers to arch-specific files. I'd prefer if we could use the tuning on other archs too. But for that, we need to cleanly separate generic from arch-specific parts. That affects magic numbers as well as things like rdtsc(). Yes, I refrained from adding new calls to the arch-specific files, to contain the changes mainly. But that is possible and that's part of the feedback I was hoping to get. OK. Let me know if you want further input regarding this. I'm wondering about whether it really makes sense to treat XABORT like conflicts and other abort reasons, instead of like capacity aborts. Perhaps we need to differentiate between the explicit abort codes glibc uses, so that we can distinguish between cases where an abort is
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
On Mon, 2015-05-18 at 23:27 -0400, Nuno Diegues wrote: On Mon, May 18, 2015 at 5:29 PM, Torvald Riegel trie...@redhat.com wrote: Are there better options for the utility function, or can we tune it to be less affected by varying txn length and likelihood of txnal vs. nontxnal code? What are the things we want to avoid with the tuning? I can think of: * Not needing to wait for serial txns, or being aborted by a serial txn. * Not retrying using HTM too much so that the retry time is larger than the scalability we actually gain by having other txns commit concurrently. Yes, those are the key points we want to make sure that do not happen. Anything else? Did you consider other utility functions during your research? The txnal vs nontxnal is indeed a completely different story. To account for this we would need extra book-keeping to count only cycles spent inside txnal code. So this would require every thread (or a sample of threads) to perform a rdtsc (or equivalent) on every begin/end call rather than the current approach of a single rdtsc per optimization round. With this type of online optimization we found that the algorithm had to be very simple and cheap to execute. RDTSC was a good finding to fit this, and it enabled us to obtain gains. Other time sources failed to do so. I do not have, out of the box, a good alternative to offer. I suppose it would take some iterations of thinking/experimenting with, just like with any research problem :) So let's iterate on this in parallel with the other changes that we need to get in place. I'd prefer to have some more confidence that measuring txn throughput in epochs is the best way forward. Here are a few thoughts: Why do we actually need to measure succeeded transactions? If a HW txn is just getting committed without interfering with anything else, is this really different from, say, two HW txns getting committed? Aren't we really just interested in wasted work, or delayed txns? That may help taking care of the nontxnal vs. txnal problem. Measuring committed txns during a time that might otherwise be spent by a serial txns could be useful to figure out how much other useful work a serial txn prevents. But we'd see that as well if we'd just go serial during the auto-tuning because then concurrent txns will have to wait; and in this case we could measure it in the slow path of those concurrent txns (ie, during waiting, where measurement overhead wouldn't matter as much). If a serial txn is running, concurrent txns (that wait) are free to sync and tell the serial how much cost it creates for the concurrent txns. There, txn length could matter, but we won't find out for real until after the concurrent ones have run (they could be pretty short, so we can't simply assume that the concurrent ones are as long as the serial one, so that simply the number of concurrent ones can be used to calculate delayed work). Also, note that the mitigation strategy for rdtsc short-comings that you mention in the paper is not applicable in general, specifically not in libitm. I suppose you mean the preemption of threads inflating the cycles measured? Yes, preemption and migration of threads (in case there's no global sync'ed TSC or similar) -- you mentioned in the paper that you pin threads to cores... This would be similarly a problem to any time source that tries to measure the amount of work performed; not sure how we can avoid it in general. Any thoughts? Not really as long as we keep depending on measuring time in a light-weight way. Measuring smaller time intervals could make it less likely that preemption happens during such an interval, though. Another issue is that we need to implement the tuning in a portable way. You currently make it depend on whether the assembler knows about RTM, whereas the rest of the code makes this more portable and defers to arch-specific files. I'd prefer if we could use the tuning on other archs too. But for that, we need to cleanly separate generic from arch-specific parts. That affects magic numbers as well as things like rdtsc(). Yes, I refrained from adding new calls to the arch-specific files, to contain the changes mainly. But that is possible and that's part of the feedback I was hoping to get. OK. Let me know if you want further input regarding this. I'm wondering about whether it really makes sense to treat XABORT like conflicts and other abort reasons, instead of like capacity aborts. Perhaps we need to differentiate between the explicit abort codes glibc uses, so that we can distinguish between cases where an abort is supposed to signal incompatibility with txnal execution and cases where it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h in current glibc): #define _ABORT_LOCK_BUSY0xff #define _ABORT_LOCK_IS_LOCKED 0xfe #define _ABORT_NESTED_TRYLOCK 0xfd I am
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
On Wed, 2015-04-29 at 23:23 -0400, Nuno Diegues wrote: Hello, I have taken the chance to improve the patch by addressing the comments above in this thread. Namely: - to use a simple random generator managed inside the library only - removed floating point usage and replaced by fixed arithmetic - added some comments where relevant First of all, sorry for taking so long to review this. Thank you for the contribution. There are several high-level changes that I'd like to see (or be convinced of that those aren't necessary). I'll discuss these first. I'll have more detailed comments inlined in the patch. My major concern is about rdtsc being used. The relation to frequency adaption that Andi mentioned is one issue, but the bigger issue to me is that the runtime of transactions might vary a lot, and that the relation of txnal vs. nontxnal code executed in a period might vary a lot too. Are there better options for the utility function, or can we tune it to be less affected by varying txn length and likelihood of txnal vs. nontxnal code? What are the things we want to avoid with the tuning? I can think of: * Not needing to wait for serial txns, or being aborted by a serial txn. * Not retrying using HTM too much so that the retry time is larger than the scalability we actually gain by having other txns commit concurrently. Anything else? Did you consider other utility functions during your research? Also, note that the mitigation strategy for rdtsc short-comings that you mention in the paper is not applicable in general, specifically not in libitm. Another issue is that we need to implement the tuning in a portable way. You currently make it depend on whether the assembler knows about RTM, whereas the rest of the code makes this more portable and defers to arch-specific files. I'd prefer if we could use the tuning on other archs too. But for that, we need to cleanly separate generic from arch-specific parts. That affects magic numbers as well as things like rdtsc(). Generally, the patch needs to include more comments. Most importantly, we need to document why we implemented things the way we did, or do tuning the way we do. In cases where the intent isn't trivial to see nor the code is trivial, explain the intent or reference the place where you document the big picture. A good rule of thumb is adding comments whenever it's not simple to see why we use one of several possible implementation alternatives. The magic numbers being used are a good example. Make them constants and don't just put them directly in the code, and then add documentation why you chose this number and why it's okay to make that choice. If it isn't necessarily okay (eg, other archs, future systems), but you don't have a better solution right now, add something like ??? Not ideal because of XYZ.. If there's a source for some of the magic numbers, cite it (e.g., [28] in your paper might be one for the tuning thresholds, I guess). Reoptimizing only in a specific, fixed thread is insufficient in the general case. There may be a thread that only runs an initial txn and then never another one; if this happens to be the last thread registered, you'll never get any tuning. If we want the tuning to be effective, one of the threads *actively* running txns needs to tune eventually, always. Depending on that, you'll probably have to change the sync between the tuning thread and others. Serial mode may be a simple way to do that. It may be possible to only check for tuning being necessary when in serial mode. It could be possible that we end up trying HTM too often yet never go to serial mode; however, that seems unlikely to me (but I haven't thought thoroughly about it). Also, please remember that only data-race-free code is strictly correct C++ code (the only exception being the validated-loads special case in the STM implementations, for which C++ doesn't have support yet (but for which proposals exist)). Thus, use atomic operations accordingly. That might create another issue though in that you can't assume 64b atomic ops to be supported on all archs. Maybe using serial mode for tuning doesn't trigger that issue though. I'm wondering about whether it really makes sense to treat XABORT like conflicts and other abort reasons, instead of like capacity aborts. Perhaps we need to differentiate between the explicit abort codes glibc uses, so that we can distinguish between cases where an abort is supposed to signal incompatibility with txnal execution and cases where it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h in current glibc): #define _ABORT_LOCK_BUSY0xff #define _ABORT_LOCK_IS_LOCKED 0xfe #define _ABORT_NESTED_TRYLOCK 0xfd Andi said that it would be nice to have an env var turning tuning on/off. I agree in principle, yet would prefer to have the tuning be the default. What about adding an htm_notune option in parse_default_method? It would be even nicer to
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Are there better options for the utility function, or can we tune it to There is nothing better that isn't a lot slower. -Andi
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
On Mon, May 18, 2015 at 5:29 PM, Torvald Riegel trie...@redhat.com wrote: First of all, sorry for taking so long to review this. Thank you for the contribution. Hello Torvald, thanks for taking the time to look into this! My major concern is about rdtsc being used. The relation to frequency adaption that Andi mentioned is one issue, but the bigger issue to me is that the runtime of transactions might vary a lot, and that the relation of txnal vs. nontxnal code executed in a period might vary a lot too. Once again, I believe that frequency scaling should not be a concern: recent CPUs use a constant rate TSC that matches that of the maximum frequency of the CPU. Are there better options for the utility function, or can we tune it to be less affected by varying txn length and likelihood of txnal vs. nontxnal code? What are the things we want to avoid with the tuning? I can think of: * Not needing to wait for serial txns, or being aborted by a serial txn. * Not retrying using HTM too much so that the retry time is larger than the scalability we actually gain by having other txns commit concurrently. Yes, those are the key points we want to make sure that do not happen. Anything else? Did you consider other utility functions during your research? The txnal vs nontxnal is indeed a completely different story. To account for this we would need extra book-keeping to count only cycles spent inside txnal code. So this would require every thread (or a sample of threads) to perform a rdtsc (or equivalent) on every begin/end call rather than the current approach of a single rdtsc per optimization round. With this type of online optimization we found that the algorithm had to be very simple and cheap to execute. RDTSC was a good finding to fit this, and it enabled us to obtain gains. Other time sources failed to do so. I do not have, out of the box, a good alternative to offer. I suppose it would take some iterations of thinking/experimenting with, just like with any research problem :) Also, note that the mitigation strategy for rdtsc short-comings that you mention in the paper is not applicable in general, specifically not in libitm. I suppose you mean the preemption of threads inflating the cycles measured? This would be similarly a problem to any time source that tries to measure the amount of work performed; not sure how we can avoid it in general. Any thoughts? Another issue is that we need to implement the tuning in a portable way. You currently make it depend on whether the assembler knows about RTM, whereas the rest of the code makes this more portable and defers to arch-specific files. I'd prefer if we could use the tuning on other archs too. But for that, we need to cleanly separate generic from arch-specific parts. That affects magic numbers as well as things like rdtsc(). Yes, I refrained from adding new calls to the arch-specific files, to contain the changes mainly. But that is possible and that's part of the feedback I was hoping to get. Generally, the patch needs to include more comments. Most importantly, we need to document why we implemented things the way we did, or do tuning the way we do. In cases where the intent isn't trivial to see nor the code is trivial, explain the intent or reference the place where you document the big picture. A good rule of thumb is adding comments whenever it's not simple to see why we use one of several possible implementation alternatives. The magic numbers being used are a good example. Make them constants and don't just put them directly in the code, and then add documentation why you chose this number and why it's okay to make that choice. If it isn't necessarily okay (eg, other archs, future systems), but you don't have a better solution right now, add something like ??? Not ideal because of XYZ.. If there's a source for some of the magic numbers, cite it (e.g., [28] in your paper might be one for the tuning thresholds, I guess). Ack, makes perfect sense. Reoptimizing only in a specific, fixed thread is insufficient in the general case. There may be a thread that only runs an initial txn and then never another one; if this happens to be the last thread registered, you'll never get any tuning. If we want the tuning to be effective, one of the threads *actively* running txns needs to tune eventually, always. Depending on that, you'll probably have to change the sync between the tuning thread and others. Serial mode may be a simple way to do that. It may be possible to only check for tuning being necessary when in serial mode. It could be possible that we end up trying HTM too often yet never go to serial mode; however, that seems unlikely to me (but I haven't thought thoroughly about it). Also, please remember that only data-race-free code is strictly correct C++ code (the only exception being the validated-loads special case in the STM implementations,
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
On Mon, 2015-05-18 at 23:39 +0200, Andi Kleen wrote: Are there better options for the utility function, or can we tune it to There is nothing better that isn't a lot slower. Do you care to elaborate why? As-is, I find this statement to not be convincing; at the very least we need to document why we think that something time-based is the best option. Other tuning attempts have looked at rates of aborted, attempted, and committed txns, for example. Why do we measure nb. of transactions in a whole period, and can't get the same info through measuring smaller but more specific time intervals (e.g., how long we wait for a serial txn)?
[PING] Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Ping! Could someone who can approve please review the patch? Thanks, -Andi Nuno Diegues n...@ist.utl.pt writes: Hello, I have taken the chance to improve the patch by addressing the comments above in this thread. Namely: - to use a simple random generator managed inside the library only - removed floating point usage and replaced by fixed arithmetic - added some comments where relevant Re-running the STAMP benchmarks shows similar gains (to those shown above) with respect to an unmodified GCC 5.0.0: benchmark: speedup genome: 1.58 intruder: 1.78 labyrinth: 1.0 ssca2: 1.01 yada: 1.0 kmeans-high: 1.16 kmeans-low: 1.16 vacation-high: 2.05 vacation-low: 2.81 I appreciate any feedback and comments! Index: libitm/libitm_i.h === --- libitm/libitm_i.h (revision 219316) +++ libitm/libitm_i.h (working copy) @@ -242,6 +242,9 @@ struct gtm_thread uint32_t restart_reason[NUM_RESTARTS]; uint32_t restart_total; + // Keeps track of how many transactions were successfully executed. + uint64_t number_executed_txns; + // *** The shared part of gtm_thread starts here. *** // Shared state is on separate cachelines to avoid false sharing with // thread-local parts of gtm_thread. @@ -286,6 +289,8 @@ struct gtm_thread static void *operator new(size_t); static void operator delete(void *); + static void reoptimize_htm_execution(); + // Invoked from assembly language, thus the asm specifier on // the name, avoiding complex name mangling. static uint32_t begin_transaction(uint32_t, const gtm_jmpbuf *) @@ -309,6 +314,59 @@ struct gtm_thread void commit_user_actions (); }; +// Different ways to deal with capacity aborts in HTM execution. +enum gtm_capacity_abort_mode +{ + STUBBORN, + HALVEN, + GIVEUP +}; + +// Definition of fixed point arithmetic types. +// Half the bits are dedicated to the fractional type, and the rest to the +// whole part. +#define FIXED_PT_WIDTH 64 +#define FIXED_PT_INTEGER_WIDTH 32 +typedef uint64_t fixed_pt_t; +typedef __uint128_t fixed_pt_td; + +#define FIXED_PT_FRAC_WIDTH (FIXED_PT_WIDTH - FIXED_PT_INTEGER_WIDTH) +#define FIXED_PT_ONE((fixed_pt_t)((fixed_pt_t)1 FIXED_PT_FRAC_WIDTH)) +#define FIXED_PT_TWO(FIXED_PT_ONE + FIXED_PT_ONE) + +#define fixed_mul(A,B) \ + ((fixed_pt_t)(((fixed_pt_td)(A) * (fixed_pt_td)(B)) FIXED_PT_FRAC_WIDTH)) +#define fixed_div(A,B) \ + ((fixed_pt_t)(((fixed_pt_td)(A) FIXED_PT_FRAC_WIDTH) / (fixed_pt_td)(B))) +#define fixed_const(R) \ + ((fixed_pt_t)((R) * FIXED_PT_ONE + ((R) = 0 ? 0.5 : -0.5))) + +// Maintains the current values optimized for HTM execution and the +// corresponding statistics gathered for the decision-making. +struct gtm_global_optimizer +{ + // Mode chosen to currently deal with capacity aborts. + gtm_capacity_abort_mode optimized_mode; + // Number of attempts chosen to currently insist on HTM execution. + uint32_t optimized_attempts; + + uint64_t last_cycles; + uint64_t last_total_txs_executed; + + fixed_pt_t last_throughput; + uint32_t last_attempts; + + fixed_pt_t best_ever_throughput; + uint32_t best_ever_attempts; + + uint64_t txns_while_stubborn; + uint64_t cycles_while_stubborn; + uint64_t txns_while_halven; + uint64_t cycles_while_halven; + uint64_t txns_while_giveup; + uint64_t cycles_while_giveup; +}; + } // namespace GTM #include tls.h @@ -346,6 +404,9 @@ extern gtm_cacheline_mask gtm_mask_stack(gtm_cache // the name, avoiding complex name mangling. extern uint32_t htm_fastpath __asm__(UPFX gtm_htm_fastpath); +// Maintains the optimization for HTM execution. +extern gtm_global_optimizer optimizer; + } // namespace GTM #endif // LIBITM_I_H Index: libitm/libitm.h === --- libitm/libitm.h (revision 219316) +++ libitm/libitm.h (working copy) @@ -101,7 +101,8 @@ typedef enum /* These are not part of the ABI but used for custom HTM fast paths. See ITM_beginTransaction and gtm_thread::begin_transaction. */ pr_HTMRetryableAbort = 0x80, - pr_HTMRetriedAfterAbort = 0x100 + pr_HTMRetriedAfterAbort = 0x100, + pr_HTMCapacityAbort = 0x200 } _ITM_codeProperties; /* Result from startTransaction that describes what actions to take. Index: libitm/method-serial.cc === --- libitm/method-serial.cc (revision 219316) +++ libitm/method-serial.cc (working copy) @@ -223,7 +223,23 @@ struct htm_mg : public method_group // initially disabled. #ifdef USE_HTM_FASTPATH htm_fastpath = htm_init(); +#ifdef HAVE_AS_RTM +optimizer.optimized_mode = STUBBORN; +optimizer.optimized_attempts = htm_fastpath; +optimizer.last_cycles = rdtsc(); +optimizer.last_total_txs_executed = 0; +
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Today I have received the news that the Copyright Assignment was completed with the FSF. On Thu, Apr 30, 2015 at 8:10 AM, Nuno Diegues n...@ist.utl.pt wrote: Patch looks good to me now. It would be perhaps nice to have an environment variable to turn the adaptive algorithm off for tests, but that's not critical. Yes, that makes perfect sense. It would be also nice to test it on something else, but I understand it's difficult to find other software using the STM syntax. Indeed. I'll try to find some time to work on that, but it may take a while. I can't approve the patch however. I believe it's big enough that you may need a copy right assignment. I have signed a Form Assignment from the Free Software Foundation to deal exactly with those matters for this patch to the libitm. Torvald Riegel had advised me to do so. I have not, however, received any further information; so I'm left wondering if it went through or if it is still hanging. I will ping back to FSF to check that out perhaps? Best regards, -- Nuno Diegues -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Patch looks good to me now. It would be perhaps nice to have an environment variable to turn the adaptive algorithm off for tests, but that's not critical. Yes, that makes perfect sense. It would be also nice to test it on something else, but I understand it's difficult to find other software using the STM syntax. Indeed. I'll try to find some time to work on that, but it may take a while. I can't approve the patch however. I believe it's big enough that you may need a copy right assignment. I have signed a Form Assignment from the Free Software Foundation to deal exactly with those matters for this patch to the libitm. Torvald Riegel had advised me to do so. I have not, however, received any further information; so I'm left wondering if it went through or if it is still hanging. I will ping back to FSF to check that out perhaps? Best regards, -- Nuno Diegues -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Nuno Diegues n...@ist.utl.pt writes: Hello, I have taken the chance to improve the patch by addressing the comments above in this thread. Namely: - to use a simple random generator managed inside the library only - removed floating point usage and replaced by fixed arithmetic - added some comments where relevant Thanks. Patch looks good to me now. It would be perhaps nice to have an environment variable to turn the adaptive algorithm off for tests, but that's not critical. It would be also nice to test it on something else, but I understand it's difficult to find other software using the STM syntax. I can't approve the patch however. I believe it's big enough that you may need a copy right assignment. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Hello, I have taken the chance to improve the patch by addressing the comments above in this thread. Namely: - to use a simple random generator managed inside the library only - removed floating point usage and replaced by fixed arithmetic - added some comments where relevant Re-running the STAMP benchmarks shows similar gains (to those shown above) with respect to an unmodified GCC 5.0.0: benchmark: speedup genome: 1.58 intruder: 1.78 labyrinth: 1.0 ssca2: 1.01 yada: 1.0 kmeans-high: 1.16 kmeans-low: 1.16 vacation-high: 2.05 vacation-low: 2.81 I appreciate any feedback and comments! Index: libitm/libitm_i.h === --- libitm/libitm_i.h (revision 219316) +++ libitm/libitm_i.h (working copy) @@ -242,6 +242,9 @@ struct gtm_thread uint32_t restart_reason[NUM_RESTARTS]; uint32_t restart_total; + // Keeps track of how many transactions were successfully executed. + uint64_t number_executed_txns; + // *** The shared part of gtm_thread starts here. *** // Shared state is on separate cachelines to avoid false sharing with // thread-local parts of gtm_thread. @@ -286,6 +289,8 @@ struct gtm_thread static void *operator new(size_t); static void operator delete(void *); + static void reoptimize_htm_execution(); + // Invoked from assembly language, thus the asm specifier on // the name, avoiding complex name mangling. static uint32_t begin_transaction(uint32_t, const gtm_jmpbuf *) @@ -309,6 +314,59 @@ struct gtm_thread void commit_user_actions (); }; +// Different ways to deal with capacity aborts in HTM execution. +enum gtm_capacity_abort_mode +{ + STUBBORN, + HALVEN, + GIVEUP +}; + +// Definition of fixed point arithmetic types. +// Half the bits are dedicated to the fractional type, and the rest to the +// whole part. +#define FIXED_PT_WIDTH 64 +#define FIXED_PT_INTEGER_WIDTH 32 +typedef uint64_t fixed_pt_t; +typedef __uint128_t fixed_pt_td; + +#define FIXED_PT_FRAC_WIDTH (FIXED_PT_WIDTH - FIXED_PT_INTEGER_WIDTH) +#define FIXED_PT_ONE((fixed_pt_t)((fixed_pt_t)1 FIXED_PT_FRAC_WIDTH)) +#define FIXED_PT_TWO(FIXED_PT_ONE + FIXED_PT_ONE) + +#define fixed_mul(A,B) \ + ((fixed_pt_t)(((fixed_pt_td)(A) * (fixed_pt_td)(B)) FIXED_PT_FRAC_WIDTH)) +#define fixed_div(A,B) \ + ((fixed_pt_t)(((fixed_pt_td)(A) FIXED_PT_FRAC_WIDTH) / (fixed_pt_td)(B))) +#define fixed_const(R) \ + ((fixed_pt_t)((R) * FIXED_PT_ONE + ((R) = 0 ? 0.5 : -0.5))) + +// Maintains the current values optimized for HTM execution and the +// corresponding statistics gathered for the decision-making. +struct gtm_global_optimizer +{ + // Mode chosen to currently deal with capacity aborts. + gtm_capacity_abort_mode optimized_mode; + // Number of attempts chosen to currently insist on HTM execution. + uint32_t optimized_attempts; + + uint64_t last_cycles; + uint64_t last_total_txs_executed; + + fixed_pt_t last_throughput; + uint32_t last_attempts; + + fixed_pt_t best_ever_throughput; + uint32_t best_ever_attempts; + + uint64_t txns_while_stubborn; + uint64_t cycles_while_stubborn; + uint64_t txns_while_halven; + uint64_t cycles_while_halven; + uint64_t txns_while_giveup; + uint64_t cycles_while_giveup; +}; + } // namespace GTM #include tls.h @@ -346,6 +404,9 @@ extern gtm_cacheline_mask gtm_mask_stack(gtm_cache // the name, avoiding complex name mangling. extern uint32_t htm_fastpath __asm__(UPFX gtm_htm_fastpath); +// Maintains the optimization for HTM execution. +extern gtm_global_optimizer optimizer; + } // namespace GTM #endif // LIBITM_I_H Index: libitm/libitm.h === --- libitm/libitm.h (revision 219316) +++ libitm/libitm.h (working copy) @@ -101,7 +101,8 @@ typedef enum /* These are not part of the ABI but used for custom HTM fast paths. See ITM_beginTransaction and gtm_thread::begin_transaction. */ pr_HTMRetryableAbort = 0x80, - pr_HTMRetriedAfterAbort = 0x100 + pr_HTMRetriedAfterAbort = 0x100, + pr_HTMCapacityAbort = 0x200 } _ITM_codeProperties; /* Result from startTransaction that describes what actions to take. Index: libitm/method-serial.cc === --- libitm/method-serial.cc (revision 219316) +++ libitm/method-serial.cc (working copy) @@ -223,7 +223,23 @@ struct htm_mg : public method_group // initially disabled. #ifdef USE_HTM_FASTPATH htm_fastpath = htm_init(); +#ifdef HAVE_AS_RTM +optimizer.optimized_mode = STUBBORN; +optimizer.optimized_attempts = htm_fastpath; +optimizer.last_cycles = rdtsc(); +optimizer.last_total_txs_executed = 0; +optimizer.last_throughput = fixed_const(0.0001); +optimizer.last_attempts = htm_fastpath 0 ? htm_fastpath - 1 : 1; +optimizer.best_ever_throughput = optimizer.last_throughput; +optimizer.best_ever_attempts = htm_fastpath; +
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
On Wed, Apr 8, 2015 at 6:54 PM, Andi Kleen a...@firstfloor.org wrote: On the STAMP suite of benchmarks for transactional memory (described here [1]). I have ran an unmodified GCC 5.0.0 against the patched GCC with these modifications and obtain the following speedups in STAMP with 4 threads (on a Haswell with 4 cores, average 10 runs): I expect you'll need different tunings on larger systems. I did not quite understand the extent of your comment: what specifically would need different tuning? The idea is exactly that this proposal does not have any attachment to the workload/deployment; there are some parameters (aka, the magic numbers we discussed) but they are quite reasonable, i.e., each one of them has a sensible value with some meaning we understand. That is a good point. While I haven't ever used fixed point arithmetic, a cursory inspection reveals that it does make sense and seems applicable to this case. Are you aware of some place where this is being done already within GCC that I could use as inspiration, or should I craft some macros from scratch for this? I believe the inliner uses fixed point. Own macros should be fine too. Thanks, will try this out. + int32_t last_attempts = optimizer.last_attempts; + int32_t current_attempts = optimizer.optimized_attempts; + int32_t new_attempts = current_attempts; + if (unlikely(change_for_worse 1.40)) +{ + optimizer.optimized_attempts = optimizer.best_ever_attempts; + optimizer.last_throughput = current_throughput; + optimizer.last_attempts = current_attempts; + return; +} + + if (unlikely(random() % 100 1)) +{ So where is the seed for that random stored? Could you corrupt some user's random state? Is the state per thread or global? If it's per thread how do you initialize so that they threads do start with different seeds. If it's global what synchronizes it? As I do not specify any seed, I was under the impression that there would be a default initialization. Furthermore, the posix documentation specifies random() to be MT-safe, so I assumed its internal state to be per-thread. Did I mis-interpret this? Yes, that's right. But it's very nasty to change the users RNG state. A common pattern for repeatable benchmarks is to start with srand(1) and then use the random numbers to run the benchmark, so it always does the same thing. If you non deterministically (transaction aborts are not deterministic) change the random state it will make the benchmark not repeatable anymore. You'll need to use an own RNG state that it independent. I understand your concern, thanks for raising it. One general question on how to proceed: given that I make some further changes, should I post the whole patch again? Best regards, -- Nuno Diegues It would be good to see if any parts of the algorithm can be simplified. In general in production software the goal is to have the simplest algorithm that does the job. -Andi -- a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Nuno Diegues n...@ist.utl.pt writes: What workloads did you test this on? +static inline float fastLog(float x) +{ + union { float f; uint32_t i; } vx = { x }; + float y = vx.i; + y *= 8.2629582881927490e-8f; + return y - 87.989971088f; +} + +static inline float fastSqrt(float x) +{ + union + { +int i; +float x; + } u; + + u.x = x; + u.i = (129) + (u.i 1) - (122); + return u.x; +} Are you sure you need floating point here? If the program does not use it in any other ways faulting in the floating point state can be quite expensive. I bet fixed point would work for such simple purposes too. + serial_lock.read_unlock(tx); + + // Obtain the delta performance with respect to the last period. + uint64_t current_cycles = rdtsc(); + uint64_t cycles_used = current_cycles - optimizer.last_cycles; It may be worth pointing out that rdtsc does not return cycles. In fact the ratio to real cycles is variable depending on the changing frequency. I hope your algorithms can handle that. + + // Compute gradient descent for the number of retries. + double change_for_better = current_throughput / optimizer.last_throughput; + double change_for_worse = optimizer.last_throughput / current_throughput; + int32_t last_attempts = optimizer.last_attempts; + int32_t current_attempts = optimizer.optimized_attempts; + int32_t new_attempts = current_attempts; + if (unlikely(change_for_worse 1.40)) +{ + optimizer.optimized_attempts = optimizer.best_ever_attempts; + optimizer.last_throughput = current_throughput; + optimizer.last_attempts = current_attempts; + return; +} + + if (unlikely(random() % 100 1)) +{ So where is the seed for that random stored? Could you corrupt some user's random state? Is the state per thread or global? If it's per thread how do you initialize so that they threads do start with different seeds. If it's global what synchronizes it? Overall the algorithm looks very complicated with many mysterious magic numbers. Are there simplifications possible? While the retry path is not extremely critical it should be at least somewhat optimized, otherwise it will dominate the cost of short transactions. One problems with so many magic numbers is that they may be good on one system, but bad on another. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
Thank you for the feedback. Comments inline. On Wed, Apr 8, 2015 at 3:05 PM, Andi Kleen a...@firstfloor.org wrote: Nuno Diegues n...@ist.utl.pt writes: What workloads did you test this on? On the STAMP suite of benchmarks for transactional memory (described here [1]). I have ran an unmodified GCC 5.0.0 against the patched GCC with these modifications and obtain the following speedups in STAMP with 4 threads (on a Haswell with 4 cores, average 10 runs): benchmarks: speedup genome: 1.32 intruder: 1.66 labyrinth: 1.00 ssca2: 1.02 yada: 1.00 kmeans-high: 1.13 kmeans-low: 1.10 vacation-high: 2.27 vacation-low: 1.88 [1] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, Kunle Olukotun: STAMP: Stanford Transactional Applications for Multi-Processing. IISWC 2008: 35-46 Are you sure you need floating point here? If the program does not use it in any other ways faulting in the floating point state can be quite expensive. I bet fixed point would work for such simple purposes too. That is a good point. While I haven't ever used fixed point arithmetic, a cursory inspection reveals that it does make sense and seems applicable to this case. Are you aware of some place where this is being done already within GCC that I could use as inspiration, or should I craft some macros from scratch for this? + serial_lock.read_unlock(tx); + + // Obtain the delta performance with respect to the last period. + uint64_t current_cycles = rdtsc(); + uint64_t cycles_used = current_cycles - optimizer.last_cycles; It may be worth pointing out that rdtsc does not return cycles. In fact the ratio to real cycles is variable depending on the changing frequency. I hope your algorithms can handle that. The intent here is to obtain some notion of time passed with a low cost. RDTSC seemed to be the best choice around: it is not critical that the frequency of the processor may change the relativity of the returned value with respect to actual cpu cycles. + + // Compute gradient descent for the number of retries. + double change_for_better = current_throughput / optimizer.last_throughput; + double change_for_worse = optimizer.last_throughput / current_throughput; + int32_t last_attempts = optimizer.last_attempts; + int32_t current_attempts = optimizer.optimized_attempts; + int32_t new_attempts = current_attempts; + if (unlikely(change_for_worse 1.40)) +{ + optimizer.optimized_attempts = optimizer.best_ever_attempts; + optimizer.last_throughput = current_throughput; + optimizer.last_attempts = current_attempts; + return; +} + + if (unlikely(random() % 100 1)) +{ So where is the seed for that random stored? Could you corrupt some user's random state? Is the state per thread or global? If it's per thread how do you initialize so that they threads do start with different seeds. If it's global what synchronizes it? As I do not specify any seed, I was under the impression that there would be a default initialization. Furthermore, the posix documentation specifies random() to be MT-safe, so I assumed its internal state to be per-thread. Did I mis-interpret this? With regard to the self-tuning state, it is kept within the gtm_global_optimizer optimizer struct, which is in essence multi-reader (any thread running transactions can check the struct to use the parameters optimized in it) and single-writer (notice that the void GTM::gtm_thread::reoptimize_htm_execution() function is called only by one thread, the one at the end of the list of threads, i.e., whose tx-next_thread == NULL). Overall the algorithm looks very complicated with many mysterious magic numbers. Are there simplifications possible? While the retry path is not extremely critical it should be at least somewhat optimized, otherwise it will dominate the cost of short transactions. One problems with so many magic numbers is that they may be good on one system, but bad on another. Notice that the retry path is barely changed in the common case: only a designated thread (the last one in the list of threads registered in libitm) will periodically execute the re-optimization. Hence, most of the patch that you can see here is code execute in that (uncommon case). I understand the concern with magic numbers: we could self-tune them as well, but that would surely increase the complexity of the patch :) In essence, we have the following numbers at the moment: * how often we re-optimize (every 500 successful transactions for the designated thread) * how many maximum attempts we can have in hardware (20) * how much better and worse the performance must change for the gradient descent to move to a new configuration (5%) * how terrible the performance must change for the gradient descent to rollback to the best known configuration so far (40%) * how often the gradient descent can explore randomly to avoid local optima (1%) Once again, thank you for your time
Re: [PATCH] add self-tuning to x86 hardware fast path in libitm
On the STAMP suite of benchmarks for transactional memory (described here [1]). I have ran an unmodified GCC 5.0.0 against the patched GCC with these modifications and obtain the following speedups in STAMP with 4 threads (on a Haswell with 4 cores, average 10 runs): I expect you'll need different tunings on larger systems. That is a good point. While I haven't ever used fixed point arithmetic, a cursory inspection reveals that it does make sense and seems applicable to this case. Are you aware of some place where this is being done already within GCC that I could use as inspiration, or should I craft some macros from scratch for this? I believe the inliner uses fixed point. Own macros should be fine too. + int32_t last_attempts = optimizer.last_attempts; + int32_t current_attempts = optimizer.optimized_attempts; + int32_t new_attempts = current_attempts; + if (unlikely(change_for_worse 1.40)) +{ + optimizer.optimized_attempts = optimizer.best_ever_attempts; + optimizer.last_throughput = current_throughput; + optimizer.last_attempts = current_attempts; + return; +} + + if (unlikely(random() % 100 1)) +{ So where is the seed for that random stored? Could you corrupt some user's random state? Is the state per thread or global? If it's per thread how do you initialize so that they threads do start with different seeds. If it's global what synchronizes it? As I do not specify any seed, I was under the impression that there would be a default initialization. Furthermore, the posix documentation specifies random() to be MT-safe, so I assumed its internal state to be per-thread. Did I mis-interpret this? Yes, that's right. But it's very nasty to change the users RNG state. A common pattern for repeatable benchmarks is to start with srand(1) and then use the random numbers to run the benchmark, so it always does the same thing. If you non deterministically (transaction aborts are not deterministic) change the random state it will make the benchmark not repeatable anymore. You'll need to use an own RNG state that it independent. It would be good to see if any parts of the algorithm can be simplified. In general in production software the goal is to have the simplest algorithm that does the job. -Andi -- a...@linux.intel.com -- Speaking for myself only.
[PATCH] add self-tuning to x86 hardware fast path in libitm
Hi, the libitm package contains a fast path for x86 to use Intel Restricted Transactional Memory (RTM) when available. This Hardware Transactional Memory (HTM) requires a software-based fallback to execute the atomic blocks when the hardware fails. This may happen because the transaction is too large, for instance. In fact, the performance of this HTM (and others that are equivalently best-effort) depends significantly on that interplay between the hardware and the software fallback. Wide experimentation showed that there does not seem to exist a single configuration for that fallback that can perform best independently of the application and workload [2]. Hence, in the scope of my work I have developed a self-tuning approach that exploits lightweight reinforcement learning techniques to identify the optimal RTM configuration in a workload-oblivious manner, i.e., not requiring any off-line sampling of the application. The implementation in this patch follows closely the following work: [1] Self-Tuning Intel Transactional Synchronization Extensions, Nuno Diegues and Paolo Romano, Proceedings of the International Conference on Autonomic Computing, ICAC 2014 http://homepages.gsd.inesc-id.pt/~ndiegues/files/icac14-ndiegues.pdf [2] Virtues and Limitations of Commodity Hardware Transactional Memory, Nuno Diegues, Paolo Romano, and Luis Rodrigues, Proceedings of the International Conference on Parallel Architectures and Compiler Techniques, PACT 2014 The copyright assignment for this is still in progress. Also, please bear in mind that this is my first (ever) attempt to contribute to GCC. As such, any suggestion (as simple as it may seem) will be most welcome. Output of: svn diff --extensions -u Index: libitm/method-serial.cc === --- libitm/method-serial.cc (revision 219316) +++ libitm/method-serial.cc (working copy) @@ -223,7 +223,23 @@ struct htm_mg : public method_group // initially disabled. #ifdef USE_HTM_FASTPATH htm_fastpath = htm_init(); +#ifdef HAVE_AS_RTM +optimizer.optimized_mode = STUBBORN; +optimizer.optimized_attempts = htm_fastpath; +optimizer.last_cycles = rdtsc(); +optimizer.last_total_txs_executed = 0; +optimizer.last_throughput = 0.0; +optimizer.last_attempts = htm_fastpath 0 ? htm_fastpath - 1 : 1; +optimizer.best_ever_throughput = 0.0; +optimizer.best_ever_attempts = htm_fastpath; +optimizer.txns_while_stubborn = 1; +optimizer.cycles_while_stubborn = 100; +optimizer.txns_while_halven = 1; +optimizer.cycles_while_halven = 100; +optimizer.txns_while_giveup = 1; +optimizer.cycles_while_giveup = 100; #endif +#endif } virtual void fini() { Index: libitm/config/x86/sjlj.S === --- libitm/config/x86/sjlj.S (revision 219316) +++ libitm/config/x86/sjlj.S (working copy) @@ -59,12 +59,14 @@ #define pr_hasNoAbort 0x08 #define pr_HTMRetryableAbort 0x80 #define pr_HTMRetriedAfterAbort 0x100 +#define pr_HTMCapacityAbort 0x200 #define a_runInstrumentedCode 0x01 #define a_runUninstrumentedCode 0x02 #define a_tryHTMFastPath 0x20 #define _XABORT_EXPLICIT (1 0) #define _XABORT_RETRY (1 1) +#define _XABORT_CAPACITY (1 3) .text @@ -108,9 +110,12 @@ SYM(_ITM_beginTransaction): .Ltxn_abort: /* If it might make sense to retry the HTM fast path, let the C++ code decide. */ - testl $(_XABORT_RETRY|_XABORT_EXPLICIT), %eax + testl $(_XABORT_RETRY|_XABORT_EXPLICIT|_XABORT_CAPACITY), %eax jz .Lno_htm orl $pr_HTMRetryableAbort, %edi + testl $(_XABORT_CAPACITY), %eax + jz .Lno_htm + orl $pr_HTMCapacityAbort, %edi /* Let the C++ code handle the retry policy. */ .Lno_htm: #endif Index: libitm/config/x86/target.h === --- libitm/config/x86/target.h (revision 219316) +++ libitm/config/x86/target.h (working copy) @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret) return begin_ret _XABORT_RETRY; } +static inline bool +htm_is_capacity_abort(uint32_t begin_ret) +{ + return begin_ret _XABORT_CAPACITY; +} + /* Returns true iff a hardware transaction is currently being executed. */ static inline bool htm_transaction_active () { return _xtest() != 0; } + +static inline uint64_t rdtsc() +{ +uint32_t hi, lo; +__asm__ __volatile__ (rdtsc : =a(lo), =d(hi)); +return ( (unsigned long long)lo)|( ((unsigned long long)hi)32 ); +} #endif Index: libitm/libitm_i.h === --- libitm/libitm_i.h (revision 219316) +++ libitm/libitm_i.h (working copy) @@ -242,6 +242,9 @@ struct gtm_thread uint32_t restart_reason[NUM_RESTARTS]; uint32_t restart_total; + // Keeps track of how many transactions were successfully executed. + uint64_t number_executed_txns; + // *** The shared part of gtm_thread starts here.