Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Mon, Apr 19, 2021 at 3:15 PM Borislav Petkov wrote:

> All of a sudden you have *every* thread sporting a fat 8K buffer because
> the library decided to use a fat buffer feature for it.
>
> Nope, I don't want that to happen.

For this to happen, every thread would not only have to include/link-with
code that uses AMX, but that code would have to *run*.

I'm sure that the AI guys are super excited about matrix multiplication,
but I have a hard time imagining why grep(1) would find a use for it.

Indeed, if anyone expected AMX to be used by every task, we would have
never gone to the trouble of inventing the XFD hardware to support the
kernel's lazy 8KB buffer allocation.

cheers,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Mon, Apr 19, 2021 at 10:15 AM Borislav Petkov wrote:

> > Tasks are created without an 8KB AMX buffer.
> > Tasks have to actually touch the AMX TILE registers for us to allocate
> > one for them.
>
> When tasks do that it doesn't matter too much - for the library it does!
>
> If the library does that by default and the processes which comprise
> of that pipe I mentioned earlier, get all 8K buffers because the
> underlying library decided so and swinging those buffers around when
> saving/restoring contexts turns out to be a performance penalty, then we
> have lost.
>
> Lost because if that thing goes upstream in this way of use of AMX is
> allowed implicitly, there ain't fixing it anymore once it becomes an
> ABI.
>
> So, that library should ask the kernel whether it supports AMX and only
> use it if has gotten a positive answer.

Right, the library *does* ask the kernel whether it supports AMX (below).

> And by default that answer
> should be "no" because the majority of processes - that same pipe I keep
> mentioning - don't need it.

Indeed, the default is "no" because most libraries will *not* ask the
system for AMX support (below). However, if they *did* probe for it, and
they *did* use it, the kernel would not stand in the way of any of those
requests.

> I have no good idea yet how granulary that should be - per process, per
> thread, whatever, but there should be a way for the kernel to control
> whether the library uses AMX, AVX512 or whatever fat state is out there
> available.
>
> Then, if a process wants the library to use AMX on its behalf, then it
> can say so and the library can do that but only after having asked for
> explicitly.

The ABI works like this:

0. App or library author decides AMX is useful at build-time.
1. App checks CPUID for AMX CPU feature.
2. App checks XCR0 for AMX OS support.

(If the app touches AMX without these two being TRUE, it will suffer the
consequence of a #UD when it touches an AMX instruction.)

This ABI is how AVX works today.
What is new with AMX is the ability of the hardware and the OS to delay
the allocation of the context switch buffer until if/when it is actually
needed. This is transparent, and thus not part of the ABI, unless you
count the absence of a mandated system call to be an ABI.

3. The application then touches an AMX register, triggering...
4. #NM, handled by the kernel, which allocates a context switch buffer
   for that task, and dis-arms XFD.

Yes, we could invent a new system call and mandate that it be called
between #2 and #3. However, we'd still do #4 in response, so I don't see
any value in that system call. Indeed, I would advocate that glibc
replace it with a return statement.

So back to the example:

| grep | awk | sed ...

Sure, if grep grows support for some AI feature that we haven't imagined
yet, then something in its code flow is fully empowered to probe for AMX
and use AMX on AMX hardware. Sort of hard to imagine with the programs
above that we know today, but future programs certainly could do this if
they chose to.

thanks,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Fri, Apr 16, 2021 at 6:14 PM Andy Lutomirski wrote:

> My point is that every ...

I encourage you to continue to question everything and trust nobody.
While it may cost you a lot in counseling, it is certainly valuable, at
least to me! :-)

I do request, however, that feedback stay specific, stay technical, and
stay on-topic. We all have plenty of real challenges we can be tackling
with our limited time.

> Is there any authoritative guidance at all on what actually happens,
> performance-wise, when someone does AMX math?

Obviously, I can't speak to the performance of AMX itself pre-production,
and somebody who does that for a living will release stuff on or before
release day.

What I've told you about the performance side-effects on the system (and
lack thereof) from running AMX code is an authoritative answer, and is as
much as I can tell you today. If I failed to answer a question about AMX,
my apologies, please re-ask it. And if we learn something new between now
and release day that is relevant to this discussion, I will certainly
request to share it.

Our team (Intel Open Source Technology Center) advocated getting the
existing public AMX documentation published as early as possible.
However, if you are really into the details of how AMX works, you may
also be interested to know that the AMX hardware patent filings are
fully public ;-)

cheers,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
> I get it. That does not explain why LDMXCSR and VLDMXCSR cause
> pipeline stalls.

Sorry, I thought this thread was about AMX. I don't know the answer to
your LDMXCSR and VLDMXCSR question.
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Thu, Apr 15, 2021 at 1:47 AM Borislav Petkov wrote:

> What I'd like to see is 0-overhead for current use cases and only
> overhead for those who want to use it. If that can't be done
> automagically, then users should request it explicitly. So basically you
> blow up the xsave buffer only for processes which want to do AMX.

Indeed, expanding the xsave buffer happens only for tasks that touch AMX
TILE registers.

> And this brings the question about libraries which, if they start using
> AMX by default - which doesn't sound like they will want to because AMX
> reportedly will have only a limited? set of users - if libraries start
> using it by default, then it better be worth the handling of the 8kb
> buffer per process.

I'm not aware of any intent to transparently use AMX for bcopy, like
what happened with AVX-512. (didn't they undo that mistake?)

> If not, this should also be requestable per process so that a simple
> pipe in Linux:
>
> | grep | awk | sed ...
>
> and so on is not penalized to allocate and handle by default 8kb for
> *each* process' buffer in that pipe just because each is linking against
> glibc which has detected AMX support in CPUID and is using it too for
> some weird reason like some microbenchmark saying so.

Tasks are created without an 8KB AMX buffer. Tasks have to actually
touch the AMX TILE registers for us to allocate one for them.

> But my initial question was on the "establishing" part and was asking
> where we have established anything wrt AMX.

The patch set on LKML establishes working AMX Linux support in public. I
am thankful for your and other public review and feedback on that
series. I can think of 3 actual bugs that were found in the process.

thanks,
Len Brown
Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Thu, Apr 15, 2021 at 12:24 PM Andy Lutomirski wrote:
> On Wed, Apr 14, 2021 at 2:48 PM Len Brown wrote:
>
> > ... the transition penalty into and out of AMX code

The concept of a 'transition' exists between AVX and SSE instructions
because it is possible to mix both instruction sets and touch different
parts of the same registers. The "unused" parts of those registers need
to be tracked to assure that data is not lost when mixing. This concept
is moot with AMX, which has its own dedicated registers.

> What is the actual impact of a trivial function that initializes the
> tile config, does one tiny math op, and then does TILERELEASE?

1. Task takes #NM on first touch of TILE registers.
2. Kernel allocates 8KB for that task and dis-arms XFD.
3. Kernel context switches XFD with task state.

If the task takes a signal *before* TILERELEASE:
4. XSAVE transfers AMX state to the signal stack; XRSTOR, the reverse.

If the task context switches *before* TILERELEASE:
5. The kernel context switch XSAVES the AMX state to the 8KB context
   switch buffer; XRSTORS, the reverse.

If the task takes a signal *after* TILERELEASE:
4. XSAVE does NOT transfer AMX state (or zeros) to the signal stack; 8KB
   is consumed on the signal stack but not touched. XRSTOR, the reverse.

If the task context switches *after* TILERELEASE:
5. The kernel context switch ignores INIT=1 AMX state. The 8KB buffer is
   quiescent.

As we discussed previously, there is no impact to frequency from either
INIT=0 or INIT=1 AMX state. Frequency is impacted by *compute*, and
since there isn't any compute in this scenario, there is no frequency
impact.

As we discussed previously, for INIT=1 (which the kernel guarantees),
there is also no impact on power.

thanks,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Wed, Apr 14, 2021 at 5:58 AM Borislav Petkov wrote:
> On Tue, Apr 13, 2021 at 03:51:50PM -0400, Len Brown wrote:
>
> > AMX does the type of matrix multiplication that AI algorithms use. In
> > the unlikely event that you or one of the libraries you call are doing
> > the same, then you will be very happy with AMX. Otherwise, you'll
> > probably not use it.
>
> Which sounds to me like AMX is something which should not be enabled
> automatically but explicitly requested. I don't see the majority of the
> processes on the majority of the Linux machines out there doing AI with
> AMX - at least not anytime soon. If it becomes ubiquitous later, we can
> make it automatic then.

I'm pretty sure that the "it isn't my use case of interest, so it
doesn't matter" line of reasoning has long been established as
-EINVAL ;-)
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
The alternate signal stack size ABI was ill conceived, and overlooked --
for decades. We all recognize that there are a non-zero number of
applications that fail on AVX-512 hardware because of this issue. We all
recognize that, if not addressed, AMX would increase the likelihood of
an application failing due to the too-small-alternative-signal-stack
issue.

I thank the ARM community for taking action on this issue and setting
the example of the ALT-VEC solution to compute and expose the required
stack size at run-time. I further thank HJ Lu and the libc team for
picking up that ball and shipping the solution to this problem with an
updated ABI in glibc 2.34.

I acknowledge that it is not impossible to fail after that fix -- you
can ignore the ABI, or you could hardcode sizes, or you could be
statically linked to the old libc. But it gets increasingly harder to
fail, the kernel signal series has a new run-time check to prevent data
corruption that could have happened in the past, and the remedy is clear
-- re-build with the new glibc. Yes, it would have been good if this
were done before AVX-512 deployed.

3. Can we do better in the long term?

Assuming the ABI update in #2 addresses the issue with applications that
declare their own alt-sig-stack, we have a backward compatible solution
today.

Somewhere in the past, the decision was made that all architectural
state should be exposed to signal handlers. And a decision was made on
x86 that it should be in uncompacted XSTATE format. There are programs
today that count on both of these things being true, and if we change
either of those, we break applications.

But the irony is that there are a vanishingly small number of signal
handlers that actually care at all about that state, and it seems
wasteful to give it to them. So the question is whether to continue
giving them all that information, or to give them a way to decline --
to give the option for signals to be more lightweight. We can certainly
do this.
The question is if it is important enough to bother. What applications
would notice if signal handlers were faster? Would those applications be
willing to update to opt in to a new, incompatible signal handling ABI,
where the kernel took the time to supply only the state that they
request?

thanks,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Tue, Apr 13, 2021 at 4:16 PM Andy Lutomirski wrote:
>
> On Mon, Apr 12, 2021 at 4:46 PM Len Brown wrote:
> >
> > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski wrote:
> >
> > > AMX: Multiplying a 4x4 matrix probably looks *great* in a
> > > microbenchmark. Do it once and you permanently allocate 8kB (is that
> > > even a constant? can it grow in newer parts?), potentially hurts all
> > > future context switches, and does who-knows-what to Turbo licenses
> > > and such.
> >
> > Intel expects that AMX will be extremely valuable to key workloads.
> > It is true that you may never run that kind of workload on the machine
> > in front of you, and so you have every right to be doubtful about the
> > value of AMX.
>
> I fully believe that AMX will be amazing when used for the right
> workload. The problem is that a library may have no way to tell
> whether a workload is the type of computationally intensive workload
> for which it makes sense. Imagine you have a little function:
>
> int matrix_times_vector(int dim, float *out, const float *matrix,
>                         const float *vector);
>
> A clever library might use AMX for this. If dim == 4 and the caller
> is planning to call it in a long, tight loop, maybe this even makes
> sense. If dim == 4 and it's being called once, AMX is probably a
> losing proposition. With previous technologies, at least the impact
> was limited to the function itself and maybe once per call to the
> caller. But now, with AMX, the program that invoked this takes a
> performance and memory hit *forever* if it uses AMX once.

Again...

As this is a "clever" library, built with a clever toolchain,
TILERELEASE was properly issued at the end of computation. Thus the
hardware knows that the (volatile) AMX registers are no longer live.
The XSAVE hardware recognizes this INIT=1 condition and transfers NO
DATA "*forever*".
This is true both on context switch (compacted), where it is automatic,
and on (uncompacted) signal delivery, where we check for this case.

Was that the "performance hit" of concern, or did I miss something?

Yes, it is true that the kernel allocated a context switch buffer for
the lifetime of that task, and it will not be freed until that task
exits. If this proves to be an issue, there is nothing preventing us
from implementing a re-claim scheme for a rarely used buffer. After
recognizing this situation, we'd simply arm XFD, free the buffer, and
from then onwards, the task behaves exactly as if it had never touched
AMX. However, nobody has yet suggested that this would be a common
situation worth an optimization to reclaim that task's 8KB.

> Beyond that, we have the signal handling issue.

I'm unaware of any unresolved feedback on the signal handling series
other than a wistful "wouldn't a new SIGFAIL be more clear (for future
apps) than the existing SIGSEGV?" I agree with this sentiment, but I
don't think we should hold up a patch that prevents corrupting user
data just because a new signal number to describe the scenario doesn't
exist. Particularly since the new code that knows about the new SIGFAIL
will also be new code that has been compiled with the new glibc, which
for most cases will prevent this scenario in the first place...

> One solution, going off of what Willy mentioned, is:
>
> bool amx_begin(void *signal_save_buffer);
> void amx_end();
>
> In the amx_begin() region, if you get a signal, the AMX state is saved
> in the buffer. Outside the region, if you get a signal and AMX is in
> use, the kernel will either unceremoniously kill the task or will
> deliver SIGYOUBLEWIT. [0]

I think it is clear that if a new signal ABI is going to be invented, it
should be opt-in on state, so that it can run fast on machines far into
the future by not choosing to opt in on anything.
It isn't clear that changing the signal save state around critical
regions (in multiple threads), so that a single (per-process) definition
of a signal handler gets a different result at different times, is going
to make that (new) signal handler author especially happy. More likely
they either always want the state, or they do not.

> I'm really hoping some userspace people can chime in.

Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
Thanks for sharing your perspective, Willy.

I agree that if your application is so sensitive that you need to
hand-code your own memcmp, then linking with (any) new version of (any)
dynamic library is a risk you must consider carefully.

AMX does the type of matrix multiplication that AI algorithms use. In
the unlikely event that you or one of the libraries you call are doing
the same, then you will be very happy with AMX. Otherwise, you'll
probably not use it.

I acknowledge the issue with the toolchain transparently using AVX-512
for copying data, and how that approach impacted systems with a poor
AVX-512 hardware implementation. FWIW, I'm not aware of any plans to
implicitly use AMX this way, and I'm not aware of any non-Xeon AMX
implementations in the near future.

cheers,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Mon, Apr 12, 2021 at 8:17 PM Thomas Gleixner wrote:

> I'm fine with that as long as the kernel has a way to detect that and
> can kill the offending application/library combo with an explicit
> -SIG_NICE_TRY.

Agreed.

The new run-time check for altsigstack overflow is one place we can do
that. Let me know if you think of others.

thanks,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski wrote:

> AMX: Multiplying a 4x4 matrix probably looks *great* in a
> microbenchmark. Do it once and you permanently allocate 8kB (is that
> even a constant? can it grow in newer parts?), potentially hurts all
> future context switches, and does who-knows-what to Turbo licenses and
> such.

Intel expects that AMX will be extremely valuable to key workloads. It
is true that you may never run that kind of workload on the machine in
front of you, and so you have every right to be doubtful about the value
of AMX.

The AMX architectural state size is not expected to change. Rather, if a
"new AMX" has a different state size, it is expected to use a new
feature bit, different from AMX.

The AMX context switch buffer is allocated only if and when a task
touches AMX registers. Yes, there will be data transfer to and from that
buffer when three things all happen:

1. the data is valid,
2. hardware interrupts the application, and
3. the kernel decides to context switch.

There will be no data transfer of AMX architectural state when it is in
the INIT state. As AMX registers are volatile, correct software will
always have them in the INIT state before calls, including system calls.

I've addressed turbo licenses already.

> Even putting aside all kernel and ABI issues, is it actually a good
> idea for user libraries to transparently use these new features? I'm
> not really convinced. I think that serious discussion among userspace
> people is needed.

At the risk of stating the obvious... Intel's view is that libraries
that deliver the most value from the hardware are a "good thing", and
that anything preventing libraries from getting the most value from the
hardware is a "bad thing" :-)

cheers,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Fri, Apr 9, 2021 at 5:44 PM Andy Lutomirski wrote:
>
> On Fri, Apr 9, 2021 at 1:53 PM Len Brown wrote:
> >
> > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski wrote:
> > >
> > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown wrote:
> > >
> > > > We added compiler annotation for user-level interrupt handlers.
> > > > I'm not aware of it failing, or otherwise being confused.
> > >
> > > I followed your link and found nothing. Can you elaborate? In the
> > > kernel, we have noinstr, and gcc gives approximately no help toward
> > > catching problems.
> >
> > A search for the word "interrupt" on this page
> > https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
> > comes to the description of this attribute:
> >
> > __attribute__ ((interrupt))
>
> I read that and I see no mention of anything saying "this will
> generate code that does not touch extended state". Instead I see,
> paraphrasing, "this will generate code with an ABI that is completely
> inappropriate for use in a user space signal handler". Am I missing
> something?

Again...

An analogous annotation could be created for fast signals. gcc can be
told exactly what registers and instructions it can use for that
routine.

If somebody can suggest a way to make fast signal handlers faster than
saving only the state that they themselves actually use, I'm all ears.

> > > > dynamic XCR0 breaks the installed base, I thought we had
> > > > established that.
> > >
> > > I don't think this is at all established. If some code thinks it
> > > knows the uncompacted XSTATE size and XCR0 changes, it crashes. This
> > > is not necessarily a showstopper.
> >
> > My working assumption is that crashing applications actually *is* a
> > showstopper. Please clarify.
>
> I think you're presuming that some program actually does this. If no
> program does this, it's not an ABI break.
So you agree that a program that uses xgetbv to read XCR0 and compute
the XSTATE size for user-space use of XSAVE can break if XCR0 changes
during its lifetime. But you don't believe such software exists?

> More relevantly, this can only happen in a process that uses XSAVE and
> thinks it knows the size that *also* does the prctl to change XCR0.
> By construction, existing programs can't break unless they load new
> dynamic libraries that break them.

Let's say that a program does math. It calls a library to do that math.
It doesn't know or care what instructions the library uses to do math --
e.g. the library uses SSE on an Atom, and uses AVX512 on a Xeon.

Who calls the new prctl, the program, or the library? If it is the
program, how does it know which instructions the library wants to use?
If it is the library, then you have just changed XCR0 at run-time, and
you expose breakage of the thread library that has computed the XSAVE
size.

> > > > We've also established that when running in a VMM, every update
> > > > to XCR0 causes a VMEXIT.
> > >
> > > This is true, it sucks, and Intel could fix it going forward.
> >
> > What hardware fix do you suggest?
> > If a guest is permitted to set XCR0 bits without notifying the VMM,
> > what happens when it sets bits that the VMM doesn't know about?
> > The VM could have a mask of allowed XCR0 bits that don't exist.
>
> TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on
> TDX. Surely plain VMX could fix it too.

There are two cases:

1. Hardware that exists today and in the foreseeable future.

VM modification of XCR0 results in a VMEXIT to the VMM. The VMM sees
bits set by the guest, and so it can accept what it supports, or send
the VM a fault for non-support. Here it is not possible for the guest to
change XCR0 without the VMM knowing.

2. Future hardware that allows guests to write XCR0 w/o VMEXIT.

Not sure I follow your proposal. Yes, the VM effectively has a mask of
what is supported, because it can issue CPUID.
The VMM virtualizes CPUID, and needs to know it must not expose to the
VM any state features it doesn't support. Also, the VMM needs to audit
XCR0 before it uses XSAVE, else the guest could attack or crash the VMM
through a buffer overrun. Is this what you suggest?

If yes, what do you suggest in the years between now and when that
future hardware and VMM exist?

> > > > I thought the goal was to allow new programs to have fast signal
> > > > handlers. By default, those fast signal handlers would have a
> > > > stable state image, and would not inherit large architectural
> > > > state on their stacks.
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Wed, Mar 31, 2021 at 6:54 PM Borislav Petkov wrote:
> On Wed, Mar 31, 2021 at 06:28:27PM -0400, Len Brown wrote:
>
> > dynamic XCR0 breaks the installed base, I thought we had established
> > that.
>
> We should do a clear cut and have legacy stuff which has its legacy
> expectations on the XSTATE layout and not touch those at all.
>
> And then all new apps which will use these new APIs can go and request
> whatever fancy new state constellations we support. Including how they
> want their signals handled, etc.
>
> Fat states like avx512, amx etc will be off by default and apps
> explicitly requesting those, can get them.
>
> That's it.

100% agreement from me! (does anybody disagree?)

thanks,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski wrote:
>
> On Wed, Mar 31, 2021 at 3:28 PM Len Brown wrote:
>
> > We added compiler annotation for user-level interrupt handlers.
> > I'm not aware of it failing, or otherwise being confused.
>
> I followed your link and found nothing. Can you elaborate? In the
> kernel, we have noinstr, and gcc gives approximately no help toward
> catching problems.

A search for the word "interrupt" on this page
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
comes to the description of this attribute:

__attribute__ ((interrupt))

> > dynamic XCR0 breaks the installed base, I thought we had established
> > that.
>
> I don't think this is at all established. If some code thinks it
> knows the uncompacted XSTATE size and XCR0 changes, it crashes. This
> is not necessarily a showstopper.

My working assumption is that crashing applications actually *is* a
showstopper. Please clarify.

> > We've also established that when running in a VMM, every update to
> > XCR0 causes a VMEXIT.
>
> This is true, it sucks, and Intel could fix it going forward.

What hardware fix do you suggest? If a guest is permitted to set XCR0
bits without notifying the VMM, what happens when it sets bits that the
VMM doesn't know about?

> > I thought the goal was to allow new programs to have fast signal
> > handlers. By default, those fast signal handlers would have a stable
> > state image, and would not inherit large architectural state on their
> > stacks, and could thus have minimal overhead on all hardware.
>
> That is *a* goal, but not necessarily the only goal.

I fully support coming up with a scheme for fast future-proof signal
handlers, and I'm willing to back that up by putting work into it. I
don't see any other goals articulated in this thread.

thanks,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Wed, Mar 31, 2021 at 12:53 PM Andy Lutomirski wrote:

> But this whole annotation thing will require serious compiler support.
> We already have problems with compilers inlining functions and getting
> confused about attributes.

We added compiler annotation for user-level interrupt handlers. I'm not
aware of it failing, or otherwise being confused. Why would compiler
support for fast-signals be any more "serious"?

> An API like:
>
> if (get_amx()) {
>     use AMX;
> } else {
>     don't;
> }
>
> avoids this problem. And making XCR0 dynamic, for all its faults, at
> least helps force a degree of discipline on user code.

Dynamic XCR0 breaks the installed base, I thought we had established
that. We've also established that when running in a VMM, every update to
XCR0 causes a VMEXIT.

I thought the goal was to allow new programs to have fast signal
handlers. By default, those fast signal handlers would have a stable
state image, and would not inherit large architectural state on their
stacks, and could thus have minimal overhead on all hardware.
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Wed, Mar 31, 2021 at 5:42 PM Robert O'Callahan wrote:

> For the record, the benefits of dynamic XCR0 for rr recording
> portability still apply. I guess it'd be useful for CRIU too. We would
> also benefit from anything that incentivizes increased support for
> CPUID faulting.

As previously mentioned, today we don't have an architectural way to
trap a user into the kernel on CPUID, even though we can do this for a
VMM.

But spoofing CPUID isn't a solution to all problems. The feature really
needs to be OFF to prevent users from using it, even if the supported
mechanisms of discovering that feature say "NOT PRESENT". Today there
are plenty of users who will opportunistically try everything in the
cloud and choose the machine that allows them to do something that other
machines will not -- even if it is not officially supported. If
something is not enumerated, it really needs to also be turned off.

cheers,
--Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Tue, Mar 30, 2021 at 6:01 PM David Laight wrote:

> > Can we leave it in live registers? That would be the speed-of-light
> > signal handler approach. But we'd need to teach the signal handler to
> > not clobber it. Perhaps that could be part of the contract that a
> > fast signal handler signs? INIT=0 AMX state could simply sit
> > patiently in the AMX registers for the duration of the signal
> > handler. You can't get any faster than doing nothing :-)
> >
> > Of course part of the contract for the fast signal handler is that it
> > knows that it can't possibly use XRSTOR of the stuff on the stack to
> > necessarily get back to the state of the signaled thread (assuming we
> > even used XSTATE format on the fast signal handler stack, it would
> > forget the contents of the AMX registers, in this example)
>
> gcc will just use the AVX registers for 'normal' code within
> the signal handler.
> So it has to have its own copy of all the registers.
> (Well, maybe you could make the AMX instructions fault,
> but that would need a nested signal delivered.)

This is true, by default, but it doesn't have to be true.

Today, gcc has an annotation for user-level interrupts:
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes

An analogous annotation could be created for fast signals. gcc can be
told exactly what registers and instructions it can use for that
routine. Of course, this begs the question about what routines that
handler calls, and those would need to be constrained too. Today,
signal-safety(7) advises programmers to limit what legacy signal
handlers can call. There is no reason that a fast-signal-safety(7) could
not be created for the fast path.

> There is also the register save buffer that you need in order
> to long-jump out of a signal handler.
> Unfortunately that is required to work.
> I'm pretty sure the original setjmp/longjmp just saved the stack
> pointer - but that really doesn't work any more.
>
> OTOH most signal handlers don't care - but there isn't a flag
> to sigset() (etc) to ask for a specific register layout.

Right, the idea is to optimize for *most* signal handlers, since making
any changes to *all* signal handlers is intractable.

So the idea is that opting in to a fast signal handler would opt out of
some legacy signal capabilities. Complete state is one of them, and thus
long-jump is not supported, because the complete state may not
automatically be available.

thanks,
Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Tue, Mar 30, 2021 at 4:20 PM Andy Lutomirski wrote: > > > > On Mar 30, 2021, at 12:12 PM, Dave Hansen wrote: > > > > On 3/30/21 10:56 AM, Len Brown wrote: > >> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski > >> wrote: > >>>> On Mar 30, 2021, at 10:01 AM, Len Brown wrote: > >>>> Is it required (by the "ABI") that a user program has everything > >>>> on the stack for user-space XSAVE/XRESTOR to get back > >>>> to the state of the program just before receiving the signal? > >>> The current Linux signal frame format has XSTATE in uncompacted format, > >>> so everything has to be there. > >>> Maybe we could have an opt in new signal frame format, but the details > >>> would need to be worked out. > >>> > >>> It is certainly the case that a signal should be able to be delivered, > >>> run “async-signal-safe” code, > >>> and return, without corrupting register contents. > >> And so an an acknowledgement: > >> > >> We can't change the legacy signal stack format without breaking > >> existing programs. The legacy is uncompressed XSTATE. It is a > >> complete set of architectural state -- everything necessary to > >> XRESTOR. Further, the sigreturn flow allows the signal handler to > >> *change* any of that state, so that it becomes active upon return from > >> signal. > > > > One nit with this: XRSTOR itself can work with the compacted format or > > uncompacted format. Unlike the XSAVE/XSAVEC side where compaction is > > explicit from the instruction itself, XRSTOR changes its behavior by > > reading XCOMP_BV. There's no XRSTORC. > > > > The issue with using the compacted format is when legacy software in the > > signal handler needs to go access the state. *That* is what can't > > handle a change in the XSAVE buffer format (either optimized/XSAVEOPT, > > or compacted/XSAVEC). > > The compacted format isn’t compact enough anyway. If we want to keep AMX and > AVX512 enabled in XCR0 then we need to further muck with the format to omit > the not-in-use features. 
I *think* we can pull this off in a way that still > does the right thing wrt XRSTOR. Agreed. Compacted format doesn't save any space when INIT=0, so it is only a half-step forward. > If we go this route, I think we want a way for sigreturn to understand a > pointer to the state instead of inline state to allow programs to change the > state. Or maybe just to have a way to ask sigreturn to skip the restore > entirely. The legacy approach puts all architectural state on the signal stack in XSTATE format. If we make the signal stack smaller with a new fast-signal scheme, we need to find another place for that state to live. It can't live in the task context switch buffer. If we put it there and then take an interrupt while running the signal handler, then we'd overwrite the signaled thread's state with the signal handler's state. Can we leave it in live registers? That would be the speed-of-light signal handler approach. But we'd need to teach the signal handler to not clobber it. Perhaps that could be part of the contract that a fast signal handler signs? INIT=0 AMX state could simply sit patiently in the AMX registers for the duration of the signal handler. You can't get any faster than doing nothing :-) Of course part of the contract for the fast signal handler is that it knows that it can't possibly use XRESTOR of the stuff on the stack to necessarily get back to the state of the signaled thread (assuming we even used XSTATE format on the fast signal handler stack, it would forget the contents of the AMX registers, in this example) Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski wrote: > > On Mar 30, 2021, at 10:01 AM, Len Brown wrote: > > Is it required (by the "ABI") that a user program has everything > > on the stack for user-space XSAVE/XRESTOR to get back > > to the state of the program just before receiving the signal? > > The current Linux signal frame format has XSTATE in uncompacted format, > so everything has to be there. > Maybe we could have an opt in new signal frame format, but the details would > need to be worked out. > > It is certainly the case that a signal should be able to be delivered, run > “async-signal-safe” code, > and return, without corrupting register contents. And so an acknowledgement: We can't change the legacy signal stack format without breaking existing programs. The legacy is uncompressed XSTATE. It is a complete set of architectural state -- everything necessary to XRESTOR. Further, the sigreturn flow allows the signal handler to *change* any of that state, so that it becomes active upon return from signal. And a proposal: Future programs, which know that they don't need the full-blown legacy signal stack format, can opt in to a new format. That new format can be minimal (fast) by default. Perhaps, as Noah suggests, it could have some sort of mechanism where the program can explicitly select which state components it wants included on its signal stack, and restored by sigreturn. If the new fast-signal format is successful, in a number of years, it will have spread to take over the world. thoughts? Len Brown, Intel Open Source Technology Center
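One way the component-selection opt-in proposed above could look — purely hypothetical, no such uapi exists — is a per-program bitmask of XSTATE components the program wants preserved across signals, with the frame carrying only the intersection of that mask and the task's in-use state. A minimal sketch of the selection logic, with invented names:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Hypothetical opt-in record: which XSTATE components the program
 * wants on its signal stack and restored by sigreturn.  Bit numbers
 * follow the XCR0 layout (bit 2 = AVX, bit 18 = AMX tile data). */
struct fastsig_opt_in {
    uint64_t want_mask;
};

/* Kernel-side sketch: the frame carries only what the program asked
 * for AND what the task actually has in a non-INIT state, so an
 * AVX-only handler never pays for an 8KB AMX region. */
static uint64_t frame_components(const struct fastsig_opt_in *opt,
                                 uint64_t task_in_use_mask)
{
    return opt->want_mask & task_in_use_mask;
}

static bool frame_has_amx(uint64_t components)
{
    return components & (1ull << 18);
}
```

Under this sketch, a program that never opts in to AMX state would get a minimal (fast) frame by default, matching the proposal's intent.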
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
Andy, I agree, completely, with your description of the challenge, thank you for focusing the discussion on that problem statement. Question: Is it required (by the "ABI") that a user program has everything on the stack for user-space XSAVE/XRESTOR to get back to the state of the program just before receiving the signal? My understanding is that there are programs that do this. However, if it is not guaranteed to work, that could greatly simplify what we are required to put on the signal stack. thanks, -Len
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
On Tue, Mar 30, 2021 at 4:28 AM Thomas Gleixner wrote: > > Len, > > On Mon, Mar 29 2021 at 18:16, Len Brown wrote: > > On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner wrote: > > Let me know if this problem description is fair: > > > > Many-core Xeon servers will support AMX, and when I run an AMX application > > on one, when I take an interrupt with AMX INIT=0, Linux may go idle on my > > CPU. > > If Linux cpuidle requests C6, the hardware will demote to C1E. > > > > The concern is that a core in C1E will negatively impact power of > > self, or performance > > of a neighboring core. > > > > This is what we are talking about, right? > > Correct. > > > I'm delighted that there are Xeon customers, who care about this power > > savings. > > Unfortunately, they are the exception, not the rule. > > That maybe true or not. The point is that there is some side effect and > from a correctness point of view it needs to be addressed. I don't see how demoting C6 to C1E is a "correctness" issue. There is nothing "incorrect" about demoting to C1E when software permits C6, and as I mentioned, this happens all the time. All architectural state, including the AMX state, is preserved by hardware. I do agree that this scenario may result in less-than-optimal power savings. It isn't clear how often that situation might occur. > >>- Use TILERELEASE on context switch after XSAVES? > > > > Yes, that would be perfectly reasonable. > > > >>- Any other mechanism on context switch > > > > XRESTOR of a context with INIT=1 would also do it. > > > >>- Clear XFD[18] when going idle and issue TILERELEASE depending > >> on the last state > > > > I think you mean to *set* XFD. > > When the task touched AMX, he took a #NM, and we cleared XFD for that task. > > So when we get here, XFD is already clear (unarmed). > > Nevertheless, the setting of XFD is moot here. > > No. We set XFD when the task which used AMX schedules out. 
If the CPU > reaches idle without going back to user space then XFD is still set and > AMX INIT still 0. So my assumption was that in order to issue > TILERELEASE before going idle, you need to clear XFD[18] first, but I > just saw in the spec that it is not necessary. Right, XFD setting is moot here. > >>- Use any other means to set the thing back into INIT=1 state when > >> going idle > > > > TILERELEASE and XRESTOR are the tools in the toolbox, if necessary. > > > >> There is no option 'shrug and ignore' unfortunately. > > > > I'm not going to say it is impossible that this path will matter. > > If some terrible things go wrong with the hardware, and the hardware > > is configured and used in a very specific way, yes, this could matter. > > So then let me summarize for the bare metal case: > >1) The paragraph in the specification is unclear and needs to be > rephrased. > > What I understood from your explanations so far: > > When AMX is disabled by clearing XCR0[18:17], by clearing > CR4.OSXSAVE, or by setting IA32_XFD[18], then there are no > negative side effects due to AMX INIT=0 as long as the CPU is > executing. Right. > The only possible side effect is when the CPU goes idle because > AMX INIT=0 limits C states. Right. >2) As a consequence of #1 there is no further action required on > context switch when XFD[18] is set. I agree. >3) When the CPU goes idle with AMX INIT=0 then the idle code should > invoke TILERELEASE. Maybe the condition is not even necessary and > TILERELEASE can be invoked unconditionally before trying to enter > idle. > > If that's correct, then this should be part of the next series. If you insist, then that is fine with me. Personally, I would prefer to be able to measure an actual benefit before adding code, but this one is small, and so I'm not religious about it. > > In the grand scheme of things, this is a pretty small issue, say, > > compared to the API discussion. > > No argument about that. 
> > Thanks, > > tglx -- Len Brown, Intel Open Source Technology Center
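The idle-path rule agreed above — when the CPU goes idle with AMX INIT=0, issue TILERELEASE first (possibly even unconditionally, since it is cheap) — can be sketched as a small simulation. The names `amx_init`, `tilerelease`, and `enter_idle` are stand-ins for illustration, not kernel interfaces:

```c
#include <stdbool.h>
#include <assert.h>

/* Simulated AMX architectural state: true means INIT=1 (tiles clear). */
static bool amx_init = false;          /* a task left tiles live */
static int  tilerelease_count = 0;

static void tilerelease(void)          /* stand-in for the TILERELEASE insn */
{
    amx_init = true;
    tilerelease_count++;
}

/*
 * Idle entry per the summary above: put AMX back into INIT=1 before
 * requesting a deep C-state, so C6 is not demoted to C1E.  Returns
 * true when the C6 request is not inhibited by AMX state.
 */
static bool enter_idle(void)
{
    if (!amx_init)
        tilerelease();
    return amx_init;
}
```

As noted in the thread, the `if` guard could be dropped and TILERELEASE issued unconditionally; checking may well cost more than the instruction itself.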
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski wrote: > > > > On Mar 29, 2021, at 8:47 AM, Len Brown wrote: > > > > On Sat, Mar 27, 2021 at 5:58 AM Greg KH wrote: > >>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote: > >>> Hi Andy, > >>> Say a mainline links with a math library that uses AMX without the > >>> knowledge of the mainline. > > > > sorry for the confusion. > > > > mainline = main(). > > > > ie. the part of the program written by you, and not the library you linked > > with. > > > > In particular, the library may use instructions that main() doesn't know > > exist. > > If we pretend for a bit that AMX were a separate device instead of a part of > the CPU, this would be a no brainer: something would be responsible for > opening a device node or otherwise requesting access to the device. > > Real AMX isn’t so different. Programs acquire access either by syscall or by > a fault, they use it, and (hopefully) they release it again using > TILERELEASE. The only thing special about it is that, supposedly, acquiring > and releasing access (at least after the first time) is quite fast. But > holding access is *not* free — despite all your assertions to the contrary, > the kernel *will* correctly context switch it to avoid blowing up power > consumption, and this will have overhead. > > We’ve seen the pattern of programs thinking that, just because something is a > CPU insn, it’s free and no thought is needed before using it. This happened > with AVX and AVX512, and it will happen again with AMX. We *still* have a > significant performance regression in the kernel due to screwing up the AVX > state machine, and the only way I know about any of the details is that I > wrote silly test programs to try to reverse engineer the nonsensical behavior > of the CPUs. > > I might believe that Intel has figured out how to make a well behaved XSTATE > feature after Intel demonstrates at least once that it’s possible. 
That > means full documentation of all the weird issues, no new special cases, and > the feature actually making sense in the context of XSTATE. This has not > happened. Let’s list all of them: > > - SSE. Look for all the MXCSR special cases in the pseudocode and tell me > with a straight face that this one works sensibly. > > - AVX. Also has special cases in the pseudocode. And has transition issues > that are still problems and still not fully documented. L > > - AVX2. Horrible undocumented performance issues. Otherwise maybe okay? > > - MPX: maybe the best example, but the compat mode part got flubbed and it’s > MPX. > > - PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was > a major mistake, now unfixable, that seriously limits the usefulness of the > whole feature. I suppose Intel could release PKRU2 with a better ISA and > deprecate the original PKRU, but I’m not holding my breath.) > > - AVX512: Yet more uarch-dependent horrible performance issues, and Intel has > still not responded about documentation. The web is full of people > speculating differently about when, exactly, using AVX512 breaks performance. > This is NAKked in kernel until docs arrive. Also, it broke old user programs. > If we had noticed a few years ago, AVX512 enablement would have been > reverted. > > - AMX: This mess. > > The current system of automatic user enablement does not work. We need > something better. Hi Andy, Can you provide a concise definition of the exact problemI(s) this thread is attempting to address? Thank ahead-of-time for excluding "blow up power consumption", since that paranoia is not grounded in fact. thanks, -Len
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner wrote: > According to documentation it is irrelevant whether AMX usage is > disabled via XCR0, CR4.OSXSAVE or XFD[18]. In any case the effect of > AMX INIT=0 will prevent C6. > > As I explained in great length there are enough ways to get into a > situation where this can happen and a CPU goes idle with AMX INIT=0. > > So what are we supposed to do? Let me know if this problem description is fair: Many-core Xeon servers will support AMX, and when I run an AMX application on one and take an interrupt with AMX INIT=0, Linux may go idle on my CPU. If Linux cpuidle requests C6, the hardware will demote to C1E. The concern is that a core in C1E will negatively impact its own power, or the performance of a neighboring core. This is what we are talking about, right? First, I should mention that if I threw a dart at a map of Xeons deployed across the universe, the chances are "significant" that I'd hit one that is configured with C6 disabled, and this discussion would be moot. Second, I should mention that Linux cpuidle demotes from deep C-states to shallow ones all day long. This is typically due to expected timer expiration, and other heuristics. Third, I should mention that the processor itself demotes from C6 to C1E for a number of reasons -- basically like what Linux is doing, but in HW. Albeit, the hardware does have the capability to "un-demote" when it demotes and recognizes it made a mistake, and that "un-demote" capability would not be present if the reason for demotion was AMX INIT=0. Okay, that said, let's assume we have found a system where this problem could happen, and we use it in a way that makes it happen. Would we notice? If your system were profoundly idle, and one or more cores were in C1E, then it would prevent the SOC from entering Package C6 (if enabled). Yes, there is a measurable idle power difference between Package C1E and Package C6. (indeed, this is why Package C6 exists). 
I'm delighted that there are Xeon customers who care about this power savings. Unfortunately, they are the exception, not the rule. If you were to provoke this scenario on many cores simultaneously, then I expect you could detect a power difference between C1E and CC6. However, that difference would be smaller than the difference in power due to the frequency choice of the running cores, because it is basically just the L2-leakage vs L2-off difference. Regarding frequency credits for a core being in C1E vs C6: yes, this is factored into the frequency credits for turbo mode. How much impact, I can't say, because that information is not yet available. However, this is mitigated by the fact that Xeon single core turbo is deployed differently than client. Xeons are deployed more with multi-core turbo in mind, and so how much you'll notice C1E vs C6 may not be significant, unless perhaps it happened on all the cores across the system. >- Use TILERELEASE on context switch after XSAVES? Yes, that would be perfectly reasonable. >- Any other mechanism on context switch XRESTOR of a context with INIT=1 would also do it. >- Clear XFD[18] when going idle and issue TILERELEASE depending > on the last state I think you mean to *set* XFD. When the task touched AMX, it took a #NM, and we cleared XFD for that task. So when we get here, XFD is already clear (unarmed). Nevertheless, the setting of XFD is moot here. >- Use any other means to set the thing back into INIT=1 state when > going idle TILERELEASE and XRESTOR are the tools in the toolbox, if necessary. > There is no option 'shrug and ignore' unfortunately. I'm not going to say it is impossible that this path will matter. If some terrible things go wrong with the hardware, and the hardware is configured and used in a very specific way, yes, this could matter. In the grand scheme of things, this is a pretty small issue, say, compared to the API discussion. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
On Mon, Mar 29, 2021 at 1:43 PM Andy Lutomirski wrote: > > *switching* XCR0 on context switch is slow, but perfectly legal. > > How slow is it? And how slow is switching XFD? XFD is definitely > serializing? XCR0 writes in a VM guest cause a VMEXIT. XFD writes in a VM guest do not. I will find out what the relative cost is on bare metal, where VMEXIT is not an issue. -- Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
> In particular, the library may use instructions that main() doesn't know > exist. And so I'll ask my question another way. How is it okay to change the value of XCR0 during the run time of a program? I submit that it is not, and that is a deal-killer for a request/release API. e.g. main() doesn't know that the math library wants to use AMX, and neither does the threading library. So main() doesn't know to call the API before either library is invoked. The threading library starts up and creates user-space threads based on the initial value from XCR0. Then the math library calls the API, which adds bits to XCR0, and then the user-space context switch in the threading library corrupts data because the new XCR0 size doesn't match the initial size. -Len
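The corruption scenario above can be made concrete with a toy size calculation. Real code derives component sizes and offsets from CPUID leaf 0xD after reading XCR0 with xgetbv; the numbers below are illustrative stand-ins (though XTILEDATA really is 8KB), and the function is a sketch, not the architectural algorithm:

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Illustrative XCR0 bits (these numbers match the XCR0 layout). */
#define XFEATURE_MASK_SSE      (1ull << 1)
#define XFEATURE_MASK_AVX      (1ull << 2)
#define XFEATURE_MASK_TILEDATA (1ull << 18)   /* AMX tile data */

/*
 * Toy XSAVE-buffer size for a given XCR0 value.  x87/SSE live in the
 * 512-byte legacy area; the AVX size here is a rounded stand-in,
 * while the 8192-byte tile-data size is real.
 */
static size_t xsave_size(uint64_t xcr0)
{
    size_t size = 512 + 64;                  /* legacy area + XSAVE header */
    if (xcr0 & XFEATURE_MASK_AVX)      size += 256;
    if (xcr0 & XFEATURE_MASK_TILEDATA) size += 8192;
    return size;
}
```

A threading library that caches `xsave_size()` at startup, then has bit 18 appear in XCR0 mid-run, would save 8KB more state than its per-thread buffers can hold — exactly the corruption described above.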
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
On Mon, Mar 29, 2021 at 11:43 AM Len Brown wrote: > > On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner wrote: > > > > I found the author of this passage, and he agreed to revise it to say this > > > was targeted primarily at VMMs. > > > > Why would this only a problem for VMMs? > > VMMs may have to emulate different hardware for different guest OS's, > and they would likely "context switch" XCR0 to achieve that. > > As switching XCR0 at run-time would confuse the heck out of user-space, > it was not imagined that a bare-metal OS would do that. to clarify... *switching* XCR0 on context switch is slow, but perfectly legal. *changing* XCR0 during the lifetime of a process, in any of its tasks, on any of its CPUs, will confuse any software that uses xgetbv/XCR0 to calculate the size of XSAVE buffers for userspace threading. -- Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
On Sat, Mar 27, 2021 at 5:58 AM Greg KH wrote: > > On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote: > > Hi Andy, > > > > Say a mainline links with a math library that uses AMX without the > > knowledge of the mainline. sorry for the confusion. mainline = main(). ie. the part of the program written by you, and not the library you linked with. In particular, the library may use instructions that main() doesn't know exist. -- Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner wrote: > > I found the author of this passage, and he agreed to revise it to say this > > was targeted primarily at VMMs. > > Why would this only a problem for VMMs? VMMs may have to emulate different hardware for different guest OS's, and they would likely "context switch" XCR0 to achieve that. As switching XCR0 at run-time would confuse the heck out of user-space, it was not imagined that a bare-metal OS would do that. But yes, if a bare metal OS doesn't support any threading libraries that query XCR0 with xgetbv, and they don't care about the performance impact of switching XCR0, they could choose to switch XCR0 and would want to TILERELEASE to assure C6 access, if it is enabled. > > "negative power and performance implications" refers to the fact that > > the processor will not enter C6 when AMX INIT=0, instead it will demote > > to the next shallower C-state, eg C1E. > > > > (this is because the C6 flow doesn't save the AMX registers) > > > > For customers that have C6 enabled, the inability of a core to enter C6 > > may impact the maximum turbo frequency of other cores. > > That's the same on bare metal, right? Yes, the hardware works exactly the same way. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
On Sat, Mar 27, 2021 at 6:20 PM Thomas Gleixner wrote: > What's the actual downside of issuing TILERELEASE conditionally > depending on prev->AMX INIT=0? Is it slow or what's the real > problem here? TILERELEASE is fast, so there should be no down-side to executing it. Indeed, checking whether you need to execute it or not will probably take longer than executing TILERELEASE. My point (perhaps academic) is that Linux should not have to know about TILERELEASE, or execute it. Re: running in the kernel with AMX INIT=0: AMX INIT=0 will prevent C6 on that core. I don't expect to see this in the syscall path, though if a user wanted to neglect to issue TILERELEASE, there is nothing forcing them to issue it. It can certainly happen on the interrupt path, but on the interrupt path I don't know if we can end up requesting C6 -- perhaps on a forced task migration? Re: frequency credits in the kernel with AMX INIT=0: It works exactly the same way as AMX INIT=1. That is to say, the frequency credits don't key off of AMX INIT, they key off of the actual use of the AMX execution unit, and the credits free up several orders of magnitude faster (both for AVX-512 and AMX) on this hardware than in previous generations. As a result, if we interrupt an AMX program, and run for an extended period of time in the kernel without XRESTOR to clear out its AMX INIT=0 state, that will not have any impact on the frequency we run inside the kernel any more than if it had AMX INIT=1 state. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
On Sat, Mar 20, 2021 at 6:14 PM Thomas Gleixner wrote: > > On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote: > > + > > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */ > > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next) > > +{ > > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled()) > > + return; > > + > > + if (unlikely(prev->state_mask != next->state_mask)) > > + xdisable_setbits(xfirstuse_not_detected(next)); > > +} > > So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD > when it does not match. The spec document says: > > "System software may disable use of Intel AMX by clearing XCR0[18:17], by >clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that >system software initialize AMX state (e.g., by executing TILERELEASE) >before doing so. This is because maintaining AMX state in a >non-initialized state may have negative power and performance >implications." > > I'm not seeing anything related to this. Is this a recommendation > which can be ignored or is that going to be duct taped into the code > base once the first user complains about slowdowns of their non AMX > workloads on that machine? I found the author of this passage, and he agreed to revise it to say this was targeted primarily at VMMs. "negative power and performance implications" refers to the fact that the processor will not enter C6 when AMX INIT=0, instead it will demote to the next shallower C-state, eg C1E. (this is because the C6 flow doesn't save the AMX registers) For customers that have C6 enabled, the inability of a core to enter C6 may impact the maximum turbo frequency of other cores. thanks, -Len Brown Intel Open Source Technology Center
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE > > System software may disable use of Intel AMX by clearing XCR0[18:17], > by clearing CR4.OSXSAVE, or by setting > IA32_XFD[18]. It is recommended that system software initialize AMX > state (e.g., by executing TILERELEASE) > before doing so. This is because maintaining AMX state in a > non-initialized state may have negative power and > performance implications. I agree that the wording here about disabling AMX is ominous. The hardware initializes with AMX disabled. The kernel probes AMX, enables it in XCR0, and keeps it enabled. Initially, XFD is "armed" for all tasks. When a task accesses AMX state, #NM fires, we allocate a context switch buffer, and we "disarm" XFD for that task. As we have that buffer in-hand for the lifetime of the task, we never "arm" XFD for that task again. XFD is context switched, and so the next time it is set is when we are restoring some other task's state. n.b. I'm describing the Linux flow. The VMM scenario is a little different. > Since you reviewed the patch set, I assume you are familiar with how > Linux manages XSTATE. Linux does *not* eagerly load XSTATE on context > switch. Instead, Linux loads XSTATE when the kernel needs it loaded > or before executing user code. This means that the kernel can (and > does, and it's a performance win) execute kernel thread code and/or go > idle, *including long-term deep idle*, with user XSTATE loaded. Yes, this scenario is clear. There are several cases. 1. Since TMM registers are volatile, a routine using TMM that wants them to persist across a call must save them, and will TILERELEASE before invoking that call. That is the calling convention, and I expect that if it is not followed, debugging (of tools) will occur until it is. The only way for a user program's XSTATE to be present during the kernel's call to idle is if it sleeps via a system call when no other task wants to run on that CPU. 
Since system calls are calls, in this case, AMX INIT=1 during idle. With all deep C-states enabled, the idle CPU is able to contribute its maximum turbo budget to its peers. 2. A correct program with live TMM registers takes an interrupt, and we enter the kernel with AMX INIT=0. Yes, we will enter the syscall at the frequency of the app (like we always do). Yes, turbo frequency may be limited by the activity of this processor and its peers (like it always is). 2a. If we return to the same program, then depending on how long the syscall runs, we may execute the program and the system call code at a frequency lower than we might if AMX INIT=1 at time of interrupt. 2b. If we context switch to a task that has AMX INIT=1, then any AMX-imposed limits on turbo are immediately gone. Note for 2b: 4 generations have passed since SKX had significant delay releasing AVX-512 credits. The delay in the first hardware that supports AMX should be negligible for both AVX-512 and AMX. 3. A buggy or purposely bogus program is fully empowered to violate the programming conventions. Say such a program called a long sleep, and nothing else wanted to run on that CPU, so the kernel went idle with AMX INIT=0. Indeed, this could prevent the core from entering the deepest available C-state, which could impact the turbo budget of neighboring cores. However, if that were some kind of DOS, it would be simpler and more effective to simply hog a CPU by running code. Also, as soon as another thread switches in with INIT=1, there is no concept of AMX frequency caps. (see note for 2b) I do not see a situation where the kernel needs to issue TILERELEASE (though a VMM likely would). What did I miss? thanks, Len Brown, Intel Open Source Technology Center ps. I will respond to your ABI thoughts on your new ABI thread.
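The lazy-allocation flow described in this message (arm XFD at task start, take #NM on the first tile touch, allocate the 8KB buffer, disarm XFD for the task's lifetime) can be sketched as a small state machine. The struct and function names are illustrative stand-ins, not the actual kernel symbols:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <assert.h>

struct task_amx {
    bool  xfd_armed;     /* XFD[18] set: a tile touch will raise #NM */
    void *tile_buf;      /* 8KB XTILEDATA save area, NULL until first use */
};

static void task_init(struct task_amx *t)
{
    t->xfd_armed = true;     /* every task starts armed, with no buffer */
    t->tile_buf  = NULL;
}

/*
 * #NM handler sketch: the first tile touch allocates the context
 * switch buffer and disarms XFD for this task -- permanently, since
 * the buffer stays in-hand for the task's lifetime.
 */
static bool handle_nm(struct task_amx *t)
{
    if (!t->xfd_armed)
        return false;               /* not an XFD-induced #NM */
    t->tile_buf = calloc(1, 8192);
    if (!t->tile_buf)
        return false;               /* real kernel would deliver a signal */
    t->xfd_armed = false;
    return true;
}
```

This mirrors why only tasks that actually touch TILE registers ever pay for the 8KB buffer: a task that never faults keeps `tile_buf == NULL` forever.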
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
On Fri, Mar 26, 2021 at 2:17 PM Borislav Petkov wrote: > > On Fri, Mar 26, 2021 at 01:53:47PM -0400, Len Brown wrote: > > At Dave's suggestion, we had a 64 *KB* sanity check on this path. > > Boris forced us to remove it, because we could not tell him > > how we chose the number 64. > > The only 64 I can remember is > > #define XSTATE_BUFFER_MAX_BYTES (64 * 1024) > > What does an arbitrary number have to do with signal handling and > pushing a fat frame on the sigaltstack? You are right. If that is where the check was, it was the wrong place. It should be part of that sanity check code where CPUID is parsed. thanks, Len Brown, Intel Open Source Technology Center
Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
Hi Andy, Say a mainline links with a math library that uses AMX without the knowledge of the mainline. Say the mainline is also linked with a userspace threading library that thinks it has a concept of XSAVE area size. Wouldn't the change in XCR0, resulting in XSAVE size change, risk confusing the threading library? thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
On Fri, Mar 26, 2021 at 11:48 AM Andy Lutomirski wrote: > > I submit, that after the generic XFD support is in place, > > there is exactly 1 bit that needs to be flipped to enable > > user applications to benefit from AMX. > > The TILERELEASE opcode itself is rather longer than one bit, and the > supporting code to invoke it at the right time, to avoid corrupting > user state, and avoid causing performance regressions merely by > existing will be orders of magnitude more than 1 bit. Of course, all > of this is zero bits in the current series because the code is > missing entirely. Please explain why the kernel must know about the TILERELEASE instruction in order for an AMX application to run properly. > This isn't just about validation. There's also ABI, performance, and > correctness. Thank you for agreeing that this is not about unvalidated features. > ABI: The AVX-512 enablement *already* broke user ABI. Sadly no one > told anyone in the kernel community until about 5 years after the > fact, and it's a bit late to revert AVX-512. But we don't want to > enable AMX until the ABI has a reasonable chance of being settled. > Ditto for future features. As it stands, if you xstate.enable some > 16MB feature, the system may well simply fail to boot as too many user > processes explode. At Dave's suggestion, we had a 64 *KB* sanity check on this path. Boris forced us to remove it, because we could not tell him how we chose the number 64. I would be delighted to see a check for 64 KB restored, and that it be a rejection, rather than a warning. At this point, as there is no way to go down that path without manually modifying the kernel, it would devolve into a sanity check for a hardware (CPUID) bug. > Performance: > > We *still* don't know the performance implications of leaving the AMX > features in use inappropriately. Does it completely destroy idle? No. > Will it literally operate CPUs out of spec such that Intel's > reliability estimates will be invalidated? No. 
> (We had that with NVMe APST. Let's not repeat this with XSTATE.) I acknowledge that the possibility of broken hardware always exists. However, I don't see how the experience with broken NVMe actually applies here, other than general paranoia about new features (which is, arguably, healthy). > The performance impacts > and transitions for AVX-512 are, to put it charitably, forthcoming. I acknowledge the parallels with AVX-512, in that AMX adds new instructions, and it has even bigger registers. I also acknowledge that the AVX-512 rollout (and arguably, its brief existence on client CPUs) was problematic. My understanding is that Intel continues to learn (a lot) from its mistakes. I believe that the AVX-512 credits problem has been largely eliminated on newer Xeons. My understanding is that AMX is implemented only in CPUs that actually have the hardware to properly support AMX. If it were not, then that would be a problem for Intel to deal with in hardware, not a problem for Linux to deal with in software. > Correctness: PKRU via the kernel's normal XSAVE path would simply be > incorrect. Do we really trust that this won't get repeated? Also, > frankly, a command line option that may well break lots of userspace > but that we fully expect Intel to recommend setting is not a good > thing. There is no analogy between AMX and PKRU, except the fact that they are both features, and at one time, both were new. I am unaware of anybody at Intel recommending that any cmdline be set that would break userspace. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
On Thu, Mar 25, 2021 at 9:50 PM Thomas Gleixner wrote: > Please provide the architectural document which guarantees that and does > so in a way that it can be evaluated by the kernel. Have not seen that, > so it does not exist at all. > > Future CPUID attributes are as useful as the tweet of today. I will do so the moment I am permitted. I'm fine with dropping patch 22 until it can rely on the assurance of that architectural feature. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
On Thu, Mar 25, 2021 at 9:42 PM Andy Lutomirski wrote: > Regardless of what you call AMX, AMX requires kernel enabling. I submit that, after the generic XFD support is in place, there is exactly 1 bit that needs to be flipped to enable user applications to benefit from AMX. I submit that the patch that knows about AMX and double-checks the state size is superfluous. I submit that updating /proc/cpuinfo is superfluous. What AMX-specific kernel enabling did I miss? > Specifically, it appears that leaving AMX in use in the XINUSE sense > degrades system performance and/or power. Please share the specifics about what performance or power issue you anticipate. thanks, -Len
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
On Thu, Mar 25, 2021 at 7:10 PM Dave Hansen wrote: > > On 3/25/21 3:59 PM, Len Brown wrote: > > We call AMX a "simple state feature" -- it actually requires NO KERNEL > > ENABLING > > above the generic state save/restore to fully support userspace AMX > > applications. > > > > While not all ISA extensions can be simple state features, we do expect > > future features to share this trait, and so we want to be sure that it is > > simple > > to update the kernel to turn those features on (and when necessary, off). > From some IRC chats with Thomas and Andy, I think it's safe to say that > they're not comfortable blindly enabling even our "simple features". I > think we're going to need at least some additional architecture to get > us to a point where everyone will be comfortable. Hi Dave, There is no code in this patch series, including patch 22, that enables an unvalidated feature by default. Yes, I fully accept that patch 22 allows a user to enable something that a distro didn't validate. If there is a new requirement that the kernel cmdline not allow anything that a distro didn't explicitly validate, then about 99.9% of the kernel cmdline options that exist today would need to be removed. Does such a requirement exist, or does it not? thanks, -Len
Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
On Sat, Mar 20, 2021 at 4:57 PM Thomas Gleixner wrote: > We won't enable features which are unknown ever. Keep that presilicon > test gunk where it belongs: In the Intel poison cabinet along with the > rest of the code which nobody ever wants to see. I agree, it would be irresponsible to enable unvalidated features by default, and pre-silicon "test gunk" should be kept out of the upstream kernel. This patch series is intended solely to enable fully validated hardware features, with product quality kernel support. The reason that the actual AMX feature isn't mentioned until the 16th patch in this series is because all of the patches before it are generic state save/restore patches, that are not actually specific to AMX. We call AMX a "simple state feature" -- it actually requires NO KERNEL ENABLING above the generic state save/restore to fully support userspace AMX applications. While not all ISA extensions can be simple state features, we do expect future features to share this trait, and so we want to be sure that it is simple to update the kernel to turn those features on (and when necessary, off). There will be a future CPUID attribute that will help us identify future simple-state features. For AMX, of course, we simply know. So after the generic state management support, the kernel enabling of AMX is not actually required to run applications. Just like when a new instruction is added that re-uses existing state -- the application or library can check CPUID and just use it. It is a formality (perhaps an obsolete one) that we add every feature flag to /proc/cpuinfo for the "benefit" of userspace. The reason we propose this cmdline switch is 1. Ability of customers to disable a feature right away if an issue is found. Unlike the CPUID cmdline that works on flags, this is the ability to turn off a feature based on its state number. I.e., there could be 20 features that use the same state, and you can turn them all off at once this way. 2. 
Ability of customers to enable a feature that is disabled by default in their kernel. Yes, this will taint their kernel (thanks Andy), but we have customers that want to run the new feature on day 0 before they have got a distro update to change the default, and this gives them a way to do that. Yeah, the cmdline syntax is not a user-friendly mnemonic, and I don't know that making it so would be an improvement. Like the CPUID cmdline, it is precise, it is future-proof, and it is used only in special situations. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2 wrote: > > IMO, the problem with AVX512 state > > is that we guaranteed it will be zero for XINUSE=0. > > That means we have to write 0's on saves. > Why do "we have to write 0's on saves" when XINUSE=0? > > Since, per the SDM, if XINUSE=0, XSAVES will *not* save the data and > the xstate_bv bit is 0; if XSAVE is used, it needs to save the state but > the xstate_bv bit is also 0. > > It would be better > > to be able to skip the write -- even if we can't save the space > > we can save the data transfer. (This is what we did for AMX). > With the XFD feature, when XFD=1, the XSAVE command still has to save the INIT state > to the area. So it seems that with XINUSE=0 and XFD=1, the XSAVE(S) commands > do the same, in that both can help save the data transfer. Hi Jing, Good observation! There are 3 cases. 1. Task context switch save into the context switch buffer. Here we use XSAVES, and as you point out, XSAVES includes the compaction optimization feature tracked by XINUSE. So when AMX is enabled, but clean, XSAVES doesn't write zeros. Further, it omits the buffer space for AMX in the destination altogether! However, since XINUSE=1 is possible, we have to *allocate* a buffer large enough to handle the dirty data for when XSAVES cannot employ that optimization. 2. Entry into user signal handler saves into the user space sigframe. Here we use XSAVE, and so the hardware will write zeros for XINUSE=0, and for AVX512, we save neither time nor space. My understanding is that for application compatibility, we can *not* compact the destination buffer that user-space sees. This is because existing code may have adopted fixed-size offsets (which is unfortunate). And so, for AVX512, we both reserve the space, and we write zeros for clean AVX512 state. For AMX, we must still reserve the space, but we are not going to write zeros for clean state. We do this in software by checking XINUSE=0, and clearing the xstate_bv for the XSAVE. 
As a result, for XINUSE=0, we can skip writing the zeros, even though we can't compress the space. 3. User space always uses fully uncompacted XSAVE buffers. > The reason I'm interested in XINUSE denotation is that it might be helpful > for the XFD MSRs context switch cost during vmexit and vmenter. As the guest OS may be using XFD, the VMM cannot use it for itself. Rather, the VMM must context switch it when it switches between guests. (or not expose it to guests at all) cheers, Len Brown, Intel Open Source Technology Center
Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
> I have an obnoxious question: do we really want to use the XFD mechanism? Obnoxious questions are often the most valuable! :-) > Right now, glibc, and hence most user space code, blindly uses > whatever random CPU features are present for no particularly good > reason, which means that all these features get stuck in the XINUSE=1 > state, even if there is no code whatsoever in the process that > benefits. AVX512 is bad enough as we're seeing right now. AMX will > be much worse if this happens. > > We *could* instead use XCR0 and require an actual syscall to enable > it. We could even then play games like requiring whomever enables the > feature to allocate memory for the state save area for signals, and > signal delivery could save the state and disable the feature, thus > preventing the signal frame from blowing up to 8 or 12 or who knows > how many kB. This approach would have some challenges. Enumeration today has two parts: 1. CPUID tells you if the feature exists in the HW 2. xgetbv/XCR0 tells you if the OS supports that feature Since #2 would be missing, you are right, there would need to be a new API enabling the user to request the OS to enable support for that task. If that new API is not invoked before the user touches the feature, they die with a #UD. And so there would need to be some assurance that the API is successfully called before any library might use the feature. Is there a practical way to guarantee that, given that the feature may be used (or not) only by a dynamically linked library? If a library spawns threads and queries the size of XSAVE before the API is called, it may be confused when that size changes after the API is called. So a simple question, "who calls the API, and when?" isn't so simple. Finally, note that XCR0 updates cause a VMEXIT, while XFD updates do not. So context switching XCR0 is possible, but is problematic. The other combination is XFD + API rather than XCR0 + API. 
With XFD, the context switching is faster, and the faulting (#NM and the new MSR with #NM cause) is viable. We have the bit set in XCR0, so no state size advantage. Still have issues with API logistics. So we didn't see that the API adds any value, only pain, over transparent 1st use enabling with XFD and no API. cheers, Len Brown, Intel Open Source Technology Center ps. I agree that un-necessary XINUSE=1 is possible. Notwithstanding the issues initially deploying AVX512, I am skeptical that it is common today. IMO, the problem with AVX512 state is that we guaranteed it will be zero for XINUSE=0. That means we have to write 0's on saves. It would be better to be able to skip the write -- even if we can't save the space we can save the data transfer. (This is what we did for AMX). pps. your idea of requiring the user to allocate their own signal stack is interesting. It isn't really about allocating the stack, though -- the stack of the task that uses the feature is generally fine already. The opportunity is to allow tasks that do *not* use the new feature to get away with minimal data transfer and stack size. As we don't have the 0's guarantee for AMX, we bought the important part of that back.
Re: [PATCH v7 0/6] x86: Improve Minimum Alternate Stack Size
On Wed, Mar 17, 2021 at 6:45 AM Ingo Molnar wrote: > > > * Ingo Molnar wrote: > > > > > * Chang S. Bae wrote: > > > > > During signal entry, the kernel pushes data onto the normal userspace > > > stack. On x86, the data pushed onto the user stack includes XSAVE state, > > > which has grown over time as new features and larger registers have been > > > added to the architecture. > > > > > > MINSIGSTKSZ is a constant provided in the kernel signal.h headers and > > > typically distributed in lib-dev(el) packages, e.g. [1]. Its value is > > > compiled into programs and is part of the user/kernel ABI. The MINSIGSTKSZ > > > constant indicates to userspace how much data the kernel expects to push > > > on > > > the user stack, [2][3]. > > > > > > However, this constant is much too small and does not reflect recent > > > additions to the architecture. For instance, when AVX-512 states are in > > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 2KB. > > > > > > The bug report [4] explains this as an ABI issue. The small MINSIGSTKSZ > > > can > > > cause user stack overflow when delivering a signal. > > > > > uapi: Define the aux vector AT_MINSIGSTKSZ > > > x86/signal: Introduce helpers to get the maximum signal frame size > > > x86/elf: Support a new ELF aux vector AT_MINSIGSTKSZ > > > selftest/sigaltstack: Use the AT_MINSIGSTKSZ aux vector if available > > > x86/signal: Detect and prevent an alternate signal stack overflow > > > selftest/x86/signal: Include test cases for validating sigaltstack > > > > So this looks really complicated, is this justified? > > > > Why not just internally round up sigaltstack size if it's too small? > > This would be more robust, as it would fix applications that use > > MINSIGSTKSZ but don't use the new AT_MINSIGSTKSZ facility. > > > > I.e. does AT_MINSIGSTKSZ have any other uses than avoiding the > > segfault if MINSIGSTKSZ is used to create a small signal stack? > > I.e. 
if the kernel sees a too small ->ss_size in sigaltstack() it > would ignore ->ss_sp and mmap() a new sigaltstack instead and use that > for the signal handler stack. > > This would automatically make MINSIGSTKSZ - and other too small sizes > work today, and in the future. > > But the question is, is there user-space usage of sigaltstacks that > relies on controlling or reading the contents of the stack? > > longjmp using programs perhaps? For the legacy binary that requests a too-small sigaltstack, there are several choices: We could detect the too-small stack at sigaltstack(2) invocation and return an error. This results in two deal-killing problems: First, some applications don't check the return value, so the check would be fruitless. Second, those that check and error-out may be programs that never actually take the signal, and so we'd be causing a dusty binary to exit, when it didn't exit on another system, or another kernel. Or we could detect the too small stack at signal registration time. This has the same two deal-killers as above. Then there is the approach in this patch-set, which detects an imminent stack overflow at run time. It has neither of the two problems above, and the benefit that we now prevent data corruption that could have been happening on some systems already today. The down side is that the dusty binary that does request the too-small stack can now die at run time. So your idea of recognizing the problem and conjuring up a sufficient stack is compelling, since it would likely "just work", no matter how dumb the program. But where would the sufficient stack come from -- is this a new kernel buffer, or is there a way to abscond with some user memory? I would expect a signal handler to look at the data on its stack and nobody else will look at that stack. 
But this is already an unreasonable program for allocating a special signal stack in the first place :-/ So yes, one could imagine the signal handler could longjmp instead of gracefully completing, and if this specially allocated signal stack isn't where the user planned, that could be trouble. Another idea we discussed was to detect the potential overflow at run-time, and instead of killing the process, just push the signal onto the regular user stack. This might actually work, but it is sort of devious; and it would not work in the case where the user overflowed their regular stack already, which may be the most (only?) compelling reason that they allocated and declared a special sigaltstack in the first place... -- Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
Doug, The offset works for control. However, it is erroneous to use it for reporting of the actual temperature, like I did in turbostat. Thus, I'm going to revert the patch that added its use in turbostat for the Temperature column. thanks, -Len On Fri, Mar 12, 2021 at 1:26 AM Doug Smythies wrote: > > Hi Len, > > > thank you for your reply. > > On Thu, Mar 11, 2021 at 3:19 PM Len Brown wrote: > > > > Thanks for the close read, Doug. > > > > This field size actually varies from system to system, > > but the reality is that the offset is never that big, and so the > > smaller mask is sufficient. > > Disagree. > > I want to use an offset of 26. > > > Finally, this may all be moot, because there is discussion that using > > the offset this way is simply erroneous. > > Disagree. > It works great. > As far as I know/recall I was the only person that responded to Rui's thread > "thermal/intel: introduce tcc cooling driver" [1] > And, I spent quite a bit of time doing so. > However, I agree the response seems different between the two systems > under test, Rui's and mine. > > [1] https://marc.info/?l=linux-pm&m=161070345329806&w=2 > > > stay tuned. > > O.K. > > ... Doug > > > > -Len > > > > > > On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies > > wrote: > > > > > > The TCC offset mask is incorrect, resulting in > > > incorrect target temperature calculations, if > > > the offset is big enough to exceed the mask size. 
> > > > > > Signed-off-by: Doug Smythies > > > --- > > > tools/power/x86/turbostat/turbostat.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/tools/power/x86/turbostat/turbostat.c > > > b/tools/power/x86/turbostat/turbostat.c > > > index 389ea5209a83..d7acdd4d16c4 100644 > > > --- a/tools/power/x86/turbostat/turbostat.c > > > +++ b/tools/power/x86/turbostat/turbostat.c > > > @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp() > > > > > > target_c = (msr >> 16) & 0xFF; > > > > > > - offset_c = (msr >> 24) & 0xF; > > > + offset_c = (msr >> 24) & 0x3F; > > > > > > tcc = target_c - offset_c; > > > > > > -- > > > 2.25.1 > > > > > > > > > -- > > Len Brown, Intel Open Source Technology Center -- Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
Thanks for the close read, Doug. This field size actually varies from system to system, but the reality is that the offset is never that big, and so the smaller mask is sufficient. Finally, this may all be moot, because there is discussion that using the offset this way is simply erroneous. stay tuned. -Len On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies wrote: > > The TCC offset mask is incorrect, resulting in > incorrect target temperature calculations, if > the offset is big enough to exceed the mask size. > > Signed-off-by: Doug Smythies > --- > tools/power/x86/turbostat/turbostat.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/power/x86/turbostat/turbostat.c > b/tools/power/x86/turbostat/turbostat.c > index 389ea5209a83..d7acdd4d16c4 100644 > --- a/tools/power/x86/turbostat/turbostat.c > +++ b/tools/power/x86/turbostat/turbostat.c > @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp() > > target_c = (msr >> 16) & 0xFF; > > - offset_c = (msr >> 24) & 0xF; > + offset_c = (msr >> 24) & 0x3F; > > tcc = target_c - offset_c; > > -- > 2.25.1 > -- Len Brown, Intel Open Source Technology Center
Re: [PATCH V2] rtc: mc146818: Dont test for bit 0-5 in Register D
Thanks for the update, Thomas. V1 prevented rc6 automated suspend/resume testing on all 13 of my local machines. V2 applied, and they are back in business. tested-by: Len Brown On Mon, Feb 1, 2021 at 2:25 PM Thomas Gleixner wrote: > > The recent change to validate the RTC turned out to be overly tight. > > While it cures the problem on the reporters machine it breaks machines > with Intel chipsets which use bit 0-5 of the D register. So check only > for bit 6 being 0 which is the case on these Intel machines as well. > > Fixes: 211e5db19d15 ("rtc: mc146818: Detect and handle broken RTCs") > Reported-by: Serge Belyshev > Reported-by: Dirk Gouders > Signed-off-by: Thomas Gleixner > --- > V2: Provide the actual delta patch. Should have stayed away from > computers today > --- > drivers/rtc/rtc-cmos.c |4 ++-- > drivers/rtc/rtc-mc146818-lib.c |4 ++-- > 2 files changed, 4 insertions(+), 4 deletions(-) > > --- a/drivers/rtc/rtc-cmos.c > +++ b/drivers/rtc/rtc-cmos.c > @@ -805,8 +805,8 @@ cmos_do_probe(struct device *dev, struct > > spin_lock_irq(&rtc_lock); > > - /* Ensure that the RTC is accessible. Bit 0-6 must be 0! */ > - if ((CMOS_READ(RTC_VALID) & 0x7f) != 0) { > + /* Ensure that the RTC is accessible. Bit 6 must be 0! */ > + if ((CMOS_READ(RTC_VALID) & 0x40) != 0) { > spin_unlock_irq(&rtc_lock); > dev_warn(dev, "not accessible\n"); > retval = -ENXIO; > --- a/drivers/rtc/rtc-mc146818-lib.c > +++ b/drivers/rtc/rtc-mc146818-lib.c > @@ -21,8 +21,8 @@ unsigned int mc146818_get_time(struct rt > > again: > spin_lock_irqsave(&rtc_lock, flags); > - /* Ensure that the RTC is accessible. Bit 0-6 must be 0! */ > - if (WARN_ON_ONCE((CMOS_READ(RTC_VALID) & 0x7f) != 0)) { > + /* Ensure that the RTC is accessible. Bit 6 must be 0! */ > + if (WARN_ON_ONCE((CMOS_READ(RTC_VALID) & 0x40) != 0)) { > spin_unlock_irqrestore(&rtc_lock, flags); > memset(time, 0xff, sizeof(*time)); > return 0; -- Len Brown, Intel Open Source Technology Center
Re: [PATCH v2 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
> > On Fri, Nov 20, 2020 at 12:03 AM Andy Lutomirski wrote: > We may want to taint the kernel if one of these flags is used I agree with Andy. I can imagine a distro would want to see the taint bit set if a user overrides the default that they ship and support. Chang, Please taint when the enable flag is used -- Thanks! > because, > frankly, Intel’s track record is poor. Suppose we get a new feature > with PKRU-like semantics -- switching it blindly using > XSAVE(C,S,OPT,whatever) would simply be incorrect. A valid concern. To mitigate, there is a new CPUID bit on the way to tell us explicitly which features are simple (require just XSAVE), versus which features are not simple and require more complex support. When that bit is locked down, we will add that sanity check here. > And XFD itself has > problems — supposedly it’s difficult or impossible to virtualize. It > wouldn’t utterly shock me if Intel were to drop IA32_XFD_ERR and > replace it with a new mechanism that’s less janky. > > So this whole thing makes me quite nervous. > > (I'm not a virtualization expert, but AIUI IA32_XFD_ERR has some > issues. If it's too late to fix those issues, Intel could probably > get away with completely dropping IA32_XFD_ERR from the spec -- OSes > can handle AMX just fine without it. Then the next XFD-able feature > could introduce a new improved way of reporting which feature > triggered #NM.) I don't follow the XFD concern. There are a couple of cases. If the hypervisor isn't aware of XFD, it neither uses it nor shows it to the guests. If the hypervisor is aware of XFD and supports TMUL, it allocates its own context switch buffers at boot time, and does not use the XFD optimization to save space inside the hypervisor. It can expose XFD to the guests (and context switch it when guests switch on/off physical CPUs). The guest sees two cases: either XFD exists or it doesn't, just like on real hardware. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v2 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
On Fri, Nov 20, 2020 at 12:03 AM Andy Lutomirski wrote: > > On Thu, Nov 19, 2020 at 3:37 PM Chang S. Bae wrote: > > "xstate.enable=0x6" will enable AMX on a system that does NOT have AMX > > compiled into XFEATURE_MASK_USER_ENABLED (assuming the kernel is new enough > > to support this feature). > > > > What's the purpose of xstate.enable? I can't really imagine it's > useful for AMX. I suppose it could be useful for hypothetical > post-AMX features, but that sounds extremely dangerous. Intel has > changed its strategy so many times on XSTATE extensibility that I find > it quite hard to believe that supporting unknown states is wise. Not hypothetical -- there are subsequent hardware features coming that will use the same exact XSTATE support that this series puts in place for AMX. We know that when those features ship in new hardware, there will be a set of customers who want to exercise those features immediately, but their kernel binary has not yet been re-compiled to see those features by default. The purpose of "xstate.enable" is to empower those users to be able to explicitly enable support using their existing binary. You are right -- the feature isn't needed to enable AMX, unless somebody went to the trouble of building a kernel with the AMX source update, but chose to disable AMX-specific recognition by default. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH v2 14/22] x86/fpu/xstate: Inherit dynamic user state when used in the parent
On Fri, Nov 20, 2020 at 12:08 AM Andy Lutomirski wrote: > > On Thu, Nov 19, 2020 at 3:37 PM Chang S. Bae wrote: > > > > When a new task is created, the kernel copies all the states from the > > parent. If the parent already has any dynamic user state in use, the new > > task has to expand the XSAVE buffer to save them. Also, disable the > > associated first-use fault. > > This seems like a mistake. If init uses AMX for some misguided > reason, every task on the whole system will end up with AMX state > allocated. Andy, you are right -- the child can (and should) start with the un-expanded context switch buffer, and as a result XFD should be armed to fire on the child's first access to TMM hardware. TMM registers are scratchpad, they will never be used to pass globals, say, from parent to child thread. Further, they are volatile and caller saved. The callee can assume they are empty -- so even by virtue of being in a fork system call, that state is already gone. thanks, Len Brown, Intel Open Source Technology Center
[GIT PULL] turbostat: update to version 20.09.30
Hi Linus, Please pull these turbostat related patches. thanks! Len Brown, Intel Open Source Technology Center The following changes since commit e00b62f0b06d0ae2b844049f216807617aff0cdb: x86/cpu: Add Lakefield, Alder Lake and Rocket Lake models to the to Intel CPU family (2020-07-25 12:16:59 +0200) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git turbostat for you to fetch changes up to 3e9fa9983b9297407c2448114d6d27782d5e2ef2: tools/power turbostat: update version number (2020-11-10 11:41:36 -0500) Alexander A. Klimov (1): tools/power turbostat: Replace HTTP links with HTTPS ones: TURBOSTAT UTILITY Alexander Monakov (1): tools/power turbostat: Build with _FILE_OFFSET_BITS=64 Antti Laakso (1): tools/power turbostat: Remove empty columns for Jacobsville Chen Yu (3): tools/power turbostat: Make the energy variable to be 64 bit tools/power turbostat: Introduce functions to accumulate RAPL consumption tools/power turbostat: Enable accumulate RAPL display David Arcari (1): tools/power turbostat: Fix output formatting for ACPI CST enumeration Doug Smythies (1): tools/power turbostat: Always print idle in the system configuration header Kim Phillips (1): tools/power turbostat: Support AMD Family 19h Len Brown (7): tools/power turbostat: Print /dev/cpu_dma_latency tools/power turbostat: Support additional CPU model numbers tools/power turbostat: Skip pc8, pc9, pc10 columns, if they are disabled tools/power turbostat: adjust for temperature offset tools/power turbostat: harden against cpu hotplug powercap: restrict energy meter to root access tools/power turbostat: update version number Ondřej Lysoněk (1): tools/power x86_energy_perf_policy: Input/output error in a VM Prarit Bhargava (1): tools/power turbostat: Use sched_getcpu() instead of hardcoded cpu 0 Rafael Antognolli (1): tools/power turbostat: Add a new GFXAMHz column that exposes gt_act_freq_mhz. 
 drivers/powercap/powercap_sys.c       |   4 +-
 tools/power/x86/turbostat/Makefile    |   3 +-
 tools/power/x86/turbostat/turbostat.8 |   2 +-
 tools/power/x86/turbostat/turbostat.c | 573 -
 .../x86_energy_perf_policy.c          |  67 ++-
 5 files changed, 509 insertions(+), 140 deletions(-)
Re: [PATCH] tools/power turbostat: call pread64 in kernel directly
> -D_FILE_OFFSET_BITS=64 Applied. thanks! -Len On Mon, Aug 24, 2020 at 12:09 AM Liwei Song wrote: > > > > On 8/24/20 04:54, Alexander Monakov wrote: > > Hi, > > > > I am not the original submitter, but I have answers and a proper patch :) > > > > On Fri, 21 Aug 2020, Len Brown wrote: > > > >> Re: offset size > >> > >> The offsets on this file are the MSR offsets. > >> What MSR are you trying to access at offset 0xc0010299? > > > > This MSR in particular is part of the AMD RAPL (energy measurements) interface. > > > >> Re: pread vs pread64 > >> > >> If I take on faith that you have some kind of 32-bit execution > >> environment that makes pread into pread32 instead of pread64, and that > >> truncates an off_t to 32-bits from 64-bits, and it actually makes > >> sense to request a read at this large offset... > > > > The problem here stems from the backward compatibility in Glibc: off_t is > > 32-bit on 32-bit x86, unless compiled with -D_FILE_OFFSET_BITS=64. This > > macro should be used for all new code. Distros should enable it for all > > builds, but when one builds turbostat 'by hand', they hit the issue. > > > >> would we really have to invoke syscall() directly -- couldn't we > >> invoke pread64() directly? (eg. below) > > > > No, the proper fix is to pass -D_FILE_OFFSET_BITS=64 to the compiler. > > > > Here's the patch: > > This patch works with my case. > > Thanks, > Liwei. > > > > > > ---8<--- > > > > From: Alexander Monakov > > Date: Sun, 23 Aug 2020 23:27:02 +0300 > > Subject: [PATCH] turbostat: build with _FILE_OFFSET_BITS=64 > > > > For compatibility reasons, Glibc off_t is a 32-bit type on 32-bit x86 > > unless _FILE_OFFSET_BITS=64 is defined. Add this define, as otherwise > > reading MSRs with index 0x80000000 and above attempts a pread with a > > negative offset, which fails. 
> > > > Signed-off-by: Alexander Monakov > > --- > > tools/power/x86/turbostat/Makefile | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/tools/power/x86/turbostat/Makefile > > b/tools/power/x86/turbostat/Makefile > > index 2b6551269e43..40ae44402eec 100644 > > --- a/tools/power/x86/turbostat/Makefile > > +++ b/tools/power/x86/turbostat/Makefile > > @@ -12,6 +12,7 @@ turbostat : turbostat.c > > override CFLAGS += -O2 -Wall -I../../../include > > override CFLAGS += > > -DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"' > > override CFLAGS += > > -DINTEL_FAMILY_HEADER='"../../../../arch/x86/include/asm/intel-family.h"' > > +override CFLAGS += -D_FILE_OFFSET_BITS=64 > > override CFLAGS += -D_FORTIFY_SOURCE=2 > > > > %: %.c > > -- Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power turbostat: call pread64 in kernel directly
Re: offset size The offsets on this file are the MSR offsets. What MSR are you trying to access at offset 0xc0010299? Re: pread vs pread64 If I take on faith that you have some kind of 32-bit execution environment that makes pread into pread32 instead of pread64, and that truncates an off_t to 32-bits from 64-bits, and it actually makes sense to request a read at this large offset... would we really have to invoke syscall() directly -- couldn't we invoke pread64() directly? (eg. below) thanks, -Len @@ -490,11 +491,12 @@ int get_msr_fd(int cpu) return fd; } -int get_msr(int cpu, off_t offset, unsigned long long *msr) +int get_msr(int cpu, unsigned long long offset, unsigned long long *msr) { ssize_t retval; - retval = pread(get_msr_fd(cpu), msr, sizeof(*msr), offset); + retval = pread64(get_msr_fd(cpu), msr, sizeof(*msr), offset); if (retval != sizeof *msr) err(-1, "cpu%d: msr offset 0x%llx read failed", cpu, (unsigned long long)offset); On Thu, Aug 13, 2020 at 10:17 PM Liwei Song wrote: > > with multilib lib32 support, the rootfs will be 32-bit, > the kernel is still 64-bit, and in this case running turbostat > will fail with an "out of range" error. > > Thanks, > Liwei. > > On 8/14/20 05:43, Len Brown wrote: > > Huh? > > > > On Fri, Jul 17, 2020 at 2:09 AM Liwei Song wrote: > >> > >> with a 32-bit rootfs, the offset may be out of range when set > >> to 0xc0010299; define it as "unsigned long long" type and > >> call pread64 in the kernel directly.
> >> > >> Signed-off-by: Liwei Song > >> --- > >> tools/power/x86/turbostat/turbostat.c | 5 +++-- > >> 1 file changed, 3 insertions(+), 2 deletions(-) > >> > >> diff --git a/tools/power/x86/turbostat/turbostat.c > >> b/tools/power/x86/turbostat/turbostat.c > >> index 33b370865d16..4c5cdfcb5721 100644 > >> --- a/tools/power/x86/turbostat/turbostat.c > >> +++ b/tools/power/x86/turbostat/turbostat.c > >> @@ -33,6 +33,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> > >> char *proc_stat = "/proc/stat"; > >> FILE *outf; > >> @@ -381,11 +382,11 @@ int get_msr_fd(int cpu) > >> return fd; > >> } > >> > >> -int get_msr(int cpu, off_t offset, unsigned long long *msr) > >> +int get_msr(int cpu, unsigned long long offset, unsigned long long *msr) > >> { > >> ssize_t retval; > >> > >> - retval = pread(get_msr_fd(cpu), msr, sizeof(*msr), offset); > >> + retval = syscall(SYS_pread64, get_msr_fd(cpu), msr, sizeof(*msr), > >> offset); > >> > >> if (retval != sizeof *msr) > >> err(-1, "cpu%d: msr offset 0x%llx read failed", cpu, > >> (unsigned long long)offset); > >> -- > >> 2.17.1 > >> > > > > -- Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power turbostat: fix output formatting for ACPI CST enumeration
Hi Dave, I think this is fine. Indeed, I actually went ahead and applied it a week or so ago. The only alternative that I can think of is actually shortening the ACPI C-state names in the intel_idle driver -- which is still an option. It would not be the first time we have tweaked the names used in that driver to make tools happier... My apologies for neglecting to send you an ACK. I had intended to send my queued series to the list, which would suffice for all the ACKs, but that send and the subsequent push got delayed by this and that. So I'll try to ack as I go, so it is clear at any time where a patch stands. thanks! Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power turbostat: Support Sapphire Rapids
Already have this one. thanks! Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power turbostat: Support AMD Family 19h
Applied. thanks! Len Brown, Intel Open Source Technology Center
Re: [PATCH 2/2][RFC] tools/power turbostat: Introduce reliable RAPL display
ne(int argc, char **argv) > {"hide",required_argument, 0, 'H'},// > meh, -h taken by --help > {"Joules", no_argument,0, 'J'}, > {"list",no_argument,0, 'l'}, > + {"Longtime",no_argument,0, 'L'}, > {"out", required_argument, 0, 'o'}, > {"quiet", no_argument,0, 'q'}, > {"show",required_argument, 0, 's'}, > @@ -5746,7 +5995,7 @@ void cmdline(int argc, char **argv) > > progname = argv[0]; > > - while ((opt = getopt_long_only(argc, argv, "+C:c:Dde:hi:Jn:o:qST:v", > + while ((opt = getopt_long_only(argc, argv, "+C:c:Dde:hi:JLn:o:qST:v", > long_options, &option_index)) != -1) { > switch (opt) { > case 'a': > @@ -5800,6 +6049,9 @@ void cmdline(int argc, char **argv) > list_header_only++; > quiet++; > break; > + case 'L': > + longtime = 1; > + break; > case 'o': > outf = fopen_or_die(optarg, "w"); > break; > @@ -5864,6 +6116,8 @@ int main(int argc, char **argv) > return 0; > } > > + if (longtime) > + msr_longtime_record(); > /* > * if any params left, it must be a command to fork > */ > -- > 2.17.1 > -- Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power turbostat: call pread64 in kernel directly
Huh? On Fri, Jul 17, 2020 at 2:09 AM Liwei Song wrote: > > with a 32-bit rootfs, the offset may be out of range when set > to 0xc0010299; define it as "unsigned long long" type and > call pread64 in the kernel directly. > > Signed-off-by: Liwei Song > --- > tools/power/x86/turbostat/turbostat.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/tools/power/x86/turbostat/turbostat.c > b/tools/power/x86/turbostat/turbostat.c > index 33b370865d16..4c5cdfcb5721 100644 > --- a/tools/power/x86/turbostat/turbostat.c > +++ b/tools/power/x86/turbostat/turbostat.c > @@ -33,6 +33,7 @@ > #include > #include > #include > +#include > > char *proc_stat = "/proc/stat"; > FILE *outf; > @@ -381,11 +382,11 @@ int get_msr_fd(int cpu) > return fd; > } > > -int get_msr(int cpu, off_t offset, unsigned long long *msr) > +int get_msr(int cpu, unsigned long long offset, unsigned long long *msr) > { > ssize_t retval; > > - retval = pread(get_msr_fd(cpu), msr, sizeof(*msr), offset); > + retval = syscall(SYS_pread64, get_msr_fd(cpu), msr, sizeof(*msr), > offset); > > if (retval != sizeof *msr) > err(-1, "cpu%d: msr offset 0x%llx read failed", cpu, > (unsigned long long)offset); > -- > 2.17.1 > -- Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power/x86/turbostat: Always print idle in the system configuration header
Applied. thanks! -Len On Thu, Mar 26, 2020 at 4:36 PM Doug Smythies wrote: > > If the --quiet option is not used, turbostat prints a useful system > configuration header during startup. Inclusion of idle system configuration > information is currently a function of the columns chosen to be displayed. > > Always list the idle system configuration. > > Signed-off-by: Doug Smythies > --- > tools/power/x86/turbostat/turbostat.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/tools/power/x86/turbostat/turbostat.c > b/tools/power/x86/turbostat/turbostat.c > index 33b370865d16..834b86676d00 100644 > --- a/tools/power/x86/turbostat/turbostat.c > +++ b/tools/power/x86/turbostat/turbostat.c > @@ -3530,9 +3530,6 @@ dump_sysfs_cstate_config(void) > int state; > char *sp; > > - if (!DO_BIC(BIC_sysfs)) > - return; > > if (access("/sys/devices/system/cpu/cpuidle", R_OK)) { > fprintf(outf, "cpuidle not loaded\n"); > return; > -- > 2.25.1 > -- Len Brown, Intel Open Source Technology Center
Re: [PATCH] tools/power turbostat: Make interval calculation per thread to reduce jitter
Yeah, I like this patch, and I have applied it. thanks! -Len On Fri, Jun 7, 2019 at 12:28 PM Ghannam, Yazen wrote: > > > -Original Message- > > From: Ghannam, Yazen > > Sent: Tuesday, April 23, 2019 12:53 PM > > To: Ghannam, Yazen ; linux...@vger.kernel.org; > > len.br...@intel.com > > Cc: linux-kernel@vger.kernel.org; Len Brown > > Subject: RE: [PATCH] tools/power turbostat: Make interval calculation per > > thread to reduce jitter > > > > > -Original Message- > > > From: linux-kernel-ow...@vger.kernel.org > > > On Behalf Of Ghannam, Yazen > > > Sent: Monday, March 25, 2019 12:33 PM > > > To: linux...@vger.kernel.org > > > Cc: Ghannam, Yazen ; linux-kernel@vger.kernel.org; > > > l...@kernel.org > > > Subject: [PATCH] tools/power turbostat: Make interval calculation per > > > thread to reduce jitter > > > > > > From: Yazen Ghannam > > > > > > Turbostat currently normalizes TSC and other values by dividing by an > > > interval. This interval is the delta between the start of one global > > > (all counters on all CPUs) sampling and the start of another. However, > > > this introduces a lot of jitter into the data. > > > > > > In order to reduce jitter, the interval calculation should be based on > > > timestamps taken per thread and close to the start of the thread's > > > sampling. > > > > > > Define a per thread time value to hold the delta between samples taken > > > on the thread. > > > > > > Use the timestamp taken at the beginning of sampling to calculate the > > > delta. > > > > > > Move the thread's beginning timestamp to after the CPU migration to > > > avoid jitter due to the migration. > > > > > > Use the global time delta for the average time delta. > > > > > > Signed-off-by: Yazen Ghannam > > > --- > > > > Hi Len, > > > > Any comments on this patch? > > > > Hi Len, > > Just wanted to check in. Do you have any comments on this patch? > > Thanks, > Yazen -- Len Brown, Intel Open Source Technology Center
[tip:x86/topology] hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages
Commit-ID: 835896a59b9577d0bc2131e027c37bdde5b979af Gitweb: https://git.kernel.org/tip/835896a59b9577d0bc2131e027c37bdde5b979af Author: Len Brown AuthorDate: Mon, 13 May 2019 13:59:01 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:36 +0200 hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages Syntax update only -- no logical or functional change. In response to the new multi-die/package changes, update variable names to use the more generic thermal "zone" terminology, instead of "package", as the zones can refer to either packages or die. Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Cc: Zhang Rui Link: https://lkml.kernel.org/r/facecfd3525d55c2051f63a7ec709aeb03cc1dc1.1557769318.git.len.br...@intel.com --- drivers/hwmon/coretemp.c | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c index c64ce32d3214..4ebed90d81aa 100644 --- a/drivers/hwmon/coretemp.c +++ b/drivers/hwmon/coretemp.c @@ -109,10 +109,10 @@ struct platform_data { struct device_attribute name_attr; }; -/* Keep track of how many package pointers we allocated in init() */ -static int max_packages __read_mostly; -/* Array of package pointers. Serialized by cpu hotplug lock */ -static struct platform_device **pkg_devices; +/* Keep track of how many zone pointers we allocated in init() */ +static int max_zones __read_mostly; +/* Array of zone pointers. 
Serialized by cpu hotplug lock */ +static struct platform_device **zone_devices; static ssize_t show_label(struct device *dev, struct device_attribute *devattr, char *buf) @@ -435,10 +435,10 @@ static int chk_ucode_version(unsigned int cpu) static struct platform_device *coretemp_get_pdev(unsigned int cpu) { - int pkgid = topology_logical_die_id(cpu); + int id = topology_logical_die_id(cpu); - if (pkgid >= 0 && pkgid < max_packages) - return pkg_devices[pkgid]; + if (id >= 0 && id < max_zones) + return zone_devices[id]; return NULL; } @@ -544,7 +544,7 @@ static int coretemp_probe(struct platform_device *pdev) struct device *dev = &pdev->dev; struct platform_data *pdata; - /* Initialize the per-package data structures */ + /* Initialize the per-zone data structures */ pdata = devm_kzalloc(dev, sizeof(struct platform_data), GFP_KERNEL); if (!pdata) return -ENOMEM; @@ -579,13 +579,13 @@ static struct platform_driver coretemp_driver = { static struct platform_device *coretemp_device_add(unsigned int cpu) { - int err, pkgid = topology_logical_die_id(cpu); + int err, zoneid = topology_logical_die_id(cpu); struct platform_device *pdev; - if (pkgid < 0) + if (zoneid < 0) return ERR_PTR(-ENOMEM); - pdev = platform_device_alloc(DRVNAME, pkgid); + pdev = platform_device_alloc(DRVNAME, zoneid); if (!pdev) return ERR_PTR(-ENOMEM); @@ -595,7 +595,7 @@ static struct platform_device *coretemp_device_add(unsigned int cpu) return ERR_PTR(err); } - pkg_devices[pkgid] = pdev; + zone_devices[zoneid] = pdev; return pdev; } @@ -703,7 +703,7 @@ static int coretemp_cpu_offline(unsigned int cpu) * the rest. 
*/ if (cpumask_empty(&pd->cpumask)) { - pkg_devices[topology_logical_die_id(cpu)] = NULL; + zone_devices[topology_logical_die_id(cpu)] = NULL; platform_device_unregister(pdev); return 0; } @@ -741,10 +741,10 @@ static int __init coretemp_init(void) if (!x86_match_cpu(coretemp_ids)) return -ENODEV; - max_packages = topology_max_packages() * topology_max_die_per_package(); - pkg_devices = kcalloc(max_packages, sizeof(struct platform_device *), + max_zones = topology_max_packages() * topology_max_die_per_package(); + zone_devices = kcalloc(max_zones, sizeof(struct platform_device *), GFP_KERNEL); - if (!pkg_devices) + if (!zone_devices) return -ENOMEM; err = platform_driver_register(&coretemp_driver); @@ -760,7 +760,7 @@ static int __init coretemp_init(void) outdrv: platform_driver_unregister(&coretemp_driver); - kfree(pkg_devices); + kfree(zone_devices); return err; } module_init(coretemp_init) @@ -769,7 +769,7 @@ static void __exit coretemp_exit(void) { cpuhp_remove_state(coretemp_hp_online); platform_driver_unregister(&coretemp_driver); - kfree(pkg_devices); + kfree(zone_devices); } module_exit(coretemp_exit)
[tip:x86/topology] thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages
Commit-ID: b2ce1c883df91a231f8138935167273c1767ad66 Gitweb: https://git.kernel.org/tip/b2ce1c883df91a231f8138935167273c1767ad66 Author: Len Brown AuthorDate: Mon, 13 May 2019 13:59:00 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:36 +0200 thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages Syntax update only -- no logical or functional change. In response to the new multi-die/package changes, update variable names to use the more generic thermal "zone" terminology, instead of "package", as the zones can refer to either packages or die. Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Cc: Zhang Rui Link: https://lkml.kernel.org/r/b65494a76be13481dc3a809c75debb2574c34eda.1557769318.git.len.br...@intel.com --- drivers/thermal/intel/x86_pkg_temp_thermal.c | 142 ++- 1 file changed, 72 insertions(+), 70 deletions(-) diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c b/drivers/thermal/intel/x86_pkg_temp_thermal.c index 405b3858900a..87e929ffb0cb 100644 --- a/drivers/thermal/intel/x86_pkg_temp_thermal.c +++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c @@ -55,7 +55,7 @@ MODULE_PARM_DESC(notify_delay_ms, */ #define MAX_NUMBER_OF_TRIPS2 -struct pkg_device { +struct zone_device { int cpu; boolwork_scheduled; u32 tj_max; @@ -70,10 +70,10 @@ static struct thermal_zone_params pkg_temp_tz_params = { .no_hwmon = true, }; -/* Keep track of how many package pointers we allocated in init() */ -static int max_packages __read_mostly; -/* Array of package pointers */ -static struct pkg_device **packages; +/* Keep track of how many zone pointers we allocated in init() */ +static int max_id __read_mostly; +/* Array of zone pointers */ +static struct zone_device **zones; /* Serializes interrupt notification, work and hotplug */ static DEFINE_SPINLOCK(pkg_temp_lock); /* Protects zone operation in the work function against hotplug removal */ @@ -120,12 +120,12 
@@ err_out: * * - Other callsites: Must hold pkg_temp_lock */ -static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu) +static struct zone_device *pkg_temp_thermal_get_dev(unsigned int cpu) { - int pkgid = topology_logical_die_id(cpu); + int id = topology_logical_die_id(cpu); - if (pkgid >= 0 && pkgid < max_packages) - return packages[pkgid]; + if (id >= 0 && id < max_id) + return zones[id]; return NULL; } @@ -150,12 +150,13 @@ static int get_tj_max(int cpu, u32 *tj_max) static int sys_get_curr_temp(struct thermal_zone_device *tzd, int *temp) { - struct pkg_device *pkgdev = tzd->devdata; + struct zone_device *zonedev = tzd->devdata; u32 eax, edx; - rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_STATUS, &eax, &edx); + rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_STATUS, + &eax, &edx); if (eax & 0x8000) { - *temp = pkgdev->tj_max - ((eax >> 16) & 0x7f) * 1000; + *temp = zonedev->tj_max - ((eax >> 16) & 0x7f) * 1000; pr_debug("sys_get_curr_temp %d\n", *temp); return 0; } @@ -165,7 +166,7 @@ static int sys_get_curr_temp(struct thermal_zone_device *tzd, int *temp) static int sys_get_trip_temp(struct thermal_zone_device *tzd, int trip, int *temp) { - struct pkg_device *pkgdev = tzd->devdata; + struct zone_device *zonedev = tzd->devdata; unsigned long thres_reg_value; u32 mask, shift, eax, edx; int ret; @@ -181,14 +182,14 @@ static int sys_get_trip_temp(struct thermal_zone_device *tzd, shift = THERM_SHIFT_THRESHOLD0; } - ret = rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT, + ret = rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT, &eax, &edx); if (ret < 0) return ret; thres_reg_value = (eax & mask) >> shift; if (thres_reg_value) - *temp = pkgdev->tj_max - thres_reg_value * 1000; + *temp = zonedev->tj_max - thres_reg_value * 1000; else *temp = 0; pr_debug("sys_get_trip_temp %d\n", *temp); @@ -199,14 +200,14 @@ static int sys_get_trip_temp(struct thermal_zone_device *tzd, static int sys_set_trip_temp(struct thermal_zone_device 
*tzd, int trip, int temp) { - struct pkg_device *pkgdev = tzd->devdata; + struct zone_device *zonedev = tzd->devdata; u32 l, h, mask, shift, intr; int r
[tip:x86/topology] topology: Create core_cpus and die_cpus sysfs attributes
Commit-ID: 2e4c54dac7b360c3820399bdf06cde9134a4495b Gitweb: https://git.kernel.org/tip/2e4c54dac7b360c3820399bdf06cde9134a4495b Author: Len Brown AuthorDate: Mon, 13 May 2019 13:58:56 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:34 +0200 topology: Create core_cpus and die_cpus sysfs attributes Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list". These attributes represent all of the logical CPUs that share the same core. These attributes are synonymous with the existing "thread_siblings" and "thread_siblings_list" attributes, which will be deprecated. Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list". These attributes represent all of the logical CPUs that share the same die. Suggested-by: Brice Goglin Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/071c23a298cd27ede6ed0b6460cae190d193364f.1557769318.git.len.br...@intel.com --- Documentation/cputopology.txt | 21 +++-- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 22 ++ arch/x86/xen/smp_pv.c | 1 + drivers/base/topology.c | 12 include/linux/topology.h| 3 +++ 7 files changed, 55 insertions(+), 6 deletions(-) diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index 48af5c290e20..b90dafcc8237 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -36,15 +36,15 @@ drawer_id: identifier (rather than the kernel's). The actual value is architecture and platform dependent. -thread_siblings: +core_cpus: - internal kernel map of cpuX's hardware threads within the same - core as cpuX. + internal kernel map of CPUs within the same core. + (deprecated name: "thread_siblings") -thread_siblings_list: +core_cpus_list: - human-readable list of cpuX's hardware threads within the same - core as cpuX. + human-readable list of CPUs within the same core.
+ (deprecated name: "thread_siblings_list"); package_cpus: @@ -56,6 +56,14 @@ package_cpus_list: human-readable list of CPUs sharing the same physical_package_id. (deprecated name: "core_siblings_list") +die_cpus: + + internal kernel map of CPUs within the same die. + +die_cpus_list: + + human-readable list of CPUs within the same die. + book_siblings: internal kernel map of cpuX's hardware threads within the same @@ -93,6 +101,7 @@ these macros in include/asm-XXX/topology.h:: #define topology_drawer_id(cpu) #define topology_sibling_cpumask(cpu) #define topology_core_cpumask(cpu) + #define topology_die_cpumask(cpu) #define topology_book_cpumask(cpu) #define topology_drawer_cpumask(cpu) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index da545df207b2..b673a226ad6c 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -23,6 +23,7 @@ extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); +DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 9de16b4f6023..4b14d2318251 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id) #ifdef CONFIG_SMP +#define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu)) #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index a6e01b6c2709..1a19a5171949 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -91,6 +91,10 @@ 
EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); +/* representing HT, core, and die siblings of each logical CPU */ +DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map); +EXPORT_PER_CPU_SYMBOL(cpu_die_map); + DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); /* Per CPU bogomips and other parameters */ @@ -509,6 +513,15 @@ static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) return false; } +static bool match_die(struct cp
[tip:x86/topology] topology: Create package_cpus sysfs attribute
Commit-ID: b73ed8dc0597c11ec5064d06b9bbd4e541b6d4e7 Gitweb: https://git.kernel.org/tip/b73ed8dc0597c11ec5064d06b9bbd4e541b6d4e7 Author: Len Brown AuthorDate: Mon, 13 May 2019 13:58:55 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:34 +0200 topology: Create package_cpus sysfs attribute The existing sysfs cpu/topology/core_siblings (and core_siblings_list) attributes are documented, implemented, and used by programs to represent set of logical CPUs sharing the same package. This makes sense if the next topology level above a core is always a package. But on systems where there is a die topology level between a core and a package, the name and its definition become inconsistent. So without changing its function, add a name for this map that describes what it actually is -- package CPUs -- the set of CPUs that share the same package. This new name will be immune to changes in topology, since it describes threads at the current level, not siblings at a contained level. Suggested-by: Brice Goglin Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/d9d3228b82fb5665e6f93a0ccd033fe022558521.1557769318.git.len.br...@intel.com --- Documentation/cputopology.txt | 12 ++-- drivers/base/topology.c | 6 ++ 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index 2ff8a1e9a2db..48af5c290e20 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -46,15 +46,15 @@ thread_siblings_list: human-readable list of cpuX's hardware threads within the same core as cpuX. -core_siblings: +package_cpus: - internal kernel map of cpuX's hardware threads within the same - physical_package_id. + internal kernel map of the CPUs sharing the same physical_package_id. 
+ (deprecated name: "core_siblings") -core_siblings_list: +package_cpus_list: - human-readable list of cpuX's hardware threads within the same - physical_package_id. + human-readable list of CPUs sharing the same physical_package_id. + (deprecated name: "core_siblings_list") book_siblings: diff --git a/drivers/base/topology.c b/drivers/base/topology.c index 50352cf96f85..dc3c19b482f3 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -57,6 +57,10 @@ define_siblings_show_func(core_siblings, core_cpumask); static DEVICE_ATTR_RO(core_siblings); static DEVICE_ATTR_RO(core_siblings_list); +define_siblings_show_func(package_cpus, core_cpumask); +static DEVICE_ATTR_RO(package_cpus); +static DEVICE_ATTR_RO(package_cpus_list); + #ifdef CONFIG_SCHED_BOOK define_id_show_func(book_id); static DEVICE_ATTR_RO(book_id); @@ -81,6 +85,8 @@ static struct attribute *default_attrs[] = { &dev_attr_thread_siblings_list.attr, &dev_attr_core_siblings.attr, &dev_attr_core_siblings_list.attr, + &dev_attr_package_cpus.attr, + &dev_attr_package_cpus_list.attr, #ifdef CONFIG_SCHED_BOOK &dev_attr_book_id.attr, &dev_attr_book_siblings.attr,
[tip:x86/topology] x86/topology: Define topology_logical_die_id()
Commit-ID: 212bf4fdb7f9eeeb99afd97ebad677d43e7b55ac Gitweb: https://git.kernel.org/tip/212bf4fdb7f9eeeb99afd97ebad677d43e7b55ac Author: Len Brown AuthorDate: Mon, 13 May 2019 13:58:49 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:32 +0200 x86/topology: Define topology_logical_die_id() Define topology_logical_die_id() ala existing topology_logical_package_id() Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Tested-by: Zhang Rui Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/2f3526e25ae14fbeff26fb26e877d159df8946d9.1557769318.git.len.br...@intel.com --- arch/x86/include/asm/processor.h | 1 + arch/x86/include/asm/topology.h | 5 + arch/x86/kernel/cpu/common.c | 1 + arch/x86/kernel/smpboot.c| 45 4 files changed, 52 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 7c17343946dd..6aba36bde57f 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -118,6 +118,7 @@ struct cpuinfo_x86 { /* Core id: */ u16 cpu_core_id; u16 cpu_die_id; + u16 logical_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 3777dbe9c0ff..9de16b4f6023 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id) #define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id) +#define topology_logical_die_id(cpu) (cpu_data(cpu).logical_die_id) #define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id) #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id) @@ -131,13 +132,17 @@ static inline int topology_max_smt_threads(void) } int topology_update_package_map(unsigned int apicid, unsigned int cpu); +int topology_update_die_map(unsigned int dieid, unsigned int cpu); int 
topology_phys_to_logical_pkg(unsigned int pkg); +int topology_phys_to_logical_die(unsigned int die, unsigned int cpu); bool topology_is_primary_thread(unsigned int cpu); bool topology_smt_supported(void); #else #define topology_max_packages()(1) static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } +static inline int +topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; } static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } static inline int topology_phys_to_logical_die(unsigned int die, unsigned int cpu) { return 0; } diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index d7f55ad2dfb1..8cdca1223b0f 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1298,6 +1298,7 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c) cpu, apicid, c->initial_apicid); } BUG_ON(topology_update_package_map(c->phys_proc_id, cpu)); + BUG_ON(topology_update_die_map(c->cpu_die_id, cpu)); #else c->logical_proc_id = 0; #endif diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 40ffe23249c0..a6e01b6c2709 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -101,6 +101,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info); unsigned int __max_logical_packages __read_mostly; EXPORT_SYMBOL(__max_logical_packages); static unsigned int logical_packages __read_mostly; +static unsigned int logical_die __read_mostly; /* Maximum number of SMT threads on any online core */ int __read_mostly __max_smt_threads = 1; @@ -302,6 +303,26 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg) return -1; } EXPORT_SYMBOL(topology_phys_to_logical_pkg); +/** + * topology_phys_to_logical_die - Map a physical die id to logical + * + * Returns logical die id or -1 if not found + */ +int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu) +{ + int cpu; + int proc_id = cpu_data(cur_cpu).phys_proc_id; + + 
for_each_possible_cpu(cpu) { + struct cpuinfo_x86 *c = &cpu_data(cpu); + + if (c->initialized && c->cpu_die_id == die_id && + c->phys_proc_id == proc_id) + return c->logical_die_id; + } + return -1; +} +EXPORT_SYMBOL(topology_phys_to_logical_die); /** * topology_update_package_map - Update the physical to logical package map @@ -326,6 +347,29 @@ found: cpu_data(cpu).logical_proc_id = new; return 0; } +/** + * topo
[tip:x86/topology] x86/topology: Define topology_die_id()
Commit-ID: 306a0de329f77537f29022c2982006f9145d782d Gitweb: https://git.kernel.org/tip/306a0de329f77537f29022c2982006f9145d782d Author: Len Brown AuthorDate: Mon, 13 May 2019 13:58:48 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:31 +0200 x86/topology: Define topology_die_id() topology_die_id(cpu) is a simple macro for use inside the kernel to get the die_id associated with the given cpu. Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/6463bc422b1b05445a502dc505c1d7c6756bda6a.1557769318.git.len.br...@intel.com --- arch/x86/include/asm/topology.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index e0232f7042c3..3777dbe9c0ff 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id) #define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id) +#define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id) #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id) #ifdef CONFIG_SMP
[tip:x86/topology] cpu/topology: Export die_id
Commit-ID: 0e344d8c709fe01d882fc0fb5452bedfe5eba67a Gitweb: https://git.kernel.org/tip/0e344d8c709fe01d882fc0fb5452bedfe5eba67a Author: Len Brown AuthorDate: Mon, 13 May 2019 13:58:47 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:31 +0200 cpu/topology: Export die_id Export die_id in cpu topology, for the benefit of hardware that has multiple-die/package. Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Cc: linux-...@vger.kernel.org Link: https://lkml.kernel.org/r/e7d1caaf4fbd24ee40db6d557ab28d7d83298900.1557769318.git.len.br...@intel.com --- Documentation/cputopology.txt | 15 --- drivers/base/topology.c | 4 include/linux/topology.h | 3 +++ 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index cb61277e2308..2ff8a1e9a2db 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -12,6 +12,12 @@ physical_package_id: socket number, but the actual value is architecture and platform dependent. +die_id: + + the CPU die ID of cpuX. Typically it is the hardware platform's + identifier (rather than the kernel's). The actual value is + architecture and platform dependent. + core_id: the CPU core ID of cpuX. 
Typically it is the hardware platform's @@ -81,6 +87,7 @@ For an architecture to support this feature, it must define some of these macros in include/asm-XXX/topology.h:: #define topology_physical_package_id(cpu) + #define topology_die_id(cpu) #define topology_core_id(cpu) #define topology_book_id(cpu) #define topology_drawer_id(cpu) @@ -99,9 +106,11 @@ provides default definitions for any of the above macros that are not defined by include/asm-XXX/topology.h: 1) topology_physical_package_id: -1 -2) topology_core_id: 0 -3) topology_sibling_cpumask: just the given CPU -4) topology_core_cpumask: just the given CPU +2) topology_die_id: -1 +3) topology_core_id: 0 +4) topology_sibling_cpumask: just the given CPU +5) topology_core_cpumask: just the given CPU +6) topology_die_cpumask: just the given CPU For architectures that don't support books (CONFIG_SCHED_BOOK) there are no default definitions for topology_book_id() and topology_book_cpumask(). diff --git a/drivers/base/topology.c b/drivers/base/topology.c index 5fd9f167ecc1..50352cf96f85 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -43,6 +43,9 @@ static ssize_t name##_list_show(struct device *dev, \ define_id_show_func(physical_package_id); static DEVICE_ATTR_RO(physical_package_id); +define_id_show_func(die_id); +static DEVICE_ATTR_RO(die_id); + define_id_show_func(core_id); static DEVICE_ATTR_RO(core_id); @@ -72,6 +75,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list); static struct attribute *default_attrs[] = { &dev_attr_physical_package_id.attr, + &dev_attr_die_id.attr, &dev_attr_core_id.attr, &dev_attr_thread_siblings.attr, &dev_attr_thread_siblings_list.attr, diff --git a/include/linux/topology.h b/include/linux/topology.h index cb0775e1ee4b..5cc8595dd0e4 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -184,6 +184,9 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif 
+#ifndef topology_die_id
+#define topology_die_id(cpu)	((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)	((void)(cpu), 0)
 #endif
[tip:x86/topology] x86/topology: Create topology_max_die_per_package()
Commit-ID: 14d96d6c06b5d8116b8d52c9c5530f5528ef1e61 Gitweb: https://git.kernel.org/tip/14d96d6c06b5d8116b8d52c9c5530f5528ef1e61 Author: Len Brown AuthorDate: Mon, 13 May 2019 13:58:46 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:30 +0200 x86/topology: Create topology_max_die_per_package() topology_max_packages() is available to size resources to cover all packages in the system. But now multi-die/package systems are coming up, and some resources are per-die. Create topology_max_die_per_package(), for detecting multi-die/package systems, and sizing any per-die resources. Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/e6eaf384571ae52ac7d0ca41510b7fb7d2fda0e4.1557769318.git.len.br...@intel.com --- arch/x86/include/asm/processor.h | 1 - arch/x86/include/asm/topology.h | 10 ++ arch/x86/kernel/cpu/topology.c | 5 - 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index ef0a44fccaec..7c17343946dd 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -106,7 +106,6 @@ struct cpuinfo_x86 { unsigned long loops_per_jiffy; /* cpuid returned max cores value: */ u16 x86_max_cores; - u16 x86_max_dies; u16 apicid; u16 initial_apicid; u16 x86_clflush_size; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38a1c33..e0232f7042c3 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -115,6 +115,13 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) +extern unsigned int __max_die_per_package; + +static inline int topology_max_die_per_package(void) +{ + return __max_die_per_package; +} + extern int __max_smt_threads; static inline int topology_max_smt_threads(void) @@ -131,6 
+138,9 @@ bool topology_smt_supported(void); static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } +static inline int topology_phys_to_logical_die(unsigned int die, + unsigned int cpu) { return 0; } +static inline int topology_max_die_per_package(void) { return 1; } static inline int topology_max_smt_threads(void) { return 1; } static inline bool topology_is_primary_thread(unsigned int cpu) { return true; } static inline bool topology_smt_supported(void) { return false; } diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 4d17e699657d..ee48c3fc8a65 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -26,6 +26,9 @@ #define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0x) #ifdef CONFIG_SMP +unsigned int __max_die_per_package __read_mostly = 1; +EXPORT_SYMBOL(__max_die_per_package); + /* * Check if given CPUID extended toplogy "leaf" is implemented */ @@ -146,7 +149,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c) c->apicid = apic->phys_pkg_id(c->initial_apicid, 0); c->x86_max_cores = (core_level_siblings / smp_num_siblings); - c->x86_max_dies = (die_level_siblings / core_level_siblings); + __max_die_per_package = (die_level_siblings / core_level_siblings); #endif return 0; }
[tip:x86/topology] x86/topology: Add CPUID.1F multi-die/package support
Commit-ID: 7745f03eb39587dd15a1fb26e6223678b8e906d2 Gitweb: https://git.kernel.org/tip/7745f03eb39587dd15a1fb26e6223678b8e906d2 Author: Len Brown AuthorDate: Mon, 13 May 2019 13:58:45 -0400 Committer: Thomas Gleixner CommitDate: Thu, 23 May 2019 10:08:30 +0200 x86/topology: Add CPUID.1F multi-die/package support Some new systems have multiple software-visible die within each package. Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle either CPUID.B, or the new CPUID.1F. Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result. die_id will be non-zero only for multi-die/package systems. Signed-off-by: Len Brown Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Cc: linux-...@vger.kernel.org Link: https://lkml.kernel.org/r/7b23d2d26d717b8e14ba137c94b70943f1ae4b5c.1557769318.git.len.br...@intel.com --- Documentation/x86/topology.rst | 4 ++ arch/x86/include/asm/processor.h | 4 +- arch/x86/kernel/cpu/topology.c | 85 +++- arch/x86/kernel/smpboot.c| 2 + 4 files changed, 75 insertions(+), 20 deletions(-) diff --git a/Documentation/x86/topology.rst b/Documentation/x86/topology.rst index 6e28dbe818ab..8e9704f61017 100644 --- a/Documentation/x86/topology.rst +++ b/Documentation/x86/topology.rst @@ -49,6 +49,10 @@ Package-related topology information in the kernel: The number of cores in a package. This information is retrieved via CPUID. + - cpuinfo_x86.x86_max_dies: + +The number of dies in a package. This information is retrieved via CPUID. + - cpuinfo_x86.phys_proc_id: The physical ID of the package. 
This information is retrieved via CPUID diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index c34a35c78618..ef0a44fccaec 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -105,7 +105,8 @@ struct cpuinfo_x86 { int x86_power; unsigned long loops_per_jiffy; /* cpuid returned max cores value: */ - u16 x86_max_cores; + u16 x86_max_cores; + u16 x86_max_dies; u16 apicid; u16 initial_apicid; u16 x86_clflush_size; @@ -117,6 +118,7 @@ struct cpuinfo_x86 { u16 logical_proc_id; /* Core id: */ u16 cpu_core_id; + u16 cpu_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 8f6c784141d1..4d17e699657d 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -15,33 +15,63 @@ /* leaf 0xb SMT level */ #define SMT_LEVEL 0 -/* leaf 0xb sub-leaf types */ +/* extended topology sub-leaf types */ #define INVALID_TYPE 0 #define SMT_TYPE 1 #define CORE_TYPE 2 +#define DIE_TYPE 5 #define LEAFB_SUBTYPE(ecx) (((ecx) >> 8) & 0xff) #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f) #define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0x) -int detect_extended_topology_early(struct cpuinfo_x86 *c) -{ #ifdef CONFIG_SMP +/* + * Check if given CPUID extended toplogy "leaf" is implemented + */ +static int check_extended_topology_leaf(int leaf) +{ unsigned int eax, ebx, ecx, edx; - if (c->cpuid_level < 0xb) + cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + + if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE)) return -1; - cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + return 0; +} +/* + * Return best CPUID Extended Toplogy Leaf supported + */ +static int detect_extended_topology_leaf(struct cpuinfo_x86 *c) +{ + if (c->cpuid_level >= 0x1f) { + if (check_extended_topology_leaf(0x1f) == 0) + return 0x1f; + } - /* -* check if the cpuid leaf 0xb is actually implemented. 
-*/ - if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE)) + if (c->cpuid_level >= 0xb) { + if (check_extended_topology_leaf(0xb) == 0) + return 0xb; + } + + return -1; +} +#endif + +int detect_extended_topology_early(struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + unsigned int eax, ebx, ecx, edx; + int leaf; + + leaf = detect_extended_topology_leaf(c); + if (leaf < 0) return -1; set_cpu_cap(c, X86_FEATURE_XTOPOLOGY); + cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx); /* * initial apic id, which also represents 32-bit
[PATCH 05/19] x86 topology: Define topology_logical_die_id()
From: Len Brown Define topology_logical_die_id() ala existing topology_logical_package_id() Tested-by: Zhang Rui Signed-off-by: Len Brown --- arch/x86/include/asm/processor.h | 1 + arch/x86/include/asm/topology.h | 5 arch/x86/kernel/cpu/common.c | 1 + arch/x86/kernel/smpboot.c| 45 4 files changed, 52 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 87d42c0c6ccc..603ce3fe578d 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -118,6 +118,7 @@ struct cpuinfo_x86 { /* Core id: */ u16 cpu_core_id; u16 cpu_die_id; + u16 logical_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 3777dbe9c0ff..9de16b4f6023 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id) #define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id) +#define topology_logical_die_id(cpu) (cpu_data(cpu).logical_die_id) #define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id) #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id) @@ -131,13 +132,17 @@ static inline int topology_max_smt_threads(void) } int topology_update_package_map(unsigned int apicid, unsigned int cpu); +int topology_update_die_map(unsigned int dieid, unsigned int cpu); int topology_phys_to_logical_pkg(unsigned int pkg); +int topology_phys_to_logical_die(unsigned int die, unsigned int cpu); bool topology_is_primary_thread(unsigned int cpu); bool topology_smt_supported(void); #else #define topology_max_packages()(1) static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } +static inline int +topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; } static inline int 
topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } static inline int topology_phys_to_logical_die(unsigned int die, unsigned int cpu) { return 0; } diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 8739bdfe9bdf..48de7f069ac7 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1277,6 +1277,7 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c) cpu, apicid, c->initial_apicid); } BUG_ON(topology_update_package_map(c->phys_proc_id, cpu)); + BUG_ON(topology_update_die_map(c->cpu_die_id, cpu)); #else c->logical_proc_id = 0; #endif diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 40ffe23249c0..a6e01b6c2709 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -101,6 +101,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info); unsigned int __max_logical_packages __read_mostly; EXPORT_SYMBOL(__max_logical_packages); static unsigned int logical_packages __read_mostly; +static unsigned int logical_die __read_mostly; /* Maximum number of SMT threads on any online core */ int __read_mostly __max_smt_threads = 1; @@ -302,6 +303,26 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg) return -1; } EXPORT_SYMBOL(topology_phys_to_logical_pkg); +/** + * topology_phys_to_logical_die - Map a physical die id to logical + * + * Returns logical die id or -1 if not found + */ +int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu) +{ + int cpu; + int proc_id = cpu_data(cur_cpu).phys_proc_id; + + for_each_possible_cpu(cpu) { + struct cpuinfo_x86 *c = &cpu_data(cpu); + + if (c->initialized && c->cpu_die_id == die_id && + c->phys_proc_id == proc_id) + return c->logical_die_id; + } + return -1; +} +EXPORT_SYMBOL(topology_phys_to_logical_die); /** * topology_update_package_map - Update the physical to logical package map @@ -326,6 +347,29 @@ int topology_update_package_map(unsigned int pkg, unsigned int cpu) cpu_data(cpu).logical_proc_id = new; return 0; } +/** 
+ * topology_update_die_map - Update the physical to logical die map + * @die: The die id as retrieved via CPUID + * @cpu: The cpu for which this is updated + */ +int topology_update_die_map(unsigned int die, unsigned int cpu) +{ + int new; + + /* Already available somewhere? */ + new = topology_phys_to_logical_die(die, cpu); + if (new >= 0) + goto found; + + new = logical_die++; + if
[PATCH 03/19] cpu topology: Export die_id
From: Len Brown Export die_id in cpu topology, for the benefit of hardware that has multiple-die/package. Signed-off-by: Len Brown Cc: linux-...@vger.kernel.org --- Documentation/cputopology.txt | 15 --- drivers/base/topology.c | 4 include/linux/topology.h | 3 +++ 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index cb61277e2308..2ff8a1e9a2db 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -12,6 +12,12 @@ physical_package_id: socket number, but the actual value is architecture and platform dependent. +die_id: + + the CPU die ID of cpuX. Typically it is the hardware platform's + identifier (rather than the kernel's). The actual value is + architecture and platform dependent. + core_id: the CPU core ID of cpuX. Typically it is the hardware platform's @@ -81,6 +87,7 @@ For an architecture to support this feature, it must define some of these macros in include/asm-XXX/topology.h:: #define topology_physical_package_id(cpu) + #define topology_die_id(cpu) #define topology_core_id(cpu) #define topology_book_id(cpu) #define topology_drawer_id(cpu) @@ -99,9 +106,11 @@ provides default definitions for any of the above macros that are not defined by include/asm-XXX/topology.h: 1) topology_physical_package_id: -1 -2) topology_core_id: 0 -3) topology_sibling_cpumask: just the given CPU -4) topology_core_cpumask: just the given CPU +2) topology_die_id: -1 +3) topology_core_id: 0 +4) topology_sibling_cpumask: just the given CPU +5) topology_core_cpumask: just the given CPU +6) topology_die_cpumask: just the given CPU For architectures that don't support books (CONFIG_SCHED_BOOK) there are no default definitions for topology_book_id() and topology_book_cpumask(). 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c index 5fd9f167ecc1..50352cf96f85 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -43,6 +43,9 @@ static ssize_t name##_list_show(struct device *dev, \ define_id_show_func(physical_package_id); static DEVICE_ATTR_RO(physical_package_id); +define_id_show_func(die_id); +static DEVICE_ATTR_RO(die_id); + define_id_show_func(core_id); static DEVICE_ATTR_RO(core_id); @@ -72,6 +75,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list); static struct attribute *default_attrs[] = { &dev_attr_physical_package_id.attr, + &dev_attr_die_id.attr, &dev_attr_core_id.attr, &dev_attr_thread_siblings.attr, &dev_attr_thread_siblings_list.attr, diff --git a/include/linux/topology.h b/include/linux/topology.h index cb0775e1ee4b..5cc8595dd0e4 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -184,6 +184,9 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif +#ifndef topology_die_id +#define topology_die_id(cpu) ((void)(cpu), -1) +#endif #ifndef topology_core_id #define topology_core_id(cpu) ((void)(cpu), 0) #endif -- 2.18.0-rc0
[PATCH 06/19] powercap/intel_rapl: Simplify rapl_find_package()
From: Zhang Rui Syntax only, no functional or semantic change. Simplify how the code to discover a package is called. Rename find_package_by_id() to rapl_find_package_domain() Signed-off-by: Zhang Rui Signed-off-by: Len Brown Acked-by: Rafael J. Wysocki Cc: linux...@vger.kernel.org --- drivers/powercap/intel_rapl.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 4347f15165f8..3c3c0c23180b 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -264,8 +264,9 @@ static struct powercap_control_type *control_type; /* PowerCap Controller */ static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */ /* caller to ensure CPU hotplug lock is held */ -static struct rapl_package *find_package_by_id(int id) +static struct rapl_package *rapl_find_package_domain(int cpu) { + int id = topology_physical_package_id(cpu); struct rapl_package *rp; list_for_each_entry(rp, &rapl_packages, plist) { @@ -1300,7 +1301,7 @@ static int __init rapl_register_psys(void) rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; rd->rpl[1].name = pl2_name; - rd->rp = find_package_by_id(0); + rd->rp = rapl_find_package_domain(0); power_zone = powercap_register_zone(&rd->power_zone, control_type, "psys", NULL, @@ -1456,8 +1457,9 @@ static void rapl_remove_package(struct rapl_package *rp) } /* called from CPU hotplug notifier, hotplug lock held */ -static struct rapl_package *rapl_add_package(int cpu, int pkgid) +static struct rapl_package *rapl_add_package(int cpu) { + int id = topology_physical_package_id(cpu); struct rapl_package *rp; int ret; @@ -1466,7 +1468,7 @@ static struct rapl_package *rapl_add_package(int cpu, int pkgid) return ERR_PTR(-ENOMEM); /* add the new package to the list */ - rp->id = pkgid; + rp->id = id; rp->lead_cpu = cpu; /* check if the package contains valid domains */ @@ -1497,12 +1499,11 @@ static struct rapl_package 
*rapl_add_package(int cpu, int pkgid) */ static int rapl_cpu_online(unsigned int cpu) { - int pkgid = topology_physical_package_id(cpu); struct rapl_package *rp; - rp = find_package_by_id(pkgid); + rp = rapl_find_package_domain(cpu); if (!rp) { - rp = rapl_add_package(cpu, pkgid); + rp = rapl_add_package(cpu); if (IS_ERR(rp)) return PTR_ERR(rp); } @@ -1512,11 +1513,10 @@ static int rapl_cpu_online(unsigned int cpu) static int rapl_cpu_down_prep(unsigned int cpu) { - int pkgid = topology_physical_package_id(cpu); struct rapl_package *rp; int lead_cpu; - rp = find_package_by_id(pkgid); + rp = rapl_find_package_domain(cpu); if (!rp) return 0; -- 2.18.0-rc0
[PATCH 08/19] thermal/x86_pkg_temp_thermal: Support multi-die/package
From: Zhang Rui Package temperature sensors are actually implemented in hardware per-die. Thus, the new multi-die/package systems sport mulitple package thermal zones for each package. Update the x86_pkg_temp_thermal to be "multi-die-aware", so it can expose multiple zones per package, instead of just one. Signed-off-by: Zhang Rui Signed-off-by: Len Brown --- drivers/thermal/intel/x86_pkg_temp_thermal.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c b/drivers/thermal/intel/x86_pkg_temp_thermal.c index 1ef937d799e4..405b3858900a 100644 --- a/drivers/thermal/intel/x86_pkg_temp_thermal.c +++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c @@ -122,7 +122,7 @@ static int pkg_temp_debugfs_init(void) */ static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu) { - int pkgid = topology_logical_package_id(cpu); + int pkgid = topology_logical_die_id(cpu); if (pkgid >= 0 && pkgid < max_packages) return packages[pkgid]; @@ -353,7 +353,7 @@ static int pkg_thermal_notify(u64 msr_val) static int pkg_temp_thermal_device_add(unsigned int cpu) { - int pkgid = topology_logical_package_id(cpu); + int pkgid = topology_logical_die_id(cpu); u32 tj_max, eax, ebx, ecx, edx; struct pkg_device *pkgdev; int thres_count, err; @@ -449,7 +449,7 @@ static int pkg_thermal_cpu_offline(unsigned int cpu) * worker will see the package anymore. */ if (lastcpu) { - packages[topology_logical_package_id(cpu)] = NULL; + packages[topology_logical_die_id(cpu)] = NULL; /* After this point nothing touches the MSR anymore. 
*/ wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT, pkgdev->msr_pkg_therm_low, pkgdev->msr_pkg_therm_high); @@ -515,7 +515,7 @@ static int __init pkg_temp_thermal_init(void) if (!x86_match_cpu(pkg_temp_thermal_ids)) return -ENODEV; - max_packages = topology_max_packages(); + max_packages = topology_max_packages() * topology_max_die_per_package(); packages = kcalloc(max_packages, sizeof(struct pkg_device *), GFP_KERNEL); if (!packages) -- 2.18.0-rc0
[PATCH 02/19] x86 topology: Create topology_max_die_per_package()
From: Len Brown topology_max_packages() is available to size resources to cover all packages in the system. But now we have multi-die/package systems, and some resources are per-die. Create topology_max_die_per_package(), for detecting multi-die/package systems, and sizing any per-die resources. Signed-off-by: Len Brown --- arch/x86/include/asm/processor.h | 1 - arch/x86/include/asm/topology.h | 10 ++ arch/x86/kernel/cpu/topology.c | 5 - 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 00fc03a8da59..87d42c0c6ccc 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -106,7 +106,6 @@ struct cpuinfo_x86 { unsigned long loops_per_jiffy; /* cpuid returned max cores value: */ u16 x86_max_cores; - u16 x86_max_dies; u16 apicid; u16 initial_apicid; u16 x86_clflush_size; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38a1c33..e0232f7042c3 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -115,6 +115,13 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) +extern unsigned int __max_die_per_package; + +static inline int topology_max_die_per_package(void) +{ + return __max_die_per_package; +} + extern int __max_smt_threads; static inline int topology_max_smt_threads(void) @@ -131,6 +138,9 @@ bool topology_smt_supported(void); static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } +static inline int topology_phys_to_logical_die(unsigned int die, + unsigned int cpu) { return 0; } +static inline int topology_max_die_per_package(void) { return 1; } static inline int topology_max_smt_threads(void) { return 1; } static inline bool 
topology_is_primary_thread(unsigned int cpu) { return true; } static inline bool topology_smt_supported(void) { return false; } diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 4d17e699657d..ee48c3fc8a65 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -26,6 +26,9 @@ #define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0x) #ifdef CONFIG_SMP +unsigned int __max_die_per_package __read_mostly = 1; +EXPORT_SYMBOL(__max_die_per_package); + /* * Check if given CPUID extended toplogy "leaf" is implemented */ @@ -146,7 +149,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c) c->apicid = apic->phys_pkg_id(c->initial_apicid, 0); c->x86_max_cores = (core_level_siblings / smp_num_siblings); - c->x86_max_dies = (die_level_siblings / core_level_siblings); + __max_die_per_package = (die_level_siblings / core_level_siblings); #endif return 0; } -- 2.18.0-rc0
[PATCH 12/19] topology: Create core_cpus and die_cpus sysfs attributes
From: Len Brown Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list" These attributes represent all of the logical CPUs that share the same core. These attriutes is synonymous with the existing "thread_siblings" and "thread_siblings_list" attribute, which will be deprecated. Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list". These attributes represent all of the logical CPUs that share the same die. Signed-off-by: Len Brown Suggested-by: Brice Goglin --- Documentation/cputopology.txt | 21 +++-- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 22 ++ arch/x86/xen/smp_pv.c | 1 + drivers/base/topology.c | 12 include/linux/topology.h| 3 +++ 7 files changed, 55 insertions(+), 6 deletions(-) diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index 48af5c290e20..b90dafcc8237 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -36,15 +36,15 @@ drawer_id: identifier (rather than the kernel's). The actual value is architecture and platform dependent. -thread_siblings: +core_cpus: - internal kernel map of cpuX's hardware threads within the same - core as cpuX. + internal kernel map of CPUs within the same core. + (deprecated name: "thread_siblings") -thread_siblings_list: +core_cpus_list: - human-readable list of cpuX's hardware threads within the same - core as cpuX. + human-readable list of CPUs within the same core. + (deprecated name: "thread_siblings_list"); package_cpus: @@ -56,6 +56,14 @@ package_cpus_list: human-readable list of CPUs sharing the same physical_package_id. (deprecated name: "core_siblings_list") +die_cpus: + + internal kernel map of CPUs within the same die. + +die_cpus_list: + + human-readable list of CPUs within the same die. 
+ book_siblings: internal kernel map of cpuX's hardware threads within the same @@ -93,6 +101,7 @@ these macros in include/asm-XXX/topology.h:: #define topology_drawer_id(cpu) #define topology_sibling_cpumask(cpu) #define topology_core_cpumask(cpu) + #define topology_die_cpumask(cpu) #define topology_book_cpumask(cpu) #define topology_drawer_cpumask(cpu) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index da545df207b2..b673a226ad6c 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -23,6 +23,7 @@ extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); +DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 9de16b4f6023..4b14d2318251 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id) #ifdef CONFIG_SMP +#define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu)) #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index a6e01b6c2709..1a19a5171949 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -91,6 +91,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); +/* representing HT, core, and die siblings of each logical CPU */ +DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map); +EXPORT_PER_CPU_SYMBOL(cpu_die_map); + DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, 
cpu_llc_shared_map); /* Per CPU bogomips and other parameters */ @@ -509,6 +513,15 @@ static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) return false; } +static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) +{ + if ((c->phys_proc_id == o->phys_proc_id) && + (c->cpu_die_id == o->cpu_die_id)) + return true; + return false; +} + + #if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC) static inline int x86_sched_itmt_flags(void) { @@ -571,6 +584,7 @@ void set_cpu_sibling_map(int cpu) cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)
[PATCH 16/19] thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages
From: Len Brown Syntax update only -- no logical or functional change. In response to the new multi-die/package changes, update variable names to use the more generic thermal "zone" terminology, instead of "package", as the zones can refer to either packages or die. Signed-off-by: Len Brown Cc: Zhang Rui --- drivers/thermal/intel/x86_pkg_temp_thermal.c | 142 ++- 1 file changed, 72 insertions(+), 70 deletions(-) diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c b/drivers/thermal/intel/x86_pkg_temp_thermal.c index 405b3858900a..87e929ffb0cb 100644 --- a/drivers/thermal/intel/x86_pkg_temp_thermal.c +++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c @@ -55,7 +55,7 @@ MODULE_PARM_DESC(notify_delay_ms, */ #define MAX_NUMBER_OF_TRIPS2 -struct pkg_device { +struct zone_device { int cpu; boolwork_scheduled; u32 tj_max; @@ -70,10 +70,10 @@ static struct thermal_zone_params pkg_temp_tz_params = { .no_hwmon = true, }; -/* Keep track of how many package pointers we allocated in init() */ -static int max_packages __read_mostly; -/* Array of package pointers */ -static struct pkg_device **packages; +/* Keep track of how many zone pointers we allocated in init() */ +static int max_id __read_mostly; +/* Array of zone pointers */ +static struct zone_device **zones; /* Serializes interrupt notification, work and hotplug */ static DEFINE_SPINLOCK(pkg_temp_lock); /* Protects zone operation in the work function against hotplug removal */ @@ -120,12 +120,12 @@ static int pkg_temp_debugfs_init(void) * * - Other callsites: Must hold pkg_temp_lock */ -static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu) +static struct zone_device *pkg_temp_thermal_get_dev(unsigned int cpu) { - int pkgid = topology_logical_die_id(cpu); + int id = topology_logical_die_id(cpu); - if (pkgid >= 0 && pkgid < max_packages) - return packages[pkgid]; + if (id >= 0 && id < max_id) + return zones[id]; return NULL; } @@ -150,12 +150,13 @@ static int get_tj_max(int cpu, u32 *tj_max) static int 
sys_get_curr_temp(struct thermal_zone_device *tzd, int *temp) { - struct pkg_device *pkgdev = tzd->devdata; + struct zone_device *zonedev = tzd->devdata; u32 eax, edx; - rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_STATUS, &eax, &edx); + rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_STATUS, + &eax, &edx); if (eax & 0x8000) { - *temp = pkgdev->tj_max - ((eax >> 16) & 0x7f) * 1000; + *temp = zonedev->tj_max - ((eax >> 16) & 0x7f) * 1000; pr_debug("sys_get_curr_temp %d\n", *temp); return 0; } @@ -165,7 +166,7 @@ static int sys_get_curr_temp(struct thermal_zone_device *tzd, int *temp) static int sys_get_trip_temp(struct thermal_zone_device *tzd, int trip, int *temp) { - struct pkg_device *pkgdev = tzd->devdata; + struct zone_device *zonedev = tzd->devdata; unsigned long thres_reg_value; u32 mask, shift, eax, edx; int ret; @@ -181,14 +182,14 @@ static int sys_get_trip_temp(struct thermal_zone_device *tzd, shift = THERM_SHIFT_THRESHOLD0; } - ret = rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT, + ret = rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT, &eax, &edx); if (ret < 0) return ret; thres_reg_value = (eax & mask) >> shift; if (thres_reg_value) - *temp = pkgdev->tj_max - thres_reg_value * 1000; + *temp = zonedev->tj_max - thres_reg_value * 1000; else *temp = 0; pr_debug("sys_get_trip_temp %d\n", *temp); @@ -199,14 +200,14 @@ static int sys_get_trip_temp(struct thermal_zone_device *tzd, static int sys_set_trip_temp(struct thermal_zone_device *tzd, int trip, int temp) { - struct pkg_device *pkgdev = tzd->devdata; + struct zone_device *zonedev = tzd->devdata; u32 l, h, mask, shift, intr; int ret; - if (trip >= MAX_NUMBER_OF_TRIPS || temp >= pkgdev->tj_max) + if (trip >= MAX_NUMBER_OF_TRIPS || temp >= zonedev->tj_max) return -EINVAL; - ret = rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT, + ret = rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT, &l, &h); if (ret < 0) return ret; @@ -228,11 +229,12 @@ 
sys_set_trip_temp(struct thermal_zone_device *tzd, int trip, int temp
[PATCH 18/19] perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support
From: Kan Liang Syntax update only -- no logical or functional change. In response to the new multi-die/package changes, update variable names to use "die" terminology, instead of "pkg". For previous platforms which doesn't have multi-die, "die" is identical as "pkg". Signed-off-by: Kan Liang Signed-off-by: Len Brown --- arch/x86/events/intel/uncore.c | 74 ++-- arch/x86/events/intel/uncore.h | 4 +- arch/x86/events/intel/uncore_snbep.c | 4 +- 3 files changed, 41 insertions(+), 41 deletions(-) diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index aeb5eae83750..f943e4d0e66c 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -14,7 +14,7 @@ struct pci_driver *uncore_pci_driver; DEFINE_RAW_SPINLOCK(pci2phy_map_lock); struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head); struct pci_extra_dev *uncore_extra_pci_dev; -static int max_packages; +static int max_dies; /* mask of cpus that collect uncore events */ static cpumask_t uncore_cpu_mask; @@ -100,13 +100,13 @@ ssize_t uncore_event_show(struct kobject *kobj, struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu) { - unsigned int pkgid = topology_logical_die_id(cpu); + unsigned int dieid = topology_logical_die_id(cpu); /* * The unsigned check also catches the '-1' return value for non * existent mappings in the topology map. */ - return pkgid < max_packages ? pmu->boxes[pkgid] : NULL; + return dieid < max_dies ? 
pmu->boxes[dieid] : NULL; } u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event) @@ -311,7 +311,7 @@ static struct intel_uncore_box *uncore_alloc_box(struct intel_uncore_type *type, uncore_pmu_init_hrtimer(box); box->cpu = -1; box->pci_phys_id = -1; - box->pkgid = -1; + box->dieid = -1; /* set default hrtimer timeout */ box->hrtimer_duration = UNCORE_PMU_HRTIMER_INTERVAL; @@ -826,10 +826,10 @@ static void uncore_pmu_unregister(struct intel_uncore_pmu *pmu) static void uncore_free_boxes(struct intel_uncore_pmu *pmu) { - int pkg; + int die; - for (pkg = 0; pkg < max_packages; pkg++) - kfree(pmu->boxes[pkg]); + for (die = 0; die < max_dies; die++) + kfree(pmu->boxes[die]); kfree(pmu->boxes); } @@ -866,7 +866,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid) if (!pmus) return -ENOMEM; - size = max_packages * sizeof(struct intel_uncore_box *); + size = max_dies * sizeof(struct intel_uncore_box *); for (i = 0; i < type->num_boxes; i++) { pmus[i].func_id = setid ? i : -1; @@ -936,21 +936,21 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id struct intel_uncore_type *type; struct intel_uncore_pmu *pmu = NULL; struct intel_uncore_box *box; - int phys_id, pkg, ret; + int phys_id, die, ret; phys_id = uncore_pcibus_to_physid(pdev->bus); if (phys_id < 0) return -ENODEV; - pkg = (topology_max_die_per_package() > 1) ? phys_id : + die = (topology_max_die_per_package() > 1) ? 
phys_id : topology_phys_to_logical_pkg(phys_id); - if (pkg < 0) + if (die < 0) return -EINVAL; if (UNCORE_PCI_DEV_TYPE(id->driver_data) == UNCORE_EXTRA_PCI_DEV) { int idx = UNCORE_PCI_DEV_IDX(id->driver_data); - uncore_extra_pci_dev[pkg].dev[idx] = pdev; + uncore_extra_pci_dev[die].dev[idx] = pdev; pci_set_drvdata(pdev, NULL); return 0; } @@ -989,7 +989,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id pmu = &type->pmus[UNCORE_PCI_DEV_IDX(id->driver_data)]; } - if (WARN_ON_ONCE(pmu->boxes[pkg] != NULL)) + if (WARN_ON_ONCE(pmu->boxes[die] != NULL)) return -EINVAL; box = uncore_alloc_box(type, NUMA_NO_NODE); @@ -1003,13 +1003,13 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id atomic_inc(&box->refcnt); box->pci_phys_id = phys_id; - box->pkgid = pkg; + box->dieid = die; box->pci_dev = pdev; box->pmu = pmu; uncore_box_init(box); pci_set_drvdata(pdev, box); - pmu->boxes[pkg] = box; + pmu->boxes[die] = box; if (atomic_inc_return(&pmu->activeboxes) > 1) return 0; @@ -1017,7 +1017,7 @@ static int uncore_pci_probe(struct pci_dev
[PATCH 17/19] hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages
From: Len Brown Syntax update only -- no logical or functional change. In response to the new multi-die/package changes, update variable names to use the more generic thermal "zone" terminology, instead of "package", as the zones can refer to either packages or die. Signed-off-by: Len Brown Cc: Zhang Rui --- drivers/hwmon/coretemp.c | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c index c64ce32d3214..4ebed90d81aa 100644 --- a/drivers/hwmon/coretemp.c +++ b/drivers/hwmon/coretemp.c @@ -109,10 +109,10 @@ struct platform_data { struct device_attribute name_attr; }; -/* Keep track of how many package pointers we allocated in init() */ -static int max_packages __read_mostly; -/* Array of package pointers. Serialized by cpu hotplug lock */ -static struct platform_device **pkg_devices; +/* Keep track of how many zone pointers we allocated in init() */ +static int max_zones __read_mostly; +/* Array of zone pointers. Serialized by cpu hotplug lock */ +static struct platform_device **zone_devices; static ssize_t show_label(struct device *dev, struct device_attribute *devattr, char *buf) @@ -435,10 +435,10 @@ static int chk_ucode_version(unsigned int cpu) static struct platform_device *coretemp_get_pdev(unsigned int cpu) { - int pkgid = topology_logical_die_id(cpu); + int id = topology_logical_die_id(cpu); - if (pkgid >= 0 && pkgid < max_packages) - return pkg_devices[pkgid]; + if (id >= 0 && id < max_zones) + return zone_devices[id]; return NULL; } @@ -544,7 +544,7 @@ static int coretemp_probe(struct platform_device *pdev) struct device *dev = &pdev->dev; struct platform_data *pdata; - /* Initialize the per-package data structures */ + /* Initialize the per-zone data structures */ pdata = devm_kzalloc(dev, sizeof(struct platform_data), GFP_KERNEL); if (!pdata) return -ENOMEM; @@ -579,13 +579,13 @@ static struct platform_driver coretemp_driver = { static struct platform_device 
*coretemp_device_add(unsigned int cpu) { - int err, pkgid = topology_logical_die_id(cpu); + int err, zoneid = topology_logical_die_id(cpu); struct platform_device *pdev; - if (pkgid < 0) + if (zoneid < 0) return ERR_PTR(-ENOMEM); - pdev = platform_device_alloc(DRVNAME, pkgid); + pdev = platform_device_alloc(DRVNAME, zoneid); if (!pdev) return ERR_PTR(-ENOMEM); @@ -595,7 +595,7 @@ static struct platform_device *coretemp_device_add(unsigned int cpu) return ERR_PTR(err); } - pkg_devices[pkgid] = pdev; + zone_devices[zoneid] = pdev; return pdev; } @@ -703,7 +703,7 @@ static int coretemp_cpu_offline(unsigned int cpu) * the rest. */ if (cpumask_empty(&pd->cpumask)) { - pkg_devices[topology_logical_die_id(cpu)] = NULL; + zone_devices[topology_logical_die_id(cpu)] = NULL; platform_device_unregister(pdev); return 0; } @@ -741,10 +741,10 @@ static int __init coretemp_init(void) if (!x86_match_cpu(coretemp_ids)) return -ENODEV; - max_packages = topology_max_packages() * topology_max_die_per_package(); - pkg_devices = kcalloc(max_packages, sizeof(struct platform_device *), + max_zones = topology_max_packages() * topology_max_die_per_package(); + zone_devices = kcalloc(max_zones, sizeof(struct platform_device *), GFP_KERNEL); - if (!pkg_devices) + if (!zone_devices) return -ENOMEM; err = platform_driver_register(&coretemp_driver); @@ -760,7 +760,7 @@ static int __init coretemp_init(void) outdrv: platform_driver_unregister(&coretemp_driver); - kfree(pkg_devices); + kfree(zone_devices); return err; } module_init(coretemp_init) @@ -769,7 +769,7 @@ static void __exit coretemp_exit(void) { cpuhp_remove_state(coretemp_hp_online); platform_driver_unregister(&coretemp_driver); - kfree(pkg_devices); + kfree(zone_devices); } module_exit(coretemp_exit) -- 2.18.0-rc0
[PATCH 13/19] perf/x86/intel/uncore: Support multi-die/package
From: Kan Liang Uncore becomes die-scope on Xeon Cascade Lake-AP. Uncore driver needs to support die-scope uncore units. Use topology_logical_die_id() to replace topology_logical_package_id(). For previous platforms which doesn't have multi-die, topology_logical_die_id() is identical as topology_logical_package_id(). In pci_probe/remove, the group id reads from PCI BUS is logical die id for multi-die systems. Use topology_die_cpumask() to replace topology_core_cpumask(). For previous platforms which doesn't have multi-die, topology_die_cpumask() is identical as topology_core_cpumask(). There is no functional change for previous platforms. Signed-off-by: Kan Liang Cc: Peter Zijlstra Signed-off-by: Len Brown --- arch/x86/events/intel/uncore.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index fc40a1473058..aeb5eae83750 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -100,7 +100,7 @@ ssize_t uncore_event_show(struct kobject *kobj, struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu) { - unsigned int pkgid = topology_logical_package_id(cpu); + unsigned int pkgid = topology_logical_die_id(cpu); /* * The unsigned check also catches the '-1' return value for non @@ -942,7 +942,8 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id if (phys_id < 0) return -ENODEV; - pkg = topology_phys_to_logical_pkg(phys_id); + pkg = (topology_max_die_per_package() > 1) ? phys_id : + topology_phys_to_logical_pkg(phys_id); if (pkg < 0) return -EINVAL; @@ -1033,7 +1034,8 @@ static void uncore_pci_remove(struct pci_dev *pdev) box = pci_get_drvdata(pdev); if (!box) { - pkg = topology_phys_to_logical_pkg(phys_id); + pkg = (topology_max_die_per_package() > 1) ? 
phys_id : + topology_phys_to_logical_pkg(phys_id); for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) { if (uncore_extra_pci_dev[pkg].dev[i] == pdev) { uncore_extra_pci_dev[pkg].dev[i] = NULL; @@ -1110,7 +1112,7 @@ static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu, struct intel_uncore_box *box; int i, pkg; - pkg = topology_logical_package_id(old_cpu < 0 ? new_cpu : old_cpu); + pkg = topology_logical_die_id(old_cpu < 0 ? new_cpu : old_cpu); for (i = 0; i < type->num_boxes; i++, pmu++) { box = pmu->boxes[pkg]; if (!box) @@ -1151,7 +1153,7 @@ static int uncore_event_cpu_offline(unsigned int cpu) if (!cpumask_test_and_clear_cpu(cpu, &uncore_cpu_mask)) goto unref; /* Find a new cpu to collect uncore events */ - target = cpumask_any_but(topology_core_cpumask(cpu), cpu); + target = cpumask_any_but(topology_die_cpumask(cpu), cpu); /* Migrate uncore events to the new target */ if (target < nr_cpu_ids) @@ -1164,7 +1166,7 @@ static int uncore_event_cpu_offline(unsigned int cpu) unref: /* Clear the references */ - pkg = topology_logical_package_id(cpu); + pkg = topology_logical_die_id(cpu); for (; *types; types++) { type = *types; pmu = type->pmus; @@ -1223,7 +1225,7 @@ static int uncore_event_cpu_online(unsigned int cpu) struct intel_uncore_box *box; int i, ret, pkg, target; - pkg = topology_logical_package_id(cpu); + pkg = topology_logical_die_id(cpu); ret = allocate_boxes(types, pkg, cpu); if (ret) return ret; @@ -1242,7 +1244,7 @@ static int uncore_event_cpu_online(unsigned int cpu) * Check if there is an online cpu in the package * which collects uncore events already. 
*/ - target = cpumask_any_and(&uncore_cpu_mask, topology_core_cpumask(cpu)); + target = cpumask_any_and(&uncore_cpu_mask, topology_die_cpumask(cpu)); if (target < nr_cpu_ids) return 0; @@ -1417,7 +1419,7 @@ static int __init intel_uncore_init(void) if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) return -ENODEV; - max_packages = topology_max_packages(); + max_packages = topology_max_packages() * topology_max_die_per_package(); uncore_init = (struct intel_uncore_init_fun *)id->driver_data; if (uncore_init->pci_init) { -- 2.18.0-rc0
[PATCH 15/19] perf/x86/intel/cstate: Support multi-die/package
From: Kan Liang Some cstate counters becomes die-scope on Xeon Cascade Lake-AP. Perf cstate driver needs to support die-scope cstate counters. Use topology_die_cpumask() to replace topology_core_cpumask(). For previous platforms which doesn't have multi-die, topology_die_cpumask() is identical as topology_core_cpumask(). There is no functional change for previous platforms. Name the die-scope PMU "cstate_die". Signed-off-by: Kan Liang Cc: Peter Zijlstra Signed-off-by: Len Brown --- arch/x86/events/intel/cstate.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c index 6072f92cb8ea..267d7f8e12ab 100644 --- a/arch/x86/events/intel/cstate.c +++ b/arch/x86/events/intel/cstate.c @@ -302,7 +302,7 @@ static int cstate_pmu_event_init(struct perf_event *event) return -EINVAL; event->hw.event_base = pkg_msr[cfg].msr; cpu = cpumask_any_and(&cstate_pkg_cpu_mask, - topology_core_cpumask(event->cpu)); + topology_die_cpumask(event->cpu)); } else { return -ENOENT; } @@ -385,7 +385,7 @@ static int cstate_cpu_exit(unsigned int cpu) if (has_cstate_pkg && cpumask_test_and_clear_cpu(cpu, &cstate_pkg_cpu_mask)) { - target = cpumask_any_but(topology_core_cpumask(cpu), cpu); + target = cpumask_any_but(topology_die_cpumask(cpu), cpu); /* Migrate events if there is a valid target */ if (target < nr_cpu_ids) { cpumask_set_cpu(target, &cstate_pkg_cpu_mask); @@ -414,7 +414,7 @@ static int cstate_cpu_init(unsigned int cpu) * in the package cpu mask as the designated reader. 
*/ target = cpumask_any_and(&cstate_pkg_cpu_mask, -topology_core_cpumask(cpu)); +topology_die_cpumask(cpu)); if (has_cstate_pkg && target >= nr_cpu_ids) cpumask_set_cpu(cpu, &cstate_pkg_cpu_mask); @@ -663,7 +663,13 @@ static int __init cstate_init(void) } if (has_cstate_pkg) { - err = perf_pmu_register(&cstate_pkg_pmu, cstate_pkg_pmu.name, -1); + if (topology_max_die_per_package() > 1) { + err = perf_pmu_register(&cstate_pkg_pmu, + "cstate_die", -1); + } else { + err = perf_pmu_register(&cstate_pkg_pmu, + cstate_pkg_pmu.name, -1); + } if (err) { has_cstate_pkg = false; pr_info("Failed to register cstate pkg pmu\n"); -- 2.18.0-rc0
[PATCH 09/19] powercap/intel_rapl: Update RAPL domain name and debug messages
From: Zhang Rui The RAPL domain "name" attribute contains "Package-N", which is ambiguous on multi-die per-package systems. Update the name to "package-X-die-Y" on those systems. No change on systems without multi-die/package. Update driver debug messages. Signed-off-by: Zhang Rui Signed-off-by: Len Brown Acked-by: Rafael J. Wysocki Cc: linux...@vger.kernel.org --- drivers/powercap/intel_rapl.c | 57 --- 1 file changed, 32 insertions(+), 25 deletions(-) diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 9202dbcef96d..ad78c1d08260 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -178,12 +178,15 @@ struct rapl_domain { #define power_zone_to_rapl_domain(_zone) \ container_of(_zone, struct rapl_domain, power_zone) +/* maximum rapl package domain name: package-%d-die-%d */ +#define PACKAGE_DOMAIN_NAME_LENGTH 30 -/* Each physical package contains multiple domains, these are the common + +/* Each rapl package contains multiple domains, these are the common * data across RAPL domains within a package. 
*/ struct rapl_package { - unsigned int id; /* physical package/socket id */ + unsigned int id; /* logical die id, equals physical 1-die systems */ unsigned int nr_domains; unsigned long domain_map; /* bit map of active domains */ unsigned int power_unit; @@ -198,6 +201,7 @@ struct rapl_package { int lead_cpu; /* one active cpu per package for access */ /* Track active cpus */ struct cpumask cpumask; + char name[PACKAGE_DOMAIN_NAME_LENGTH]; }; struct rapl_defaults { @@ -926,8 +930,8 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu) value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; rp->time_unit = 100 / (1 << value); - pr_debug("Core CPU package %d energy=%dpJ, time=%dus, power=%duW\n", - rp->id, rp->energy_unit, rp->time_unit, rp->power_unit); + pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n", + rp->name, rp->energy_unit, rp->time_unit, rp->power_unit); return 0; } @@ -951,8 +955,8 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu) value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; rp->time_unit = 100 / (1 << value); - pr_debug("Atom package %d energy=%dpJ, time=%dus, power=%duW\n", - rp->id, rp->energy_unit, rp->time_unit, rp->power_unit); + pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n", + rp->name, rp->energy_unit, rp->time_unit, rp->power_unit); return 0; } @@ -1181,7 +1185,7 @@ static void rapl_update_domain_data(struct rapl_package *rp) u64 val; for (dmn = 0; dmn < rp->nr_domains; dmn++) { - pr_debug("update package %d domain %s data\n", rp->id, + pr_debug("update %s domain %s data\n", rp->name, rp->domains[dmn].name); /* exclude non-raw primitives */ for (prim = 0; prim < NR_RAW_PRIMITIVES; prim++) { @@ -1206,7 +1210,6 @@ static void rapl_unregister_powercap(void) static int rapl_package_register_powercap(struct rapl_package *rp) { struct rapl_domain *rd; - char dev_name[17]; /* max domain name = 7 + 1 + 8 for int + 1 for null*/ struct powercap_zone *power_zone = NULL; int nr_pl, ret; @@ 
-1217,20 +1220,16 @@ static int rapl_package_register_powercap(struct rapl_package *rp) for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) { if (rd->id == RAPL_DOMAIN_PACKAGE) { nr_pl = find_nr_power_limit(rd); - pr_debug("register socket %d package domain %s\n", - rp->id, rd->name); - memset(dev_name, 0, sizeof(dev_name)); - snprintf(dev_name, sizeof(dev_name), "%s-%d", - rd->name, rp->id); + pr_debug("register package domain %s\n", rp->name); power_zone = powercap_register_zone(&rd->power_zone, control_type, - dev_name, NULL, + rp->name, NULL, &zone_ops[rd->id], nr_pl, &constraint_ops); if (IS_ERR(power_zone)) { -
[PATCH 19/19] perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support
From: Kan Liang

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names to use "die" terminology, instead of "pkg". For previous platforms that don't have multi-die, "die" is identical to "pkg".

Signed-off-by: Kan Liang
Signed-off-by: Len Brown
---
 arch/x86/events/intel/rapl.c | 14 +++----
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
index 6f5331271563..3992b0e65a55 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/intel/rapl.c
@@ -148,7 +148,7 @@ struct rapl_pmu {
 struct rapl_pmus {
 	struct pmu		pmu;
-	unsigned int		maxpkg;
+	unsigned int		maxdie;
 	struct rapl_pmu		*pmus[];
 };

@@ -161,13 +161,13 @@ static u64 rapl_timer_ms;

 static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu)
 {
-	unsigned int pkgid = topology_logical_die_id(cpu);
+	unsigned int dieid = topology_logical_die_id(cpu);

 	/*
	 * The unsigned check also catches the '-1' return value for non
	 * existent mappings in the topology map.
	 */
-	return pkgid < rapl_pmus->maxpkg ? rapl_pmus->pmus[pkgid] : NULL;
+	return dieid < rapl_pmus->maxdie ?
rapl_pmus->pmus[dieid] : NULL; } static inline u64 rapl_read_counter(struct perf_event *event) @@ -668,22 +668,22 @@ static void cleanup_rapl_pmus(void) { int i; - for (i = 0; i < rapl_pmus->maxpkg; i++) + for (i = 0; i < rapl_pmus->maxdie; i++) kfree(rapl_pmus->pmus[i]); kfree(rapl_pmus); } static int __init init_rapl_pmus(void) { - int maxpkg = topology_max_packages() * topology_max_die_per_package(); + int maxdie = topology_max_packages() * topology_max_die_per_package(); size_t size; - size = sizeof(*rapl_pmus) + maxpkg * sizeof(struct rapl_pmu *); + size = sizeof(*rapl_pmus) + maxdie * sizeof(struct rapl_pmu *); rapl_pmus = kzalloc(size, GFP_KERNEL); if (!rapl_pmus) return -ENOMEM; - rapl_pmus->maxpkg = maxpkg; + rapl_pmus->maxdie = maxdie; rapl_pmus->pmu.attr_groups = rapl_attr_groups; rapl_pmus->pmu.task_ctx_nr = perf_invalid_context; rapl_pmus->pmu.event_init = rapl_pmu_event_init; -- 2.18.0-rc0
[PATCH 14/19] perf/x86/intel/rapl: Support multi-die/package
From: Kan Liang RAPL becomes die-scope on Xeon Cascade Lake-AP. Perf RAPL driver needs to support die-scope RAPL domain. Use topology_logical_die_id() to replace topology_logical_package_id(). For previous platforms which doesn't have multi-die, topology_logical_die_id() is identical as topology_logical_package_id(). Use topology_die_cpumask() to replace topology_core_cpumask(). For previous platforms which doesn't have multi-die, topology_die_cpumask() is identical as topology_core_cpumask(). There is no functional change for previous platforms. Signed-off-by: Kan Liang Cc: Peter Zijlstra Signed-off-by: Len Brown --- arch/x86/events/intel/rapl.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c index 37ebf6fc5415..6f5331271563 100644 --- a/arch/x86/events/intel/rapl.c +++ b/arch/x86/events/intel/rapl.c @@ -161,7 +161,7 @@ static u64 rapl_timer_ms; static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu) { - unsigned int pkgid = topology_logical_package_id(cpu); + unsigned int pkgid = topology_logical_die_id(cpu); /* * The unsigned check also catches the '-1' return value for non @@ -571,7 +571,7 @@ static int rapl_cpu_offline(unsigned int cpu) pmu->cpu = -1; /* Find a new cpu to collect rapl events */ - target = cpumask_any_but(topology_core_cpumask(cpu), cpu); + target = cpumask_any_but(topology_die_cpumask(cpu), cpu); /* Migrate rapl events to the new target */ if (target < nr_cpu_ids) { @@ -598,14 +598,14 @@ static int rapl_cpu_online(unsigned int cpu) pmu->timer_interval = ms_to_ktime(rapl_timer_ms); rapl_hrtimer_init(pmu); - rapl_pmus->pmus[topology_logical_package_id(cpu)] = pmu; + rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu; } /* * Check if there is an online cpu in the package which collects rapl * events already. 
*/ - target = cpumask_any_and(&rapl_cpu_mask, topology_core_cpumask(cpu)); + target = cpumask_any_and(&rapl_cpu_mask, topology_die_cpumask(cpu)); if (target < nr_cpu_ids) return 0; @@ -675,7 +675,7 @@ static void cleanup_rapl_pmus(void) static int __init init_rapl_pmus(void) { - int maxpkg = topology_max_packages(); + int maxpkg = topology_max_packages() * topology_max_die_per_package(); size_t size; size = sizeof(*rapl_pmus) + maxpkg * sizeof(struct rapl_pmu *); -- 2.18.0-rc0
[PATCH 0/19] v6 multi-die/package topology support
This patch series does 4 things.

1. Parse the new CPUID.1F leaf to discover multi-die/package topology
2. Export multi-die topology inside the kernel
3. Update 4 places (coretemp, pkgtemp, rapl, perf) that need to know the difference between die and package-scope MSRs
4. Export multi-die topology to user-space via sysfs

These changes should have no impact on cache topology, NUMA topology, the Linux scheduler, or system performance. These topology changes primarily impact the parts of the kernel, and some applications, that care about package MSR scope. Also, some software is licensed per package, and other tools, such as benchmark reporting software, sometimes care about packages.

Changes since v5:

The patch numbers have decremented by 3, since the first 3 patches in the original series are now upstream.

The last two "Cosmetic rename" patches have been replaced with those by Kan, who did a more thorough variable rename than I had originally proposed.

[PATCH 18/19] perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support
[PATCH 19/19] perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support

I'm not aware of any outstanding feedback on this series, functional or cosmetic.
thanks, Len Brown, Intel Opensource Technology Center -- The following changes since commit a13f0655503a4a89df67fdc7cac6a7810795d4b3: Merge tag 'iommu-updates-v5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu (2019-05-13 09:23:18 -0400) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86 for you to fetch changes up to 0ddb97e121397d37933233da303556141814fa47: perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support (2019-05-13 13:41:50 -0400) Kan Liang (5): perf/x86/intel/uncore: Support multi-die/package perf/x86/intel/rapl: Support multi-die/package perf/x86/intel/cstate: Support multi-die/package perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support Len Brown (9): x86 topology: Add CPUID.1F multi-die/package support x86 topology: Create topology_max_die_per_package() cpu topology: Export die_id x86 topology: Define topology_die_id() x86 topology: Define topology_logical_die_id() topology: Create package_cpus sysfs attribute topology: Create core_cpus and die_cpus sysfs attributes thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages Zhang Rui (5): powercap/intel_rapl: Simplify rapl_find_package() powercap/intel_rapl: Support multi-die/package thermal/x86_pkg_temp_thermal: Support multi-die/package powercap/intel_rapl: Update RAPL domain name and debug messages hwmon/coretemp: Support multi-die/package Documentation/cputopology.txt| 48 ++--- Documentation/x86/topology.rst | 4 + arch/x86/events/intel/cstate.c | 14 ++- arch/x86/events/intel/rapl.c | 20 ++-- arch/x86/events/intel/uncore.c | 80 +++ arch/x86/events/intel/uncore.h | 4 +- arch/x86/events/intel/uncore_snbep.c | 4 +- arch/x86/include/asm/processor.h | 4 +- 
arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 17 arch/x86/kernel/cpu/common.c | 1 + arch/x86/kernel/cpu/topology.c | 88 + arch/x86/kernel/smpboot.c| 69 + arch/x86/xen/smp_pv.c| 1 + drivers/base/topology.c | 22 + drivers/hwmon/coretemp.c | 36 +++ drivers/powercap/intel_rapl.c| 75 +++--- drivers/thermal/intel/x86_pkg_temp_thermal.c | 142 ++- include/linux/topology.h | 6 ++ 19 files changed, 422 insertions(+), 214 deletions(-)
[PATCH 11/19] topology: Create package_cpus sysfs attribute
From: Len Brown The existing sysfs cpu/topology/core_siblings (and core_siblings_list) attributes are documented, implemented, and used by programs to represent set of logical CPUs sharing the same package. This makes sense if the next topology level above a core is always a package. But on systems where there is a die topology level between a core and a package, the name and its definition become inconsistent. So without changing its function, add a name for this map that describes what it actually is -- package CPUs -- the set of CPUs that share the same package. This new name will be immune to changes in topology, since it describes threads at the current level, not siblings at a contained level. Signed-off-by: Len Brown Suggested-by: Brice Goglin --- Documentation/cputopology.txt | 12 ++-- drivers/base/topology.c | 6 ++ 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index 2ff8a1e9a2db..48af5c290e20 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -46,15 +46,15 @@ thread_siblings_list: human-readable list of cpuX's hardware threads within the same core as cpuX. -core_siblings: +package_cpus: - internal kernel map of cpuX's hardware threads within the same - physical_package_id. + internal kernel map of the CPUs sharing the same physical_package_id. + (deprecated name: "core_siblings") -core_siblings_list: +package_cpus_list: - human-readable list of cpuX's hardware threads within the same - physical_package_id. + human-readable list of CPUs sharing the same physical_package_id. 
+ (deprecated name: "core_siblings_list") book_siblings: diff --git a/drivers/base/topology.c b/drivers/base/topology.c index 50352cf96f85..dc3c19b482f3 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -57,6 +57,10 @@ define_siblings_show_func(core_siblings, core_cpumask); static DEVICE_ATTR_RO(core_siblings); static DEVICE_ATTR_RO(core_siblings_list); +define_siblings_show_func(package_cpus, core_cpumask); +static DEVICE_ATTR_RO(package_cpus); +static DEVICE_ATTR_RO(package_cpus_list); + #ifdef CONFIG_SCHED_BOOK define_id_show_func(book_id); static DEVICE_ATTR_RO(book_id); @@ -81,6 +85,8 @@ static struct attribute *default_attrs[] = { &dev_attr_thread_siblings_list.attr, &dev_attr_core_siblings.attr, &dev_attr_core_siblings_list.attr, + &dev_attr_package_cpus.attr, + &dev_attr_package_cpus_list.attr, #ifdef CONFIG_SCHED_BOOK &dev_attr_book_id.attr, &dev_attr_book_siblings.attr, -- 2.18.0-rc0
[PATCH 04/19] x86 topology: Define topology_die_id()
From: Len Brown

topology_die_id(cpu) is a simple macro for use inside the kernel to get the die_id associated with the given cpu.

Signed-off-by: Len Brown
---
 arch/x86/include/asm/topology.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index e0232f7042c3..3777dbe9c0ff 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
+#define topology_die_id(cpu)			(cpu_data(cpu).cpu_die_id)
 #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)

 #ifdef CONFIG_SMP
--
2.18.0-rc0
[PATCH 07/19] powercap/intel_rapl: Support multi-die/package
From: Zhang Rui

RAPL "package" domains are actually implemented in hardware per-die. Thus, the new multi-die/package systems have multiple domains within each physical package.

Update the intel_rapl driver to be "die aware" -- exporting multiple domains within a single package, when present. No change on single die/package systems.

Signed-off-by: Zhang Rui
Signed-off-by: Len Brown
Acked-by: Rafael J. Wysocki
Cc: linux...@vger.kernel.org
---
 drivers/powercap/intel_rapl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 3c3c0c23180b..9202dbcef96d 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -266,7 +266,7 @@ static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */
 /* caller to ensure CPU hotplug lock is held */
 static struct rapl_package *rapl_find_package_domain(int cpu)
 {
-	int id = topology_physical_package_id(cpu);
+	int id = topology_logical_die_id(cpu);
 	struct rapl_package *rp;

 	list_for_each_entry(rp, &rapl_packages, plist) {
@@ -1459,7 +1459,7 @@ static void rapl_remove_package(struct rapl_package *rp)
 /* called from CPU hotplug notifier, hotplug lock held */
 static struct rapl_package *rapl_add_package(int cpu)
 {
-	int id = topology_physical_package_id(cpu);
+	int id = topology_logical_die_id(cpu);
 	struct rapl_package *rp;
 	int ret;
--
2.18.0-rc0
Re: [PATCH 21/22] perf/x86/intel/uncore: renames in response to multi-die/pkg support
On Thu, May 9, 2019 at 11:02 AM Liang, Kan wrote: > > I think the "box" terminology in perf uncore has different meaning. It > stands for an uncore PMU unit on a socket/die. > I think it may be better use "die" to replace the "pkg". > How about the patch as below? Also fine with me. And I've replaced my rename patch with yours here too. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH 22/22] perf/x86/intel/rapl: rename internal variables in response to multi-die/pkg support
On Thu, May 9, 2019 at 11:04 AM Liang, Kan wrote: > The perf rapl "pmu" in the code is cross the pkg/die. We only register > one rapl pmu for whole system for now. > I think it may be better use "die" to replace the "pkg" as well. > How about the patch as below? Fine with me! I've replaced my patch with yours in the series. thanks, Len Brown, Intel Open Source Technology Center
Re: [PATCH 16/22] perf/x86/intel/uncore: Support multi-die/package
On Tue, May 7, 2019 at 8:22 AM Peter Zijlstra wrote: > > On Mon, May 06, 2019 at 05:26:11PM -0400, Len Brown wrote: > > @@ -1223,7 +1225,7 @@ static int uncore_event_cpu_online(unsigned int cpu) > > struct intel_uncore_box *box; > > int i, ret, pkg, target; > > > > - pkg = topology_logical_package_id(cpu); > > + pkg = topology_logical_die_id(cpu); > > ret = allocate_boxes(types, pkg, cpu); > > if (ret) > > return ret; > > 'pkg' != 'die' To keep the patch with this functional change minimal and obvious, the rename of this variable was segregated into patch 22, which contains renames only. Len Brown, Intel Open Source Technology Center
Re: [PATCH 16/22] perf/x86/intel/uncore: Support multi-die/package
On Tue, May 7, 2019 at 8:21 AM Peter Zijlstra wrote: > > On Mon, May 06, 2019 at 05:26:11PM -0400, Len Brown wrote: > > @@ -1411,7 +1413,7 @@ static int __init intel_uncore_init(void) > > if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) > > return -ENODEV; > > > > - max_packages = topology_max_packages(); > > + max_packages = topology_max_packages() * > > topology_max_die_per_package(); > > That's max_dies.. To keep the patch with this functional change minimal and obvious, the rename of this variable was segregated into patch 22, which contains renames only. Len Brown, Intel Open Source Technology Center
Re: [PATCH 10/22] powercap/intel_rapl: Support multi-die/package
On Tue, May 7, 2019 at 8:15 AM Peter Zijlstra wrote: > > On Mon, May 06, 2019 at 05:26:05PM -0400, Len Brown wrote: > > From: Zhang Rui > > > > RAPL "package" domains are actually implemented in hardware per-die. > > Thus, the new multi-die/package systems have mulitple domains > > within each physical package. > > > > Update the intel_rapl driver to be "die aware" -- exporting multiple > > domains within a single package, when present. > > No change on single die/package systems. > > > > Signed-off-by: Zhang Rui > > Signed-off-by: Len Brown > > Acked-by: Rafael J. Wysocki > > Cc: linux...@vger.kernel.org > > --- > > drivers/powercap/intel_rapl.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c > > index 3c3c0c23180b..9202dbcef96d 100644 > > --- a/drivers/powercap/intel_rapl.c > > +++ b/drivers/powercap/intel_rapl.c > > @@ -266,7 +266,7 @@ static struct rapl_domain *platform_rapl_domain; /* > > Platform (PSys) domain */ > > /* caller to ensure CPU hotplug lock is held */ > > static struct rapl_package *rapl_find_package_domain(int cpu) > > { > > - int id = topology_physical_package_id(cpu); > > + int id = topology_logical_die_id(cpu); > > struct rapl_package *rp; > Both functions are still misnomers. rapl_find_package_domain() does in > fact now do rapl_find_die_domain(), right? Same for rapl_add_package() A "RAPL Package Domain" (rapl_package, above) is a known proper noun -- it is a known documented capability. When there could be just 1 die in a package, the name of this capability also always matched the scope of a physical package. Now that some products have two die in the same package, there can be two of these inside a package, and they are associated with each die. There are no plans to re-name the Package RAPL Domain capability in the hardware documentation. 
Similarly, there are no plans to re-name any of the other "PACKAGE" scoped MSRs to have "DIE" in their name instead. The ship with those names sailed long ago. I think the code above reflects its function, and that somebody maintaining it will be clear on this. That is important, because in the future, there will be a concept of PACKAGE scoped MSRs that span multiple DIE... cheers, Len Brown, Intel Open Source Technology Center
[PATCH 04/22] x86 topology: Add CPUID.1F multi-die/package support
From: Len Brown Some new systems have multiple software-visible die within each package. Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle either CPUID.B, or the new CPUID.1F. Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result. die_id will be non-zero only for multi-die/package systems. Signed-off-by: Len Brown Cc: linux-...@vger.kernel.org --- Documentation/x86/topology.txt | 4 ++ arch/x86/include/asm/processor.h | 4 +- arch/x86/kernel/cpu/topology.c | 85 +--- arch/x86/kernel/smpboot.c| 2 + 4 files changed, 75 insertions(+), 20 deletions(-) diff --git a/Documentation/x86/topology.txt b/Documentation/x86/topology.txt index 06b3cdbc4048..8107b6cfc9ea 100644 --- a/Documentation/x86/topology.txt +++ b/Documentation/x86/topology.txt @@ -46,6 +46,10 @@ The topology of a system is described in the units of: The number of cores in a package. This information is retrieved via CPUID. + - cpuinfo_x86.x86_max_dies: + +The number of dies in a package. This information is retrieved via CPUID. + - cpuinfo_x86.phys_proc_id: The physical ID of the package. 
This information is retrieved via CPUID diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 2bb3a648fc12..2507edc30cc2 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -105,7 +105,8 @@ struct cpuinfo_x86 { int x86_power; unsigned long loops_per_jiffy; /* cpuid returned max cores value: */ - u16 x86_max_cores; + u16 x86_max_cores; + u16 x86_max_dies; u16 apicid; u16 initial_apicid; u16 x86_clflush_size; @@ -117,6 +118,7 @@ struct cpuinfo_x86 { u16 logical_proc_id; /* Core id: */ u16 cpu_core_id; + u16 cpu_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 8f6c784141d1..4d17e699657d 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -15,33 +15,63 @@ /* leaf 0xb SMT level */ #define SMT_LEVEL 0 -/* leaf 0xb sub-leaf types */ +/* extended topology sub-leaf types */ #define INVALID_TYPE 0 #define SMT_TYPE 1 #define CORE_TYPE 2 +#define DIE_TYPE 5 #define LEAFB_SUBTYPE(ecx) (((ecx) >> 8) & 0xff) #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f) #define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0x) -int detect_extended_topology_early(struct cpuinfo_x86 *c) -{ #ifdef CONFIG_SMP +/* + * Check if given CPUID extended toplogy "leaf" is implemented + */ +static int check_extended_topology_leaf(int leaf) +{ unsigned int eax, ebx, ecx, edx; - if (c->cpuid_level < 0xb) + cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + + if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE)) return -1; - cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + return 0; +} +/* + * Return best CPUID Extended Toplogy Leaf supported + */ +static int detect_extended_topology_leaf(struct cpuinfo_x86 *c) +{ + if (c->cpuid_level >= 0x1f) { + if (check_extended_topology_leaf(0x1f) == 0) + return 0x1f; + } - /* -* check if the cpuid leaf 0xb is actually implemented. 
-*/ - if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE)) + if (c->cpuid_level >= 0xb) { + if (check_extended_topology_leaf(0xb) == 0) + return 0xb; + } + + return -1; +} +#endif + +int detect_extended_topology_early(struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + unsigned int eax, ebx, ecx, edx; + int leaf; + + leaf = detect_extended_topology_leaf(c); + if (leaf < 0) return -1; set_cpu_cap(c, X86_FEATURE_XTOPOLOGY); + cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx); /* * initial apic id, which also represents 32-bit extended x2apic id. */ @@ -52,7 +82,7 @@ int detect_extended_topology_early(struct cpuinfo_x86 *c) } /* - * Check for extended topology enumeration cpuid leaf 0xb and if it + * Check for extended topology enumeration cpuid leaf, and if it * exists, use it for populating initial_apicid and cpu topology * detection. */ @@ -60,22 +90,28 @@ int detect_extended_topology(struct cpuinfo_x86 *c) { #ifdef CONFIG_SMP unsigned int eax, ebx, ecx, edx, sub_index; - unsigned int ht_mask
[PATCH 05/22] x86 topology: Create topology_max_die_per_package()
From: Len Brown topology_max_packages() is available to size resources to cover all packages in the system. But now we have multi-die/package systems, and some resources are per-die. Create topology_max_die_per_package(), for detecting multi-die/package systems, and sizing any per-die resources. Signed-off-by: Len Brown --- arch/x86/include/asm/processor.h | 1 - arch/x86/include/asm/topology.h | 10 ++ arch/x86/kernel/cpu/topology.c | 5 - 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 2507edc30cc2..5f45488b1a9d 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -106,7 +106,6 @@ struct cpuinfo_x86 { unsigned long loops_per_jiffy; /* cpuid returned max cores value: */ u16 x86_max_cores; - u16 x86_max_dies; u16 apicid; u16 initial_apicid; u16 x86_clflush_size; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38a1c33..e0232f7042c3 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -115,6 +115,13 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) +extern unsigned int __max_die_per_package; + +static inline int topology_max_die_per_package(void) +{ + return __max_die_per_package; +} + extern int __max_smt_threads; static inline int topology_max_smt_threads(void) @@ -131,6 +138,9 @@ bool topology_smt_supported(void); static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } +static inline int topology_phys_to_logical_die(unsigned int die, + unsigned int cpu) { return 0; } +static inline int topology_max_die_per_package(void) { return 1; } static inline int topology_max_smt_threads(void) { return 1; } static inline bool 
topology_is_primary_thread(unsigned int cpu) { return true; } static inline bool topology_smt_supported(void) { return false; } diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 4d17e699657d..ee48c3fc8a65 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -26,6 +26,9 @@ #define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0x) #ifdef CONFIG_SMP +unsigned int __max_die_per_package __read_mostly = 1; +EXPORT_SYMBOL(__max_die_per_package); + /* * Check if given CPUID extended toplogy "leaf" is implemented */ @@ -146,7 +149,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c) c->apicid = apic->phys_pkg_id(c->initial_apicid, 0); c->x86_max_cores = (core_level_siblings / smp_num_siblings); - c->x86_max_dies = (die_level_siblings / core_level_siblings); + __max_die_per_package = (die_level_siblings / core_level_siblings); #endif return 0; } -- 2.18.0-rc0
[PATCH 08/22] x86 topology: Define topology_logical_die_id()
From: Len Brown Define topology_logical_die_id() ala existing topology_logical_package_id() Tested-by: Zhang Rui Signed-off-by: Len Brown --- arch/x86/include/asm/processor.h | 1 + arch/x86/include/asm/topology.h | 5 arch/x86/kernel/cpu/common.c | 1 + arch/x86/kernel/smpboot.c| 45 4 files changed, 52 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 5f45488b1a9d..def963d0b805 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -118,6 +118,7 @@ struct cpuinfo_x86 { /* Core id: */ u16 cpu_core_id; u16 cpu_die_id; + u16 logical_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 3777dbe9c0ff..9de16b4f6023 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id) #define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id) +#define topology_logical_die_id(cpu) (cpu_data(cpu).logical_die_id) #define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id) #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id) @@ -131,13 +132,17 @@ static inline int topology_max_smt_threads(void) } int topology_update_package_map(unsigned int apicid, unsigned int cpu); +int topology_update_die_map(unsigned int dieid, unsigned int cpu); int topology_phys_to_logical_pkg(unsigned int pkg); +int topology_phys_to_logical_die(unsigned int die, unsigned int cpu); bool topology_is_primary_thread(unsigned int cpu); bool topology_smt_supported(void); #else #define topology_max_packages()(1) static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } +static inline int +topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; } static inline int 
topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } static inline int topology_phys_to_logical_die(unsigned int die, unsigned int cpu) { return 0; } diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index cb28e98a0659..24f96c9771af 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1285,6 +1285,7 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c) cpu, apicid, c->initial_apicid); } BUG_ON(topology_update_package_map(c->phys_proc_id, cpu)); + BUG_ON(topology_update_die_map(c->cpu_die_id, cpu)); #else c->logical_proc_id = 0; #endif diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 05f9cfdddffd..a114375e14f7 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -101,6 +101,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info); unsigned int __max_logical_packages __read_mostly; EXPORT_SYMBOL(__max_logical_packages); static unsigned int logical_packages __read_mostly; +static unsigned int logical_die __read_mostly; /* Maximum number of SMT threads on any online core */ int __read_mostly __max_smt_threads = 1; @@ -302,6 +303,26 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg) return -1; } EXPORT_SYMBOL(topology_phys_to_logical_pkg); +/** + * topology_phys_to_logical_die - Map a physical die id to logical + * + * Returns logical die id or -1 if not found + */ +int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu) +{ + int cpu; + int proc_id = cpu_data(cur_cpu).phys_proc_id; + + for_each_possible_cpu(cpu) { + struct cpuinfo_x86 *c = &cpu_data(cpu); + + if (c->initialized && c->cpu_die_id == die_id && + c->phys_proc_id == proc_id) + return c->logical_die_id; + } + return -1; +} +EXPORT_SYMBOL(topology_phys_to_logical_die); /** * topology_update_package_map - Update the physical to logical package map @@ -326,6 +347,29 @@ int topology_update_package_map(unsigned int pkg, unsigned int cpu) cpu_data(cpu).logical_proc_id = new; return 0; } +/** 
+ * topology_update_die_map - Update the physical to logical die map + * @die: The die id as retrieved via CPUID + * @cpu: The cpu for which this is updated + */ +int topology_update_die_map(unsigned int die, unsigned int cpu) +{ + int new; + + /* Already available somewhere? */ + new = topology_phys_to_logical_die(die, cpu); + if (new >= 0) + goto found; + + new = logical_die++; + if
[PATCH 15/22] topology: Create core_cpus and die_cpus sysfs attributes
From: Len Brown Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list". These attributes represent all of the logical CPUs that share the same core. These attributes are synonymous with the existing "thread_siblings" and "thread_siblings_list" attributes, which will be deprecated. Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list". These attributes represent all of the logical CPUs that share the same die. Signed-off-by: Len Brown Suggested-by: Brice Goglin --- Documentation/cputopology.txt | 21 +++-- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 22 ++ arch/x86/xen/smp_pv.c | 1 + drivers/base/topology.c | 12 include/linux/topology.h| 3 +++ 7 files changed, 55 insertions(+), 6 deletions(-) diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index 48af5c290e20..b90dafcc8237 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -36,15 +36,15 @@ drawer_id: identifier (rather than the kernel's). The actual value is architecture and platform dependent. -thread_siblings: +core_cpus: - internal kernel map of cpuX's hardware threads within the same - core as cpuX. + internal kernel map of CPUs within the same core. + (deprecated name: "thread_siblings") -thread_siblings_list: +core_cpus_list: - human-readable list of cpuX's hardware threads within the same - core as cpuX. + human-readable list of CPUs within the same core. + (deprecated name: "thread_siblings_list"); package_cpus: @@ -56,6 +56,14 @@ package_cpus_list: human-readable list of CPUs sharing the same physical_package_id. (deprecated name: "core_siblings_list") +die_cpus: + + internal kernel map of CPUs within the same die. + +die_cpus_list: + + human-readable list of CPUs within the same die. 
+ book_siblings: internal kernel map of cpuX's hardware threads within the same @@ -93,6 +101,7 @@ these macros in include/asm-XXX/topology.h:: #define topology_drawer_id(cpu) #define topology_sibling_cpumask(cpu) #define topology_core_cpumask(cpu) + #define topology_die_cpumask(cpu) #define topology_book_cpumask(cpu) #define topology_drawer_cpumask(cpu) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index 2e95b6c1bca3..39266d193597 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -23,6 +23,7 @@ extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); +DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 9de16b4f6023..4b14d2318251 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id) #ifdef CONFIG_SMP +#define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu)) #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index a114375e14f7..48a671443266 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -91,6 +91,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); +/* representing HT, core, and die siblings of each logical CPU */ +DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map); +EXPORT_PER_CPU_SYMBOL(cpu_die_map); + DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, 
cpu_llc_shared_map); /* Per CPU bogomips and other parameters */ @@ -509,6 +513,15 @@ static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) return false; } +static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) +{ + if ((c->phys_proc_id == o->phys_proc_id) && + (c->cpu_die_id == o->cpu_die_id)) + return true; + return false; +} + + #if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC) static inline int x86_sched_itmt_flags(void) { @@ -571,6 +584,7 @@ void set_cpu_sibling_map(int cpu) cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)
[PATCH 03/22] x86 smpboot: Rename match_die() to match_pkg()
From: Len Brown Syntax only, no functional or semantic change. This routine matches packages, not die, so name it thus. Signed-off-by: Len Brown --- arch/x86/kernel/smpboot.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index ce1a67b70168..3f8bbfb26c18 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -455,7 +455,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) * multicore group inside a NUMA node. If this happens, we will * discard the MC level of the topology later. */ -static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) +static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) { if (c->phys_proc_id == o->phys_proc_id) return true; @@ -546,7 +546,7 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = &cpu_data(i); - if ((i == cpu) || (has_mp && match_die(c, o))) { + if ((i == cpu) || (has_mp && match_pkg(c, o))) { link_mask(topology_core_cpumask, cpu, i); /* @@ -570,7 +570,7 @@ void set_cpu_sibling_map(int cpu) } else if (i != cpu && !c->booted_cores) c->booted_cores = cpu_data(i).booted_cores; } - if (match_die(c, o) && !topology_same_node(c, o)) + if (match_pkg(c, o) && !topology_same_node(c, o)) x86_has_numa_in_package = true; } -- 2.18.0-rc0
[PATCH 09/22] powercap/intel_rapl: Simplify rapl_find_package()
From: Zhang Rui Syntax only, no functional or semantic change. Simplify how the code to discover a package is called. Rename find_package_by_id() to rapl_find_package_domain() Signed-off-by: Zhang Rui Signed-off-by: Len Brown Acked-by: Rafael J. Wysocki Cc: linux...@vger.kernel.org --- drivers/powercap/intel_rapl.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 4347f15165f8..3c3c0c23180b 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -264,8 +264,9 @@ static struct powercap_control_type *control_type; /* PowerCap Controller */ static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */ /* caller to ensure CPU hotplug lock is held */ -static struct rapl_package *find_package_by_id(int id) +static struct rapl_package *rapl_find_package_domain(int cpu) { + int id = topology_physical_package_id(cpu); struct rapl_package *rp; list_for_each_entry(rp, &rapl_packages, plist) { @@ -1300,7 +1301,7 @@ static int __init rapl_register_psys(void) rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; rd->rpl[1].name = pl2_name; - rd->rp = find_package_by_id(0); + rd->rp = rapl_find_package_domain(0); power_zone = powercap_register_zone(&rd->power_zone, control_type, "psys", NULL, @@ -1456,8 +1457,9 @@ static void rapl_remove_package(struct rapl_package *rp) } /* called from CPU hotplug notifier, hotplug lock held */ -static struct rapl_package *rapl_add_package(int cpu, int pkgid) +static struct rapl_package *rapl_add_package(int cpu) { + int id = topology_physical_package_id(cpu); struct rapl_package *rp; int ret; @@ -1466,7 +1468,7 @@ static struct rapl_package *rapl_add_package(int cpu, int pkgid) return ERR_PTR(-ENOMEM); /* add the new package to the list */ - rp->id = pkgid; + rp->id = id; rp->lead_cpu = cpu; /* check if the package contains valid domains */ @@ -1497,12 +1499,11 @@ static struct rapl_package 
*rapl_add_package(int cpu, int pkgid) */ static int rapl_cpu_online(unsigned int cpu) { - int pkgid = topology_physical_package_id(cpu); struct rapl_package *rp; - rp = find_package_by_id(pkgid); + rp = rapl_find_package_domain(cpu); if (!rp) { - rp = rapl_add_package(cpu, pkgid); + rp = rapl_add_package(cpu); if (IS_ERR(rp)) return PTR_ERR(rp); } @@ -1512,11 +1513,10 @@ static int rapl_cpu_online(unsigned int cpu) static int rapl_cpu_down_prep(unsigned int cpu) { - int pkgid = topology_physical_package_id(cpu); struct rapl_package *rp; int lead_cpu; - rp = find_package_by_id(pkgid); + rp = rapl_find_package_domain(cpu); if (!rp) return 0; -- 2.18.0-rc0
[PATCH 11/22] thermal/x86_pkg_temp_thermal: Support multi-die/package
From: Zhang Rui Package temperature sensors are actually implemented in hardware per-die. Thus, the new multi-die/package systems sport multiple package thermal zones for each package. Update the x86_pkg_temp_thermal to be "multi-die-aware", so it can expose multiple zones per package, instead of just one. Signed-off-by: Zhang Rui Signed-off-by: Len Brown --- drivers/thermal/intel/x86_pkg_temp_thermal.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c b/drivers/thermal/intel/x86_pkg_temp_thermal.c index 1ef937d799e4..405b3858900a 100644 --- a/drivers/thermal/intel/x86_pkg_temp_thermal.c +++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c @@ -122,7 +122,7 @@ */ static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu) { - int pkgid = topology_logical_package_id(cpu); + int pkgid = topology_logical_die_id(cpu); if (pkgid >= 0 && pkgid < max_packages) return packages[pkgid]; @@ -353,7 +353,7 @@ static int pkg_thermal_notify(u64 msr_val) static int pkg_temp_thermal_device_add(unsigned int cpu) { - int pkgid = topology_logical_package_id(cpu); + int pkgid = topology_logical_die_id(cpu); u32 tj_max, eax, ebx, ecx, edx; struct pkg_device *pkgdev; int thres_count, err; @@ -449,7 +449,7 @@ static int pkg_thermal_cpu_offline(unsigned int cpu) * worker will see the package anymore. */ if (lastcpu) { - packages[topology_logical_package_id(cpu)] = NULL; + packages[topology_logical_die_id(cpu)] = NULL; /* After this point nothing touches the MSR anymore. 
*/ wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT, pkgdev->msr_pkg_therm_low, pkgdev->msr_pkg_therm_high); @@ -515,7 +515,7 @@ static int __init pkg_temp_thermal_init(void) if (!x86_match_cpu(pkg_temp_thermal_ids)) return -ENODEV; - max_packages = topology_max_packages(); + max_packages = topology_max_packages() * topology_max_die_per_package(); packages = kcalloc(max_packages, sizeof(struct pkg_device *), GFP_KERNEL); if (!packages) -- 2.18.0-rc0
[PATCH 20/22] hwmon/coretemp: rename internal variables to zones from packages
From: Len Brown Syntax update only -- no logical or functional change. In response to the new multi-die/package changes, update variable names to use the more generic thermal "zone" terminology, instead of "package", as the zones can refer to either packages or die. Signed-off-by: Len Brown Cc: Zhang Rui --- drivers/hwmon/coretemp.c | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c index c64ce32d3214..4ebed90d81aa 100644 --- a/drivers/hwmon/coretemp.c +++ b/drivers/hwmon/coretemp.c @@ -109,10 +109,10 @@ struct platform_data { struct device_attribute name_attr; }; -/* Keep track of how many package pointers we allocated in init() */ -static int max_packages __read_mostly; -/* Array of package pointers. Serialized by cpu hotplug lock */ -static struct platform_device **pkg_devices; +/* Keep track of how many zone pointers we allocated in init() */ +static int max_zones __read_mostly; +/* Array of zone pointers. Serialized by cpu hotplug lock */ +static struct platform_device **zone_devices; static ssize_t show_label(struct device *dev, struct device_attribute *devattr, char *buf) @@ -435,10 +435,10 @@ static int chk_ucode_version(unsigned int cpu) static struct platform_device *coretemp_get_pdev(unsigned int cpu) { - int pkgid = topology_logical_die_id(cpu); + int id = topology_logical_die_id(cpu); - if (pkgid >= 0 && pkgid < max_packages) - return pkg_devices[pkgid]; + if (id >= 0 && id < max_zones) + return zone_devices[id]; return NULL; } @@ -544,7 +544,7 @@ static int coretemp_probe(struct platform_device *pdev) struct device *dev = &pdev->dev; struct platform_data *pdata; - /* Initialize the per-package data structures */ + /* Initialize the per-zone data structures */ pdata = devm_kzalloc(dev, sizeof(struct platform_data), GFP_KERNEL); if (!pdata) return -ENOMEM; @@ -579,13 +579,13 @@ static struct platform_driver coretemp_driver = { static struct platform_device 
*coretemp_device_add(unsigned int cpu) { - int err, pkgid = topology_logical_die_id(cpu); + int err, zoneid = topology_logical_die_id(cpu); struct platform_device *pdev; - if (pkgid < 0) + if (zoneid < 0) return ERR_PTR(-ENOMEM); - pdev = platform_device_alloc(DRVNAME, pkgid); + pdev = platform_device_alloc(DRVNAME, zoneid); if (!pdev) return ERR_PTR(-ENOMEM); @@ -595,7 +595,7 @@ static struct platform_device *coretemp_device_add(unsigned int cpu) return ERR_PTR(err); } - pkg_devices[pkgid] = pdev; + zone_devices[zoneid] = pdev; return pdev; } @@ -703,7 +703,7 @@ static int coretemp_cpu_offline(unsigned int cpu) * the rest. */ if (cpumask_empty(&pd->cpumask)) { - pkg_devices[topology_logical_die_id(cpu)] = NULL; + zone_devices[topology_logical_die_id(cpu)] = NULL; platform_device_unregister(pdev); return 0; } @@ -741,10 +741,10 @@ static int __init coretemp_init(void) if (!x86_match_cpu(coretemp_ids)) return -ENODEV; - max_packages = topology_max_packages() * topology_max_die_per_package(); - pkg_devices = kcalloc(max_packages, sizeof(struct platform_device *), + max_zones = topology_max_packages() * topology_max_die_per_package(); + zone_devices = kcalloc(max_zones, sizeof(struct platform_device *), GFP_KERNEL); - if (!pkg_devices) + if (!zone_devices) return -ENOMEM; err = platform_driver_register(&coretemp_driver); @@ -760,7 +760,7 @@ static int __init coretemp_init(void) outdrv: platform_driver_unregister(&coretemp_driver); - kfree(pkg_devices); + kfree(zone_devices); return err; } module_init(coretemp_init) @@ -769,7 +769,7 @@ static void __exit coretemp_exit(void) { cpuhp_remove_state(coretemp_hp_online); platform_driver_unregister(&coretemp_driver); - kfree(pkg_devices); + kfree(zone_devices); } module_exit(coretemp_exit) -- 2.18.0-rc0
[PATCH 18/22] perf/x86/intel/cstate: Support multi-die/package
From: Kan Liang Some cstate counters become die-scope on Xeon Cascade Lake-AP. The perf cstate driver needs to support die-scope cstate counters. Use topology_die_cpumask() to replace topology_core_cpumask(). For previous platforms which don't have multi-die, topology_die_cpumask() is identical to topology_core_cpumask(). There is no functional change for previous platforms. Name the die-scope PMU "cstate_die". Signed-off-by: Kan Liang Cc: Peter Zijlstra Signed-off-by: Len Brown --- arch/x86/events/intel/cstate.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c index 94a4b7fc75d0..52c5fea29457 100644 --- a/arch/x86/events/intel/cstate.c +++ b/arch/x86/events/intel/cstate.c @@ -302,7 +302,7 @@ static int cstate_pmu_event_init(struct perf_event *event) return -EINVAL; event->hw.event_base = pkg_msr[cfg].msr; cpu = cpumask_any_and(&cstate_pkg_cpu_mask, - topology_core_cpumask(event->cpu)); + topology_die_cpumask(event->cpu)); } else { return -ENOENT; } @@ -385,7 +385,7 @@ static int cstate_cpu_exit(unsigned int cpu) if (has_cstate_pkg && cpumask_test_and_clear_cpu(cpu, &cstate_pkg_cpu_mask)) { - target = cpumask_any_but(topology_core_cpumask(cpu), cpu); + target = cpumask_any_but(topology_die_cpumask(cpu), cpu); /* Migrate events if there is a valid target */ if (target < nr_cpu_ids) { cpumask_set_cpu(target, &cstate_pkg_cpu_mask); @@ -414,7 +414,7 @@ static int cstate_cpu_init(unsigned int cpu) * in the package cpu mask as the designated reader. 
*/ target = cpumask_any_and(&cstate_pkg_cpu_mask, -topology_core_cpumask(cpu)); +topology_die_cpumask(cpu)); if (has_cstate_pkg && target >= nr_cpu_ids) cpumask_set_cpu(cpu, &cstate_pkg_cpu_mask); @@ -661,7 +661,13 @@ static int __init cstate_init(void) } if (has_cstate_pkg) { - err = perf_pmu_register(&cstate_pkg_pmu, cstate_pkg_pmu.name, -1); + if (topology_max_die_per_package() > 1) { + err = perf_pmu_register(&cstate_pkg_pmu, + "cstate_die", -1); + } else { + err = perf_pmu_register(&cstate_pkg_pmu, + cstate_pkg_pmu.name, -1); + } if (err) { has_cstate_pkg = false; pr_info("Failed to register cstate pkg pmu\n"); -- 2.18.0-rc0
[PATCH 12/22] powercap/intel_rapl: update rapl domain name and debug messages
From: Zhang Rui

The RAPL domain "name" attribute contains "Package-N", which is
ambiguous on multi-die per-package systems. Update the name to
"package-X-die-Y" on those systems. There is no change on systems
without multiple dies per package.

Also update the driver debug messages to use the new name.

Signed-off-by: Zhang Rui
Signed-off-by: Len Brown
Acked-by: Rafael J. Wysocki
Cc: linux...@vger.kernel.org
---
 drivers/powercap/intel_rapl.c | 57 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 9202dbcef96d..ad78c1d08260 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -178,12 +178,15 @@ struct rapl_domain {
 #define power_zone_to_rapl_domain(_zone) \
 	container_of(_zone, struct rapl_domain, power_zone)
 
+/* maximum rapl package domain name: package-%d-die-%d */
+#define PACKAGE_DOMAIN_NAME_LENGTH 30
+
 /* Each rapl package contains multiple domains, these are the common
  * data across RAPL domains within a package.
  */
 struct rapl_package {
-	unsigned int id;	/* physical package/socket id */
+	unsigned int id;	/* logical die id, equals physical 1-die systems */
 	unsigned int nr_domains;
 	unsigned long domain_map;	/* bit map of active domains */
 	unsigned int power_unit;
@@ -198,6 +201,7 @@ struct rapl_package {
 	int lead_cpu;	/* one active cpu per package for access */
 	/* Track active cpus */
 	struct cpumask cpumask;
+	char name[PACKAGE_DOMAIN_NAME_LENGTH];
 };
 
 struct rapl_defaults {
@@ -926,8 +930,8 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu)
 	value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
 	rp->time_unit = 1000000 / (1 << value);
 
-	pr_debug("Core CPU package %d energy=%dpJ, time=%dus, power=%duW\n",
-		 rp->id, rp->energy_unit, rp->time_unit, rp->power_unit);
+	pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n",
+		 rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
 
 	return 0;
 }
@@ -951,8 +955,8 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu)
 	value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
 	rp->time_unit = 1000000 / (1 << value);
 
-	pr_debug("Atom package %d energy=%dpJ, time=%dus, power=%duW\n",
-		 rp->id, rp->energy_unit, rp->time_unit, rp->power_unit);
+	pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n",
+		 rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
 
 	return 0;
 }
@@ -1181,7 +1185,7 @@ static void rapl_update_domain_data(struct rapl_package *rp)
 	u64 val;
 
 	for (dmn = 0; dmn < rp->nr_domains; dmn++) {
-		pr_debug("update package %d domain %s data\n", rp->id,
+		pr_debug("update %s domain %s data\n", rp->name,
 			 rp->domains[dmn].name);
 		/* exclude non-raw primitives */
 		for (prim = 0; prim < NR_RAW_PRIMITIVES; prim++) {
@@ -1206,7 +1210,6 @@ static void rapl_unregister_powercap(void)
 static int rapl_package_register_powercap(struct rapl_package *rp)
 {
 	struct rapl_domain *rd;
-	char dev_name[17]; /* max domain name = 7 + 1 + 8 for int + 1 for null*/
 	struct powercap_zone *power_zone = NULL;
 	int nr_pl, ret;
 
@@ -1217,20 +1220,16 @@ static int rapl_package_register_powercap(struct rapl_package *rp)
 	for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) {
 		if (rd->id == RAPL_DOMAIN_PACKAGE) {
 			nr_pl = find_nr_power_limit(rd);
-			pr_debug("register socket %d package domain %s\n",
-				 rp->id, rd->name);
-			memset(dev_name, 0, sizeof(dev_name));
-			snprintf(dev_name, sizeof(dev_name), "%s-%d",
-				 rd->name, rp->id);
+			pr_debug("register package domain %s\n", rp->name);
 			power_zone = powercap_register_zone(&rd->power_zone,
 							    control_type,
-							    dev_name, NULL,
+							    rp->name, NULL,
 							    &zone_ops[rd->id],
 							    nr_pl,
 							    &constraint_ops);
 			if (IS_ERR(power_zone)) {
-