Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-04-19 Thread changhuaixin



> On Mar 18, 2021, at 11:10 PM, Peter Zijlstra  wrote:
> 
> On Thu, Mar 18, 2021 at 08:59:44AM -0400, Phil Auld wrote:
>> I admit to not having followed all the history of this patch set. That
>> said, when I see the above I just think your quota is too low for your
>> workload.
> 
> This.
> 
>> The burst (mis?)feature seems to be a way to bypass the quota.  And it
>> sort of assumes cooperative containers that will only burst when they
>> need it and then go back to normal. 
> 
> It's not entirely unreasonable or unheard of. There's soft realtime
> systems that use this to increase the utilization with the trade-off
> that you're going to miss deadlines once every so often.
> 
> If you do it right, you can calculate the probabilities. Or usually the
> other way around, you calculate the allowed variance/burst given a P
> value for making the deadline.
> 
> Input then isn't the WCET for each task, but a runtime distribution as
> measured for your workload on your system etc..
> 
> I used to have papers on this, but I can't seem to find them in a hurry.
> 

Hi, I have done some reading on queueing theory and drafted a problem
definition.

Divide real time into discrete periods, as cfs_b does. Assume there are m
cgroups using CFS Bandwidth Control. During each period, the i-th cgroup
demands u_i CPU time, where u_i follows some distribution (exponential,
Poisson or other). At the end of a period, if the sum of u_i is less than or
equal to 100%, we call it an "idle" state. The number of periods between two
"idle" states gives the WCET of the tasks running in those periods.

With plain quota, an "idle" state is guaranteed at the end of each period, so
the WCET is the length of one period. When CPU Burst is enabled, the sum of
u_i may exceed 100%, and the excess workload is handled in the following
periods; the WCET then becomes the number of periods between two "idle"
states.

Next, we are going to calculate the probability that the WCET is longer than
one period, and the average WCET for a given burst under a given runtime
distribution.

This is all based on previous mails. I am sending it to check whether
anything is wrong with the problem definition.
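
To make this concrete, here is a minimal Monte Carlo sketch of the model
above for a single cgroup (illustrative user-space C, built with -lm; the
assumptions are for illustration only: quota normalized to one period,
exponential per-period demand, and enough spare cores that the group is
limited by its runtime rather than by the machine):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
	const double avg_util = 0.8;	/* mean demand per period; quota = 1 */
	const double burst = 4.0;	/* burst buffer, in units of quota */
	const long nperiods = 1000000;
	double pending = 0.0;		/* unfinished work, in periods */
	double runtime = 1.0;		/* available bandwidth */
	long gap = 0, gaps = 0, long_gaps = 0, gap_sum = 0;

	srand(1);
	for (long i = 0; i < nperiods; i++) {
		double r = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
		double u = -avg_util * log(r);	/* exponential demand sample */
		double run;

		/* refill: unused quota accumulates up to quota + burst */
		runtime = fmin(runtime + 1.0, 1.0 + burst);
		pending += u;
		run = fmin(pending, runtime);	/* work beyond runtime is throttled */
		pending -= run;
		runtime -= run;
		gap++;
		if (pending < 1e-9) {		/* "idle" state at period end */
			pending = 0.0;
			gaps++;
			gap_sum += gap;
			if (gap > 1)
				long_gaps++;
			gap = 0;
		}
	}
	printf("P(WCET > 1 period) = %.4f\n", (double)long_gaps / gaps);
	printf("average WCET = %.2f periods\n", (double)gap_sum / gaps);
	return 0;
}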


Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-19 Thread changhuaixin



> On Mar 19, 2021, at 8:39 PM, changhuaixin  
> wrote:
> 
> 
> 
>> On Mar 18, 2021, at 11:05 PM, Peter Zijlstra  wrote:
>> 
>> On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
>>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
>> 
>>>> So what is the typical avg,stdev,max and mode for the workloads where
>>>> you find you need this?
>>>> 
>>>> I would really like to put a limit on the burst. IMO a workload that has
>>>> a burst many times longer than the quota is plain broken.
>>> 
>>> I see. Then the problem comes down to how large the limit on burst shall be.
>>> 
>>> I have sampled the CPU usage of a bursty container in 100ms periods. The 
>>> statistics are:
>> 
>> So CPU usage isn't exactly what is required, job execution time is what
>> you're after. Assuming there is a relation...
>> 
> 
> Yes, job execution time is important. To be specific, the goal is to improve
> the CPU usage of the whole system to reduce the total cost of ownership,
> while not damaging job execution time. This requires lowering the average
> CPU resource of underutilized cgroups while allowing their bursts at the
> same time.
> 
>>> average : 42.2%
>>> stddev  : 81.5%
>>> max : 844.5%
>>> P95 : 183.3%
>>> P99 : 437.0%
>> 
>> Then your WCET is 844% of 100ms? That is .84s.
>> 
>> But you forgot your mode; what is the most common duration, given P95 is
>> so high, I doubt that avg is representative of the most common duration.
>> 
> 
> It is true.
> 
>>> If quota is 100000ms, burst buffer needs to be 8 times more in order
>>> for this workload not to be throttled.
>> 
>> Where does that 100s come from? And an 800s burst is bizarre.
>> 
>> Did you typo [us] as [ms] ?
>> 
> 
> Sorry, it should be 100000us.
> 
>>> I can't say this is typical, but these workloads exist. On a machine
>>> running Kubernetes containers, where there is often room for such
>>> burst and the interference is hard to notice, users would prefer
>>> allowing such burst to being throttled occasionally.
>> 
>> Users also want ponies. I've no idea what kubernetes actually is or what
>> it has to do with containers. That's all just word salad.
>> 
>>> In this sense, I suggest limiting the burst buffer to 16 times the quota
>>> or thereabouts. That should be enough for users to improve tail latency
>>> caused by throttling. And users might choose a smaller value, or even
>>> none, if the interference is unacceptable. What do you think?
>> 
>> Well, normal RT theory would suggest you pick your runtime around 200%
>> to get that P95 and then allow a full period burst to get your P99, but
>> that same RT theory would also have you calculate the resulting
>> interference and see if that works with the rest of the system...
>> 
> 
> I am sorry that I don't know much about the RT theory you mentioned, and 
> can't provide
> the desired calculation now. But I'd like to try and do some reading if that 
> is needed.
> 
>> 16 times is horrific.
> 
> So can we decide on a more relative value now? Or are the interference
> probabilities still the missing piece?

A more [realistic] value, I mean.

> 
> Is the paper you mentioned called "Insensitivity results in statistical
> bandwidth sharing", or one of the related papers on statistical bandwidth
> results under some kind of fairness?



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-19 Thread changhuaixin



> On Mar 18, 2021, at 8:59 PM, Phil Auld  wrote:
> 
> On Thu, Mar 18, 2021 at 09:26:58AM +0800 changhuaixin wrote:
>> 
>> 
>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
>>> 
>>> On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
>>> 
>>>>> Why do you allow such a large burst? I would expect something like:
>>>>> 
>>>>>   if (burst > quota)
>>>>>   return -EINVAL;
>>>>> 
>>>>> That limits the variance in the system. Allowing super long bursts seems
>>>>> to defeat the entire purpose of bandwidth control.
>>>> 
>>>> I understand your concern. Surely a large burst value might allow super
>>>> long bursts, preventing bandwidth control entirely for a long time.
>>>> 
>>>> However, I am afraid it is hard to decide what the maximum burst
>>>> should be from the bandwidth control mechanism itself. Allowing burst
>>>> up to the amount of quota is helpful, but not enough. There are
>>>> cases where workloads are so bursty that they need many times more than
>>>> quota in a single period. In such cases, limiting burst to the amount
>>>> of quota fails to meet the needs.
>>>> 
>>>> Thus, I wonder whether it is acceptable to leave the maximum burst to
>>>> users. If the desired behavior is to allow some burst, configure burst
>>>> accordingly. If that causes variance, use shares or another fairness
>>>> mechanism. And if the fairness mechanism still fails to coordinate,
>>>> maybe do not use burst.
>>> 
>>> It's not fairness, bandwidth control is about isolation, and burst
>>> introduces interference.
>>> 
>>>> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
>>>> still needed maybe.
>>> 
>>> So what is the typical avg,stdev,max and mode for the workloads where
>>> you find you need this?
>>> 
>>> I would really like to put a limit on the burst. IMO a workload that has
>>> a burst many times longer than the quota is plain broken.
>> 
>> I see. Then the problem comes down to how large the limit on burst shall be.
>> 
>> I have sampled the CPU usage of a bursty container in 100ms periods. The 
>> statistics are:
>> average  : 42.2%
>> stddev   : 81.5%
>> max  : 844.5%
>> P95  : 183.3%
>> P99  : 437.0%
>> 
>> If quota is 100000ms, burst buffer needs to be 8 times more in order for
>> this workload not to be throttled.
>> I can't say this is typical, but these workloads exist. On a machine running 
>> Kubernetes containers,
>> where there is often room for such burst and the interference is hard to 
>> notice, users would prefer
>> allowing such burst to being throttled occasionally.
>> 
> 
> I admit to not having followed all the history of this patch set. That said, 
> when I see the above I just
> think your quota is too low for your workload.
> 

Yeah, more quota is helpful for this workload. But that usually prevents us 
from improving the total CPU
usage by putting more work onto a single machine.

> The burst (mis?)feature seems to be a way to bypass the quota.  And it sort 
> of assumes cooperative
> containers that will only burst when they need it and then go back to normal. 
> 
>> In this sense, I suggest limiting the burst buffer to 16 times the quota
>> or thereabouts. That should be enough for users to improve tail latency
>> caused by throttling. And users might choose a smaller value, or even
>> none, if the interference is unacceptable. What do you think?
>> 
> 
> Having quotas that can regularly be exceeded by 16 times seems to make the 
> concept of a quota
> meaningless.  I'd have thought a burst would be some small percentage.
> 
> What if several such containers burst at the same time? Can't that lead to 
> overcommit that can effect
> other well-behaved containers?
> 

I see. Maybe there should be some calculation on the probabilities of that, as 
Peter has replied.
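
As a toy illustration of that calculation (the numbers are made up, not
measured): if each of 10 such containers bursts independently with
probability 0.05 in a given period, the chance that three or more burst
together is 1 - sum_{k=0}^{2} C(10,k) * 0.05^k * 0.95^(10-k), which is about
1.2%. With measured per-period demand distributions instead of a fixed
probability, the same kind of tail bound would tell how often the machine is
overcommitted.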

> 
> Cheers,
> Phil
> 
> -- 



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-19 Thread changhuaixin



> On Mar 18, 2021, at 11:05 PM, Peter Zijlstra  wrote:
> 
> On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
> 
>>> So what is the typical avg,stdev,max and mode for the workloads where you 
>>> find
>>> you need this?
>>> 
>>> I would really like to put a limit on the burst. IMO a workload that has
>>> a burst many times longer than the quota is plain broken.
>> 
>> I see. Then the problem comes down to how large the limit on burst shall be.
>> 
>> I have sampled the CPU usage of a bursty container in 100ms periods. The 
>> statistics are:
> 
> So CPU usage isn't exactly what is required, job execution time is what
> you're after. Assuming there is a relation...
> 

Yes, job execution time is important. To be specific, the goal is to improve
the CPU usage of the whole system to reduce the total cost of ownership,
while not damaging job execution time. This requires lowering the average CPU
resource of underutilized cgroups while allowing their bursts at the same
time.

>> average  : 42.2%
>> stddev   : 81.5%
>> max  : 844.5%
>> P95  : 183.3%
>> P99  : 437.0%
> 
> Then your WCET is 844% of 100ms? That is .84s.
> 
> But you forgot your mode; what is the most common duration, given P95 is
> so high, I doubt that avg is representative of the most common duration.
> 

It is true.

>> If quota is 100000ms, burst buffer needs to be 8 times more in order
>> for this workload not to be throttled.
> 
> Where does that 100s come from? And an 800s burst is bizarre.
> 
> Did you typo [us] as [ms] ?
> 

Sorry, it should be 100000us.

>> I can't say this is typical, but these workloads exist. On a machine
>> running Kubernetes containers, where there is often room for such
>> burst and the interference is hard to notice, users would prefer
>> allowing such burst to being throttled occasionally.
> 
> Users also want ponies. I've no idea what kubernetes actually is or what
> it has to do with containers. That's all just word salad.
> 
>> In this sense, I suggest limiting the burst buffer to 16 times the quota
>> or thereabouts. That should be enough for users to improve tail latency
>> caused by throttling. And users might choose a smaller value, or even
>> none, if the interference is unacceptable. What do you think?
> 
> Well, normal RT theory would suggest you pick your runtime around 200%
> to get that P95 and then allow a full period burst to get your P99, but
> that same RT theory would also have you calculate the resulting
> interference and see if that works with the rest of the system...
> 

I am sorry that I don't know much about the RT theory you mentioned, and can't 
provide
the desired calculation now. But I'd like to try and do some reading if that is 
needed.

> 16 times is horrific.

So can we decide on a more relative value now? Or are the interference
probabilities still the missing piece?

Is the paper you mentioned called "Insensitivity results in statistical
bandwidth sharing", or one of the related papers on statistical bandwidth
results under some kind of fairness?



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-17 Thread changhuaixin



> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
> 
> On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
> 
>>> Why do you allow such a large burst? I would expect something like:
>>> 
>>> if (burst > quota)
>>> return -EINVAL;
>>> 
>>> That limits the variance in the system. Allowing super long bursts seems
>>> to defeat the entire purpose of bandwidth control.
>> 
>> I understand your concern. Surely a large burst value might allow super
>> long bursts, preventing bandwidth control entirely for a long time.
>> 
>> However, I am afraid it is hard to decide what the maximum burst
>> should be from the bandwidth control mechanism itself. Allowing burst up
>> to the amount of quota is helpful, but not enough. There are
>> cases where workloads are so bursty that they need many times more than
>> quota in a single period. In such cases, limiting burst to the amount of
>> quota fails to meet the needs.
>> 
>> Thus, I wonder whether it is acceptable to leave the maximum burst to
>> users. If the desired behavior is to allow some burst, configure burst
>> accordingly. If that causes variance, use shares or another fairness
>> mechanism. And if the fairness mechanism still fails to coordinate, maybe
>> do not use burst.
> 
> It's not fairness, bandwidth control is about isolation, and burst
> introduces interference.
> 
>> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
>> still needed maybe.
> 
> So what is the typical avg,stdev,max and mode for the workloads where you find
> you need this?
> 
> I would really like to put a limit on the burst. IMO a workload that has
> a burst many times longer than the quota is plain broken.

I see. Then the problem comes down to how large the limit on burst shall be.

I have sampled the CPU usage of a bursty container in 100ms periods. The 
statistics are:
average : 42.2%
stddev  : 81.5%
max : 844.5%
P95 : 183.3%
P99 : 437.0%

If quota is 100000ms, burst buffer needs to be 8 times more in order for this
workload not to be throttled.
I can't say this is typical, but these workloads exist. On a machine running 
Kubernetes containers,
where there is often room for such burst and the interference is hard to 
notice, users would prefer
allowing such burst to being throttled occasionally.

In this sense, I suggest limiting the burst buffer to 16 times the quota or
thereabouts. That should be enough for users to improve tail latency caused
by throttling. And users might choose a smaller value, or even none, if the
interference is unacceptable. What do you think?
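
If such a cap were adopted, it could be a one-line check next to the existing
quota validation in tg_set_cfs_bandwidth() (a sketch of the idea only, using
the 16x multiplier from the suggestion above; not merged code):

	/*
	 * Sketch: bound burst to 16 times quota, per the suggestion
	 * above (overflow of 16 * quota is ignored for brevity).
	 */
	if (quota != RUNTIME_INF && burst > 16 * quota)
		return -EINVAL;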


Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-17 Thread changhuaixin



> On Mar 16, 2021, at 5:54 PM, Peter Zijlstra  wrote:
> 
> On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
>> @@ -8982,6 +8983,12 @@ static int tg_set_cfs_bandwidth(struct task_group 
>> *tg, u64 period, u64 quota)
>>  if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>>  return -EINVAL;
>> 
>> +/*
>> + * Bound burst to defend burst against overflow during bandwidth shift.
>> + */
>> +if (burst > max_cfs_runtime)
>> +return -EINVAL;
> 
> Why do you allow such a large burst? I would expect something like:
> 
>   if (burst > quota)
>   return -EINVAL;
> 
> That limits the variance in the system. Allowing super long bursts seems
> to defeat the entire purpose of bandwidth control.

I understand your concern. Surely a large burst value might allow super long
bursts, preventing bandwidth control entirely for a long time.

However, I am afraid it is hard to decide what the maximum burst should be
from the bandwidth control mechanism itself. Allowing burst up to the amount
of quota is helpful, but not enough. There are cases where workloads are so
bursty that they need many times more than quota in a single period. In such
cases, limiting burst to the amount of quota fails to meet the needs.

Thus, I wonder whether it is acceptable to leave the maximum burst to users.
If the desired behavior is to allow some burst, configure burst accordingly.
If that causes variance, use shares or another fairness mechanism. And if the
fairness mechanism still fails to coordinate, maybe do not use burst.

In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is still 
needed maybe.
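
For reference, the accumulation being discussed boils down to the following
per-period refill rule (a standalone sketch of the semantics using the names
from the patch; the real code additionally handles overflow via max_overrun
and the RUNTIME_INF case):

#include <stdint.h>

struct cfs_bw_sketch {
	uint64_t quota;		/* bandwidth refilled each period */
	uint64_t burst;		/* max accumulation on top of quota */
	uint64_t runtime;	/* currently available bandwidth */
};

/*
 * On each period boundary, unused quota from previous periods
 * accumulates, capped at buffer = quota + burst.
 */
void refill_runtime(struct cfs_bw_sketch *b)
{
	uint64_t buffer = b->quota + b->burst;

	b->runtime += b->quota;
	if (b->runtime > buffer)
		b->runtime = buffer;
}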


Re: [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable

2021-03-12 Thread changhuaixin



> On Mar 10, 2021, at 9:04 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jan 21, 2021 at 07:04:51PM +0800, Huaixin Chang wrote:
>> Accumulate unused quota from previous periods, thus accumulated
>> bandwidth runtime can be used in the following periods. During
>> accumulation, take care of runtime overflow. Previous non-burstable
>> CFS bandwidth controller only assigns quota to runtime, which saves a lot.
>> 
>> A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
>> denote how many percent of burst is given on setting cfs bandwidth. By
>> default it is 0, which means no burst is allowed unless accumulated.
>> 
>> Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
>> switch for burst. It is enabled by default.
>> 
>> Signed-off-by: Huaixin Chang 
>> Signed-off-by: Shanpei Chen 
> 
> Identical invalid SoB chain.
> 
>> Reported-by: kernel test robot 
> 
> What exactly did the robot report; the whole patch?

A warning is reported by the robot. And I have fixed it in this series. I'll 
remove this line,
since it seems unnecessary.

> 
>> ---
>> include/linux/sched/sysctl.h |  2 ++
>> kernel/sched/core.c  | 31 +
>> kernel/sched/fair.c  | 47 
>> 
>> kernel/sched/sched.h |  4 ++--
>> kernel/sysctl.c  | 18 +
>> 5 files changed, 88 insertions(+), 14 deletions(-)
>> 
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index 3c31ba88aca5..3400828eaf2d 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -72,6 +72,8 @@ extern unsigned int 
>> sysctl_sched_uclamp_util_min_rt_default;
>> 
>> #ifdef CONFIG_CFS_BANDWIDTH
>> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>> #endif
>> 
>> #ifdef CONFIG_SCHED_AUTOGROUP
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 48d3bad12be2..fecf0f05ef0c 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
>>  */
>> const_debug unsigned int sysctl_sched_nr_migrate = 32;
>> 
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/*
>> + * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
>> + * 0 by default.
>> + */
>> +unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
>> +
>> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
>> +#endif
> 
> There's already an #ifdef block that contains that bandwidth_slice
> thing, see the previous hunk, so why create a new #ifdef here?
> 
> Also, personally I think percentages are over-represented as members of
> Q.
> 
Sorry, I don't quite understand the "members of Q". Is this saying that the 
percentages
are over-designed here?

>> @@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
>> const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
>> static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
>> /* More than 203 days if BW_SHIFT equals 20. */
>> -static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>> +const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>> 
>> static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>> 
>> @@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
>> u64 period, u64 quota,
>> {
>>  int i, ret = 0, runtime_enabled, runtime_was_enabled;
>>  struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>> -u64 buffer;
>> +u64 buffer, burst_onset;
>> 
>>  if (tg == _task_group)
>>  return -EINVAL;
>> @@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group 
>> *tg, u64 period, u64 quota,
>>  cfs_b->burst = burst;
>>  cfs_b->buffer = buffer;
>> 
>> -__refill_cfs_bandwidth_runtime(cfs_b);
>> +cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
>> +cfs_b->runtime = cfs_b->quota;
>> +
>> +/* burst_onset needed */
>> +if (cfs_b->quota != RUNTIME_INF &&
>> +sysctl_sched_cfs_bw_burst_enabled &&
>> +sysctl_sched_cfs_bw_burst_onset_percent > 0) {
> 
> 'creative' indentation again...
> 
> Also, this gives rise to the question as to why onset_percent is
> separate from enabled.

Odin noticed the percent thing, too. Maybe I will remove this and let cfsb
start with 0 burst. In this way, this if statement can be removed too.

> 
>> +
>> +burst_onset = do_div(burst, 100) *
>> +sysctl_sched_cfs_bw_burst_onset_percent;
> 
> and again..
> 
>> +
>> +cfs_b->runtime += burst_onset;
>> +cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
>> +}
>> 
>>  /* Restart the period timer (if active) to handle new period expiry: */
>>  if (runtime_enabled)
>> -start_cfs_bandwidth(cfs_b);
>> +

Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-03-12 Thread changhuaixin



> On Mar 10, 2021, at 7:11 PM, Odin Ugedal  wrote:
> 
> Hi,
> 
>> If there are cases where the "start bandwidth" matters, I think there is 
>> need to expose the
>> "start bandwidth" explicitly too. However, I doubt the existence of such 
>> cases from my view
>> and the two examples above.
> 
> Yeah, I don't think there will be any cases where users will be
> "depending" on having burst available,
> so I agree in that sense.
> 
>> In my thoughts, this patchset keeps cgroup usage within the quota in the 
>> longer term, and allows
>> cgroup to respond to a burst of work with the help of a reasonable burst 
>> buffer. If quota is set correctly
>> above average usage, and enough burst buffer is set to meet the needs of 
>> bursty work. In this
>> case, it makes no difference whether this cgroup runs with 0 start bandwidth 
>> or all of it.
>> Thus I used sysctl_sched_cfs_bw_burst_onset_percent to decided the start 
>> bandwidth
>> to leave some convenience here. If this sysctl interface is confusing, I 
>> wonder whether it
>> is a good idea not to expose this interface.
>> 
>> For the first case mentioned above, if Kubernetes users care about the
>> "start bandwidth" for process startup, maybe it is better to give all of
>> it rather than a part?
> 
> Yeah, I am a bit afraid there will be some confusion, so not sure if
> the sysctl is the best way to do it.
> 
> But I would like feedback from others to highlight the problem as
> well, that would be helpful. I think a simple "API"
> where you get 0 burst or full burst on "set" (the one we decide on)
> would be best to avoid unnecessary complexity.
> 
> Start burst when starting up a new process in a new cgroup might be
> helpful, so maybe that is a vote for
> full burst? However, in long term that doesn't matter, so 0 burst on
> start would work as well.
> 
>> For the second case with quota changes over time, I think it is important
>> to make sure each change lasts long enough to enforce the average quota
>> limit. Does it really matter to control "start burst" on each change?
> 
> No, I don't think so. Doing so would be another thing to set per
> cgroup, and that would just clutter the api
> more than necessary imo., since we cannot come up with any real use cases.
> 
>> It is a copy of runtime at period start, used to calculate burst time
>> during a period. Not quite remaining_runtime_prev_period.
> 
> Ahh, I see, I misunderstood the code. So in a essence it is
> "runtime_at_period_start"?
> 

Yes, it is "runtime_at_preiod_start".

>> Yeah, there is the updating problem. It is okey not to expose cfs_b->runtime 
>> then.
> 
> Yeah, I think dropping it all together is the best solution.
> 
> 
>> This comment does not mean any loss or unnecessary throttle for the
>> present cfsb.
>> All this means is that all quota refilling that is not done during timer 
>> stop should be
>> refilled on timer start, for the burstable cfsb.
>> 
>> Maybe I shall change this comment in some way if it is misleading?
> 
> I think I formulated my question badly. The comment makes sense, I am
> just trying to compare how "start_cfs_bandwidth"
> works after your patch compared to how it works currently. As I
> understand, without this patch "start_cfs_bandwidth" will
> never refill runtime, while with your patch, it will refill even when
> overrun=0 with burst disabled. Is that an intended change in
> behavior, or am I not understanding the patch?
> 

Good point. The way "start_cfs_bandwidth" works is changed indeed. The present 
cfs_b doesn't
have to refill bandwidth because quota is not used during the period before 
timer stops. With this patch,
runtime is refilled no matter burst is enabled or not. Do you suggest not 
refilling runtime unless burst
is enabled here?

> 
> On another note, I have also been testing this patch, and I am not
> able to reproduce your schbench results. Both with and without burst,
> it gets the same result, and nr_throttled stays at 0 (tested on a
> 32-core system). Can you try to rerun your tests with the mainline
> to see if you still get the same results? (Also, I see you are running
> with 30 threads. How many cores do your test setup have?). To actually
> say that the result is real, all cores used should maybe be
> exclusively reserved as well, to avoid issues where other processes
> cause a
> spike in latency.
> 

Spikes indeed cause trouble. If nr_throttled stays at 0, I suggest changing
quota from 700000 to 600000, which is still above the average utilization of
500%. I have rerun on a 64-core system and reproduced the results. And I
think it should work on a 32-core system too, as there are 20 active workers
in each round.

If you still have trouble, I suggest testing in the following way, which
should work on a two-core system:

mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 300000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

./schbench -m 1 -t 3 -r 

Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-02-27 Thread changhuaixin
Hi,

Sorry for my late reply.

> On Feb 9, 2021, at 9:17 PM, Odin Ugedal  wrote:
> 
> 
> Hi! This looks quite useful, but I have a few quick thoughts. :)
> 
> I know of a lot of people who would love this (especially some
> Kubernetes users)! I really like how this allows users to use cfs
> in a more dynamic and flexible way, without interfering with those
> who like to enforce strict quotas.
> 
> 
>> +++ b/kernel/sched/core.c
>> @@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
>> u64 period, u64
>> [...]
>> +/* burst_onset needed */
>> +if (cfs_b->quota != RUNTIME_INF &&
>> +sysctl_sched_cfs_bw_burst_enabled &&
>> +sysctl_sched_cfs_bw_burst_onset_percent > 0) {
>> +
>> +burst_onset = do_div(burst, 100) *
>> +sysctl_sched_cfs_bw_burst_onset_percent;
>> +
>> +cfs_b->runtime += burst_onset;
>> +cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
>> +}
> 
> I saw a comment about this behavior, but I think this can lead to a bit of
> confusion. If sysctl_sched_cfs_bw_burst_onset_percent=0, the amount of
> bandwidth when the first process starts up will depend on the time between
> the quota was set and the startup of the process, and that feel a bit like
> a "timing" race that end user application then will have to think about.
> 
> I suspect container runtimes and/or tools like Kubernetes will then have
> to tell users to set the value to a certain amount in order to make it
> work as expected.
> 
> Another thing is that when a cgroup has saved some time into the
> "burst quota", updating the quota, period or burst will then reset the
> "burst quota", even though eg. only the burst was changed. Some tools
> use dynamic quotas, resulting in multiple changes in the quota over time,
> and I am a bit scared that don't allowing them to control "start burst"
> on a write can be limiting.
> 
> Maybe we can allow people to set the "start bandwidth" explicitly when setting
> cfs_burst if they want to do that? (edit: that might be hard for cgroup v1,
> but I think that would be a good solution on cgroup v2).
> 
> This is however just my thoughts, and I am not 100% sure about what the
> best solution is, but if we merge a certain behavior, we have no real
> chance of changing it later.
> 

If there are cases where the "start bandwidth" matters, I think there is need 
to expose the
"start bandwidth" explicitly too. However, I doubt the existence of such cases 
from my view
and the two examples above.

In my thoughts, this patchset keeps cgroup usage within the quota in the
longer term, and allows a cgroup to respond to a burst of work with the help
of a reasonable burst buffer. If quota is set correctly above average usage
and enough burst buffer is set to meet the needs of bursty work, it makes no
difference whether this cgroup runs with 0 start bandwidth or all of it.
Thus I used sysctl_sched_cfs_bw_burst_onset_percent to decide the start
bandwidth, to leave some convenience here. If this sysctl interface is
confusing, I wonder whether it is a good idea not to expose it at all.

For the first case mentioned above, if Kubernetes users care about the "start
bandwidth" for process startup, maybe it is better to give all of it rather
than a part?

For the second case with quota changes over time, I think it is important to
make sure each change lasts long enough to enforce the average quota limit.
Does it really matter to control "start burst" on each change?



> 
>> +++ b/kernel/sched/sched.h
>> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
>>  u64 burst;
>>  u64 buffer;
>>  u64 max_overrun;
>> +u64 previous_runtime;
>>  s64 hierarchical_quota;
> 
> Maybe indicate that this was the remaining runtime _after_ the previous
> period ended? Not 100% sure, but maybe sometihing like
> 'remaining_runtime_prev_period' or 'end_runtime_prev_period'(as inspiration). 
>   
> 

It is a copy of runtime at period start, used to calculate burst time during
a period. Not quite remaining_runtime_prev_period.

> 
>> +++ b/kernel/sched/core.c
>> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, 
>> void *v)
>>  seq_printf(sf, "wait_sum %llu\n", ws);
>>  }
>> 
>> +seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
>> +seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
>> +seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
>> +
>>  return 0;
>> }
> 
> Looks like these metrics are missing from the cgroup v2 stats.
> 

Thanks, cgroup v2 stats and documentation should be handled. I will add them
in the following patchset.

> Are we sure it is smart to start exposing cfs_b->runtime, since it makes it
> harder to change the implementation at a later time? I don't think it is
> that useful, and if it is only 

Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-01-26 Thread changhuaixin



> On Jan 21, 2021, at 7:04 PM, Huaixin Chang  
> wrote:
> 
> Changelog
> 
> v3:
> 1. Fix another issue reported by test robot.
> 2. Update docs as Randy Dunlap suggested.
> 
> v2:
> 1. Fix an issue reported by test robot.
> 2. Rewriting docs. Appreciate any further suggestions or help.
> 
> The CFS bandwidth controller limits CPU requests of a task group to
> quota during each period. However, parallel workloads might be bursty
> so that they get throttled. And they are latency sensitive at the same
> time so that throttling them is undesired.
> 
> Scaling up period and quota allows greater burst capacity. But it might
> cause a longer stall until the next refill. We introduce "burst" to allow
> accumulating unused quota from previous periods, and to be assigned when
> a task group requests more CPU than quota during a specific period. Thus
> allowing CPU time requests as long as the average requested CPU time is
> below quota on the long run. The maximum accumulation is capped by burst
> and is set 0 by default, thus the traditional behaviour remains.
> 
> A huge drop of 99th tail latency from more than 500ms to 27ms is seen for
> real java workloads when using burst. Similar drops are seen when
> testing with schbench too:
> 
>   echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>   echo 700000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>   echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
>   echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
> 
>   # The average CPU usage is around 500%, which is 200ms CPU time
>   # every 40ms.
>   ./schbench -m 1 -t 30 -r 60 -c 10000 -R 500
> 
>   Without burst:
> 
>   Latency percentiles (usec)
>   50.0000th: 7
>   75.0000th: 8
>   90.0000th: 9
>   95.0000th: 10
>   *99.0000th: 933
>   99.5000th: 981
>   99.9000th: 3068
>   min=0, max=20054
>   rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 
> 9.33%
> 
>   With burst:
> 
>   Latency percentiles (usec)
>   50.0000th: 7
>   75.0000th: 8
>   90.0000th: 9
>   95.0000th: 9
>   *99.0000th: 12
>   99.5000th: 13
>   99.9000th: 19
>   min=0, max=406
>   rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 
> 0.12%
> 
> How much a workload will benefit from burstable CFS bandwidth control
> depends on how bursty and how latency sensitive it is.
> 
> Previously, Cong Wang and Konstantin Khlebnikov proposed similar
> feature:
> https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangc...@gmail.com/
> https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/
> 
> This time we present more latency statistics and handle overflow while
> accumulating.
> 
> Huaixin Chang (4):
>  sched/fair: Introduce primitives for CFS bandwidth burst
>  sched/fair: Make CFS bandwidth controller burstable
>  sched/fair: Add cfs bandwidth burst statistics
>  sched/fair: Add document for burstable CFS bandwidth control
> 
> Documentation/scheduler/sched-bwc.rst |  49 +++--
> include/linux/sched/sysctl.h  |   2 +
> kernel/sched/core.c   | 126 +-
> kernel/sched/fair.c   |  58 +---
> kernel/sched/sched.h  |   9 ++-
> kernel/sysctl.c   |  18 +
> 6 files changed, 232 insertions(+), 30 deletions(-)
> 
> -- 
> 2.14.4.44.g2045bb6

Ping, any new comments on this patchset? If there're no other concerns, I think 
it's ready to be merged?




Re: [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-01-12 Thread changhuaixin



> On Dec 21, 2020, at 9:53 PM, changhuaixin  
> wrote:
> 
> 
> 
>> On Dec 17, 2020, at 9:36 PM, Peter Zijlstra  wrote:
>> 
>> On Thu, Dec 17, 2020 at 03:46:17PM +0800, Huaixin Chang wrote:
>>> In this patch, we introduce the notion of CFS bandwidth burst. Unused
>>> "quota" from previous "periods" might be accumulated and used in the
>>> following "periods". The maximum amount of accumulated bandwidth is
>>> bounded by "burst". And the maximum amount of CPU a group can consume in
>>> a given period is "buffer", which is equivalent to "quota" + "burst" in
>>> case that this group has done enough accumulation.
>> 
>> Oh man, Juri, wasn't there a paper about statistical bandwidth
>> accounting somewhere? Where, if you replace every utilization by a
>> statistical variable, the end result is still useful?
>> 
>> That is, instead of something like; \Sum u_i <= 1, you get something
>> like: \Sum {avg(u),var(u)}_i <= {1, sqrt(\Sum var_i^2)} and you can
>> still proof bounded tardiness etc.. (assuming a gaussian distribution).
>> 
>> The proposed seems close to that, but not quite, and I'm afraid it's not
>> quite strong enough to still provide any guarantees.
> 
> After reading some papers on statistical bandwidth sharing, it occurs to me 
> that statistical bandwidth sharing is about the way bandwidth is shared 
> between competitors. I wonder if the paper you mentioned was "Insensitivity 
> results in statistical bandwidth sharing" or some paper referenced, which 
> showed the end result is insensitive to maybe the distribution or the arrival 
> pattern.
> 
> I am sorry that I cannot prove using statistical bandwidth sharing theory 
> now. However, I wonder if it is more acceptable to put rate-based cfsb after 
> share fairness, because the input stream of cfsb accounting is the output stream 
> of fairness. And that is in contrast with what statistical bandwidth sharing 
> does, in which fairness is used to share bandwidth upon the output stream of 
> rate-based control. In this way, maybe guarantees can be provided by share 
> fairness, and cfsb may use a larger buffer to allow more jitters.
> 
> A buffer, which is several times the quota, is able to handle large bursts, 
> and throttle threads soon when overloaded. The present cfsb, however, has to 
> be configured several times larger to handle jitters, which is ineffective 
> and does not provide the designed rate-based control.

Hi, peter.

Will you please have a look at my earlier reply.

The guarantee provided by cfsb is max-min fairness with a strict upper bound,
I think. And after this modification, it is still roughly that. Over longer
periods, each cfsb group is still limited to its quota on average.

When users configure a huge burst buffer, cfsb doesn't really throttle threads 
and upper bound is exceeded. In that case, guarantees are provided by share, 
which provides proportional fairness.

Thanks,
Huaixin


Re: [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2020-12-21 Thread changhuaixin



> On Dec 17, 2020, at 9:36 PM, Peter Zijlstra  wrote:
> 
> On Thu, Dec 17, 2020 at 03:46:17PM +0800, Huaixin Chang wrote:
>> In this patch, we introduce the notion of CFS bandwidth burst. Unused
>> "quota" from previous "periods" might be accumulated and used in the
>> following "periods". The maximum amount of accumulated bandwidth is
>> bounded by "burst". And the maximum amount of CPU a group can consume in
>> a given period is "buffer", which is equivalent to "quota" + "burst" in
>> case that this group has done enough accumulation.
> 
> Oh man, Juri, wasn't there a paper about statistical bandwidth
> accounting somewhere? Where, if you replace every utilization by a
> statistical variable, the end result is still useful?
> 
> That is, instead of something like; \Sum u_i <= 1, you get something
> like: \Sum {avg(u),var(u)}_i <= {1, sqrt(\Sum var_i^2)} and you can
> still proof bounded tardiness etc.. (assuming a gaussian distribution).
> 
> The proposed seems close to that, but not quite, and I'm afraid it's not
> quite strong enough to still provide any guarantees.

After reading some papers on statistical bandwidth sharing, it occurs to me 
that statistical bandwidth sharing is about the way bandwidth is shared between 
competitors. I wonder if the paper you mentioned was "Insensitivity results in 
statistical bandwidth sharing" or some paper referenced, which showed the end 
result is insensitive to maybe the distribution or the arrival pattern.

I am sorry that I cannot prove using statistical bandwidth sharing theory now. 
However, I wonder if it is more acceptable to put rate-based cfsb after share 
fairness, because the input stream of cfsb accounting is the output stream of 
fairness. And that is in contrast with what statistical bandwidth sharing does, 
in which fairness is used to share bandwidth upon the output stream of 
rate-based control. In this way, maybe guarantees can be provided by share 
fairness, and cfsb may use a larger buffer to allow more jitters.

A buffer, which is several times the quota, is able to handle large bursts,
and throttle threads soon when overloaded. The present cfsb, however, has to
be configured several times larger to handle jitters, which is ineffective
and does not provide the designed rate-based control.
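
As a rough worked instance of the statistical accounting idea quoted above
(my own illustrative numbers): with 10 groups, each with avg(u) = 8% and
stddev 3% per period, and treating them as independent, the aggregate demand
is approximately Gaussian with mean 80% and stddev sqrt(10) * 3% ~= 9.5%.
Demand then stays below 100% in a period with probability about
Phi((100 - 80) / 9.5) ~= Phi(2.1) ~= 98%, so a burst buffer only has to
absorb the remaining ~2% of periods.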


Re: [PATCH v2 0/3] Build ORC fast lookup table in scripts/sorttable tool

2020-08-18 Thread changhuaixin
Hi, Ingo

This patchset reverts the hacks from patchset v1. Also it includes some other 
fixes upon v1 as suggested.
Will you please have a look at this?

The previous links are:
https://lore.kernel.org/lkml/20200724135531.gb648...@gmail.com/

Thanks,
huaixin

> On Aug 7, 2020, at 12:17 PM, Huaixin Chang  
> wrote:
> 
> Move building of fast lookup table from boot to sorttable tool. This saves us
> 6380us boot time on Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz with 64 cores.
> It adds a little more than 7ms to build time when testing on the same CPU.
> 
> Changelog v2:
> 1. Write .orc_lookup section header via objtool
> 2. Move two ORC lookup table macro from orc_lookup.h into orc_types.h
> 3. Spell 'ORC' in capitalized fashion
> 
> Huaixin Chang (3):
>  objtool: Write .orc_lookup section header
>  scripts/sorttable: Build ORC fast lookup table via sorttable tool
>  x86/unwind/orc: Simplify unwind_init() for x86 boot
> 
> arch/x86/include/asm/orc_lookup.h  | 16 --
> arch/x86/include/asm/orc_types.h   | 16 ++
> arch/x86/kernel/unwind_orc.c   | 41 +--
> arch/x86/kernel/vmlinux.lds.S  |  2 +-
> scripts/sorttable.h| 96 +++---
> tools/arch/x86/include/asm/orc_types.h | 16 ++
> tools/objtool/orc_gen.c|  4 ++
> 7 files changed, 128 insertions(+), 63 deletions(-)
> 
> -- 
> 2.14.4.44.g2045bb6



Re: [PATCH 1/3] scripts/sorttable: Change section type of orc_lookup to SHT_PROGBITS

2020-08-06 Thread changhuaixin



> On Aug 6, 2020, at 11:08 PM, Ingo Molnar  wrote:
> 
> 
> * changhuaixin  wrote:
> 
>> Hi, Ingo
>> 
>> Another way to write SHT_PROGBITS is using elf_create_section to write 
>> orc_lookup table headers, when orc_unwind_ip table and orc_unwind table are 
>> written. Is this a better solution?
>> 
>> diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
>> index 3f98dcfbc177..860d4dcec8e6 100644
>> --- a/tools/objtool/orc_gen.c
>> +++ b/tools/objtool/orc_gen.c
>> @@ -183,6 +183,10 @@ int create_orc_sections(struct objtool_file *file)
>>u_sec = elf_create_section(file->elf, ".orc_unwind",
>>   sizeof(struct orc_entry), idx);
>> 
>> +   /* make flags of section orc_lookup right */
>> +   if (!elf_create_section(file->elf, ".orc_lookup", sizeof(int), 0))
>> +   return -1;
>> +
>>/* populate sections */
>>idx = 0;
>>for_each_sec(file, sec) {
> 
> Looks much nicer IMO.
> 
> Mind turning this into a proper patch that does it plus reverts the 
> hack?
> 
A new patchset is sent.

Thanks,
huaixin

> Thanks,
> 
>   Ingo



Re: [PATCH 1/3] scripts/sorttable: Change section type of orc_lookup to SHT_PROGBITS

2020-08-03 Thread changhuaixin
Hi, Ingo

Another way to write SHT_PROGBITS is using elf_create_section to write 
orc_lookup table headers, when orc_unwind_ip table and orc_unwind table are 
written. Is this a better solution?

diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
index 3f98dcfbc177..860d4dcec8e6 100644
--- a/tools/objtool/orc_gen.c
+++ b/tools/objtool/orc_gen.c
@@ -183,6 +183,10 @@ int create_orc_sections(struct objtool_file *file)
u_sec = elf_create_section(file->elf, ".orc_unwind",
   sizeof(struct orc_entry), idx);

+   /* make flags of section orc_lookup right */
+   if (!elf_create_section(file->elf, ".orc_lookup", sizeof(int), 0))
+   return -1;
+
/* populate sections */
idx = 0;
for_each_sec(file, sec) {

Thanks,
Huaixin

> On Jul 23, 2020, at 11:46 AM, Huaixin Chang  
> wrote:
> 
> In order to edit orc_lookup table via sorttable, type of section
> orc_lookup needs to be SHT_PROGBITS instead of SHT_NOBITS.
> 
> Linker script doesn't seem to allow manual specification of the section
> type, so just write a byte into the section instead.
> 
> Signed-off-by: Josh Poimboeuf 
> Signed-off-by: Huaixin Chang 
> Signed-off-by: Shile Zhang 
> ---
> include/asm-generic/vmlinux.lds.h | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index db600ef218d7..49f4f5bc6165 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -826,6 +826,8 @@
>   . += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /\
>   LOOKUP_BLOCK_SIZE) + 1) * 4;\
>   orc_lookup_end = .; \
> + /* HACK: force SHT_PROGBITS so sorttable can edit: */   \
> + BYTE(1);\
>   }
> #else
> #define ORC_UNWIND_TABLE
> -- 
> 2.14.4.44.g2045bb6



Re: [PATCH 2/3] scripts/sorttable: Build orc fast lookup table via sorttable tool

2020-07-26 Thread changhuaixin



> On Jul 24, 2020, at 9:53 PM, Ingo Molnar  wrote:
> 
> 
> * Huaixin Chang  wrote:
> 
>> Since orc tables are already sorted by sorttable tool, let us move
>> building of fast lookup table into sorttable tool too. This saves us
>> 6380us from boot time under Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
>> with 64 cores.
> 
> Neat!
> 
>> +struct orc_sort_param {
>> +size_t  lookup_table_size;
>> +unsigned int*orc_lookup_table;
>> +unsigned long   start_ip;
>> +size_t  text_size;
>> +unsigned intorc_num_entries;
>> +};
> 
>> 
>> +#define LOOKUP_BLOCK_ORDER  8
>> +#define LOOKUP_BLOCK_SIZE   (1 << LOOKUP_BLOCK_ORDER)
>> +
>> +for (i = 0; i < lookup_num_blocks-1; i++) {
>> +orc = __orc_find(g_orc_ip_table, g_orc_table,
>> + num_entries,
>> + lookup_start_ip + (LOOKUP_BLOCK_SIZE * i));
>> +if (!orc) {
>> +snprintf(g_err, ERRSTR_MAXSZ,
>> +"Corrupt .orc_unwind table\n");
>> +pthread_exit(g_err);
>> +}
>> +
>> +orc_lookup[i] = orc - g_orc_table;
>> +}
>> +
>> +/* Initialize the ending block: */
>> +orc = __orc_find(g_orc_ip_table, g_orc_table, num_entries,
>> + lookup_stop_ip);
>> +if (!orc) {
>> +snprintf(g_err, ERRSTR_MAXSZ, "Corrupt .orc_unwind table\n");
>> +pthread_exit(g_err);
>> +}
>> +orc_lookup[lookup_num_blocks-1] = orc - g_orc_table;
> 
> Yeah, so now this definition of LOOKUP_BLOCK_* basically duplicates the
> arch/x86/include/asm/orc_lookup.h size, with no obvious link between
> the two. This is asking for trouble.
> 
> <asm/orc_lookup.h> looks simple enough - can we include it in
> scripts/sorttable.h?
> 
> Or better yet, please move these two defines into <asm/orc_types.h>,
> which is already included in sorttable.h.
> 
Thanks!
Moving these two into <asm/orc_types.h> and the capitalized spelling will be
done in the following patches.

Huaixin

> BTW., please update your patches to spell 'ORC' in a capitalized 
> fashion, like most of the existing code does:
> 
>>  /* create thread to sort ORC unwind tables concurrently */
> 
> Thanks,
> 
>   Ingo
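
For context, the table being built here is consumed at unwind time roughly as
follows (a sketch based on the kernel's orc_find(); the names match those
used in the quoted patch above):

	/*
	 * orc_lookup[] maps each LOOKUP_BLOCK_SIZE-byte block of .text to
	 * a starting index into the sorted .orc_unwind_ip table, so the
	 * binary search for an instruction pointer is narrowed to one
	 * block's worth of entries:
	 */
	unsigned int idx = (ip - lookup_start_ip) / LOOKUP_BLOCK_SIZE;
	unsigned int start = orc_lookup[idx];
	unsigned int stop = orc_lookup[idx + 1] + 1;
	/* ...then binary-search .orc_unwind_ip[start..stop) for ip */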



Re: [PATCH v3 0/3] Build ORC fast lookup table in scripts/sorttable tool

2020-06-28 Thread changhuaixin
Hi Josh, will you please have a look at this patchset?

There might be another way to set SHT_PROGBITS on section .orc_lookup, by
writing its section header when the orc_unwind and orc_unwind_ip tables are
written. It might be as follows:

diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
index 3f98dcfbc177..860d4dcec8e6 100644
--- a/tools/objtool/orc_gen.c
+++ b/tools/objtool/orc_gen.c
@@ -183,6 +183,10 @@ int create_orc_sections(struct objtool_file *file)
u_sec = elf_create_section(file->elf, ".orc_unwind",
   sizeof(struct orc_entry), idx);

+   /* make flags of section orc_lookup right */
+   if (!elf_create_section(file->elf, ".orc_lookup", sizeof(int), 0))
+   return -1;
+

What do you think about this way of setting SHT_PROGBITS?

> On Jun 3, 2020, at 10:39 PM, Huaixin Chang  
> wrote:
> 
> Move building of fast lookup table from boot to sorttable tool. This saves us
> 6380us boot time on Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz with 64 cores.
> It adds a little more than 7ms to build time when testing on the same CPU.
> 
> Changelog v3:
> 1. Modify annotation of unwind_init().
> 
> Changelog v2:
> 1. Type of section orc_lookup needs to be SHT_PROGBITS.
> 2. unwind_init() cannot be removed totally as setting lookup_num_blocks is 
> needed.
> 
> Huaixin Chang (3):
>  scripts/sorttable: Change section type of orc_lookup to SHT_PROGBITS
>  scripts/sorttable: Build orc fast lookup table via sorttable tool
>  x86/unwind/orc: Simplify unwind_init() for x86 boot
> 
> arch/x86/kernel/unwind_orc.c  | 41 +---
> include/asm-generic/vmlinux.lds.h |  2 +
> scripts/sorttable.h   | 99 ---
> 3 files changed, 96 insertions(+), 46 deletions(-)
> 
> -- 
> 2.14.4.44.g2045bb6



Re: [PATCH 0/2] Build ORC fast lookup table in scripts/sorttable tool

2020-06-03 Thread changhuaixin



> On Jun 2, 2020, at 1:38 AM, Josh Poimboeuf  wrote:
> 
> On Sun, May 31, 2020 at 01:26:54PM +0800, changhuaixin wrote:
>>   It turned out to be an alignment problem. If sh_size of the previous
>>   section orc_unwind is not 4-byte aligned, sh_offset of the following
>>   orc_lookup section is not 4-byte aligned either. However, the VMA of
>>   section orc_lookup is aligned to the nearest 4 bytes. Thus, the
>>   orc_lookup section means two different areas to the scripts/sorttable
>>   tool and the kernel.
>> 
>>   Sections headers look like this when it happens:
>> 
>>   12 .orc_unwind_ip 00172124  82573b28  02573b28  01773b28
>>2**0
>>CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
>>   13 .orc_unwind   0022b1b6  826e5c4c  026e5c4c  018e5c4c
>>2**0
>>CONTENTS, ALLOC, LOAD, READONLY, DATA
>>   14 .orc_lookup   0003003c  82910e04  02910e04  01b10e02
>>2**0
>>ALLOC
>>   15 .vvar 1000  82941000  02941000  01b41000
>>2**4
>>CONTENTS, ALLOC, LOAD, DATA
>> 
>>   The sorttable tool uses the area starting at offset 0x01b10e02 for
>>   0x0003003c bytes, while the kernel uses the area starting at VMA
>>   0x82910e04 for 0x0003003c bytes, meaning that each entry in this table
>>   used by the kernel is actually 2 bytes behind the corresponding entry
>>   set by the sorttable tool.
>> 
>>   Any suggestion on fixing this?
> 
> The VMA and LMA are both 4-byte aligned.  The file offset alignment
> (0x01b10e02) shouldn't matter.
> 
> Actually it looks like the problem is that the section doesn't have
> CONTENTS, so it's just loaded as a BSS section (all zeros).  The section
> needs to be type SHT_PROGBITS instead of SHT_NOBITS.
> 
> $ readelf -S vmlinux |grep orc_lookup
>  [16] .orc_lookup   NOBITS   82b68418  01d68418
> 
> I tried to fix it with
> 
> diff --git a/scripts/sorttable.h b/scripts/sorttable.h
> index a36c76c17be4..76adb1fb88f8 100644
> --- a/scripts/sorttable.h
> +++ b/scripts/sorttable.h
> @@ -341,6 +341,7 @@ static int do_sort(Elf_Ehdr *ehdr,
>   param.lookup_table_size = s->sh_size;
>   param.orc_lookup_table = (unsigned int *)
>   ((void *)ehdr + s->sh_offset);
> + w(SHT_PROGBITS, &s->sh_type);
>   }
>   if (!strcmp(secstrings + idx, ".text")) {
>   param.text_size = s->sh_size;
> 
> 
> But that makes kallsyms unhappy, so I guess we need to do it from the
> linker script where .orc_lookup is created.
> 
> Linker script doesn't seem to allow manual specification of the section
> type, so this is the best I could come up with:
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index db600ef218d7..49f4f5bc6165 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -826,6 +826,8 @@
>   . += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /\
>   LOOKUP_BLOCK_SIZE) + 1) * 4;\
>   orc_lookup_end = .; \
> + /* HACK: force SHT_PROGBITS so sorttable can edit: */   \
> + BYTE(1);\
>   }
> #else
> #define ORC_UNWIND_TABLE

Thanks! It works.




Re: [PATCH 0/2] Build ORC fast lookup table in scripts/sorttable tool

2020-05-26 Thread changhuaixin
Thanks for your kind reply. Let me have a check.

> On May 23, 2020, at 2:28 AM, Josh Poimboeuf  wrote:
> 
> On Wed, Apr 29, 2020 at 02:46:24PM +0800, Huaixin Chang wrote:
>> Move building of fast lookup table from boot to sorttable tool. This saves us
>> 6380us boot time on Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz with 64 cores.
>> 
>> Huaixin Chang (2):
>>  scripts/sorttable: Build orc fast lookup table via sorttable tool
>>  x86/unwind/orc: Remove unwind_init() from x86 boot
>> 
>> arch/x86/include/asm/unwind.h |  2 -
>> arch/x86/kernel/setup.c   |  2 -
>> arch/x86/kernel/unwind_orc.c  | 51 --
>> scripts/sorttable.h   | 99 ---
>> 4 files changed, 92 insertions(+), 62 deletions(-)
> 
> I tested this (rebased on tip/master), it seems to break ORC
> completely... e.g. /proc/self/stack is empty.
> 
> -- 
> Josh



Re: [PATCH 0/2] Build ORC fast lookup table in scripts/sorttable tool

2020-04-29 Thread changhuaixin



> On Apr 29, 2020, at 4:49 PM, Peter Zijlstra  wrote:
> 
> On Wed, Apr 29, 2020 at 02:46:24PM +0800, Huaixin Chang wrote:
>> Move building of fast lookup table from boot to sorttable tool. This saves us
>> 6380us boot time on Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz with 64 cores.
> 
> And what does it add to the build time?

It takes a little more than 7ms to build the fast lookup table in sorttable
on the same CPU. And it is on the critical path.