Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
Hi Balbir,

On 04/14/2016 11:10 AM, Balbir Singh wrote:
> On 13/04/16 04:06, Akshay Adiga wrote:
>> This patch brings down the global pstate at a slower rate than the
>> local pstate. The frequency transition latency from pmin to pmax is
>> observed to be of a few milliseconds' granularity, which incurs a
>> performance penalty during sudden frequency ramp-ups. Holding the
>> global pstate higher than the local pstate therefore makes subsequent
>> ramp-ups faster.
>
> What domains does local and global refer to?

The *global pstate* is a chip-level entity, so the global entity
(voltage) is managed across several cores. With a DCM (Dual Chip
Module), it is more of a socket-level entity, and hence voltage is
managed across both chips.

The *local pstate* is a core-level entity, so the local entity
(frequency) is managed across threads.

I will include this in the commit message. Thanks.

>> +/*
>> + * Quadratic equation which gives the percentage rampdown for time
>> + * elapsed in milliseconds. time can be between 0 and
>> + * MAX_RAMP_DOWN_TIME (milliseconds).
>> + * This equation approximates to y = -4e-6 x^2
>
> Thanks for documenting this, but I think it will also be good to
> explain why we use y = -4e-6 * x^2 as opposed to any other magic
> numbers.

Well, it empirically worked out best this way. On an idle system we
want the global pstate to ramp down from the max value to the min over
a span of ~5 seconds. We also want to ramp down slowly at first and
rapidly later on, hence the equation. I will try to make the comment
more informative.

Regards
Akshay Adiga
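As a quick sanity check of that approximation, here is a stand-alone
user-space sketch (not the kernel code itself) that reuses the
MAX_RAMP_DOWN_TIME and ramp_down_percent() definitions quoted from the
patch:

	#include <stdio.h>

	#define MAX_RAMP_DOWN_TIME	5120
	#define ramp_down_percent(time)	((time * time) >> 18)

	int main(void)
	{
		/* time^2 / 2^18 ~= 3.8e-6 * time^2, close to the documented 4e-6 * x^2 */
		for (unsigned int t = 0; t <= MAX_RAMP_DOWN_TIME; t += 1024)
			printf("t = %4u ms -> %3u%% ramp-down\n", t, ramp_down_percent(t));
		return 0;
	}

At t = 5120 ms the macro yields 5120^2 / 262144 = 100, i.e. the global
pstate has fully ramped down after ~5 seconds, and the percentage grows
slowly at first (4% at 1024 ms) and quickly later (64% at 4096 ms),
matching the behaviour described above.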
Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
On 13/04/16 04:06, Akshay Adiga wrote:
> This patch brings down the global pstate at a slower rate than the
> local pstate. The frequency transition latency from pmin to pmax is
> observed to be of a few milliseconds' granularity, which incurs a
> performance penalty during sudden frequency ramp-ups. Holding the
> global pstate higher than the local pstate therefore makes subsequent
> ramp-ups faster.

What domains does local and global refer to?

> A global per-policy structure is maintained to keep track of the
> global and local pstate changes. The global pstate is brought down
> using a parabolic equation. The ramp-down time to pmin is set to
> 6 seconds. To make sure that the global pstate is dropped at regular
> intervals, a timer is queued every 2 seconds, which eventually brings
> the pstate down to the local pstate.
>
> Iozone results show a fairly consistent performance boost.
> YCSB on redis shows improved max latencies in most cases.
>
> Iozone write/rewrite tests were made with file sizes 200704 kB and
> 401408 kB with different record sizes. The following table shows
> IO operations/sec with and without the patch.
>
> Iozone results (in op/sec) (mean over 3 iterations)
>
> filesize-recordsize-IOtype   with patch   without patch   % change
> ------------------------------------------------------------------
> 200704-1-SeqWrite               1616532         1615425       0.06
> 200704-1-Rewrite                2423195         2303130       5.21
> 200704-2-SeqWrite               1628577         1602620       1.61
> 200704-2-Rewrite                2428264         2312154       5.02
> 200704-4-SeqWrite               1617605         1617182       0.02
> 200704-4-Rewrite                2430524         2351238       3.37
> 200704-8-SeqWrite               1629478         1600436       1.81
> 200704-8-Rewrite                2415308         2298136       5.09
> 200704-16-SeqWrite              1619632         1618250       0.08
> 200704-16-Rewrite               2396650         2352591       1.87
> 200704-32-SeqWrite              1632544         1598083       2.15
> 200704-32-Rewrite               2425119         2329743       4.09
> 200704-64-SeqWrite              1617812         1617235       0.03
> 200704-64-Rewrite               2402021         2321080       3.48
> 200704-128-SeqWrite             1631998         1600256       1.98
> 200704-128-Rewrite              2422389         2304954       5.09
> 200704-256-SeqWrite             1617065         1616962       0.00
> 200704-256-Rewrite              2432539         2301980       5.67
> 200704-512-SeqWrite             1632599         1598656       2.12
> 200704-512-Rewrite              2429270         2323676       4.54
> 200704-1024-SeqWrite            1618758         1616156       0.16
> 200704-1024-Rewrite             2431631         2315889       4.99
> 401408-1-SeqWrite               1631479         1608132       1.45
> 401408-1-Rewrite                2501550         2459409       1.71
> 401408-2-SeqWrite               1617095         1626069      -0.55
> 401408-2-Rewrite                2507557         2443621       2.61
> 401408-4-SeqWrite               1629601         1611869       1.10
> 401408-4-Rewrite                2505909         2462098       1.77
> 401408-8-SeqWrite               1617110         1626968      -0.60
> 401408-8-Rewrite                2512244         2456827       2.25
> 401408-16-SeqWrite              1632609         1609603       1.42
> 401408-16-Rewrite               2500792         2451405       2.01
> 401408-32-SeqWrite              1619294         1628167      -0.54
> 401408-32-Rewrite               2510115         2451292       2.39
> 401408-64-SeqWrite              1632709         1603746       1.80
> 401408-64-Rewrite               2506692         2433186       3.02
> 401408-128-SeqWrite             1619284         1627461      -0.50
> 401408-128-Rewrite              2518698         2453361       2.66
> 401408-256-SeqWrite             1634022         1610681       1.44
> 401408-256-Rewrite              2509987         2446328       2.60
> 401408-512-SeqWrite             1617524         1628016      -0.64
> 401408-512-Rewrite              2504409         2442899       2.51
> 401408-1024-SeqWrite            1629812         1611566       1.13
> 401408-1024-Rewrite             2507620         2442968       2.64
>
> Tested with YCSB workloada over redis for 1 million records and
> 1 million operations. Each test was carried out with target operations
> per second and persistence disabled.
>
> Max-latency (in us) (mean over 5 iterations)
> --------------------------------------------
> op/s    Operation    with patch    without patch    % change
> 15000   Read            61480.6          50261.4       22.32
> 15000   cleanup           215.2
Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
On 13-04-16, 23:27, Akshay Adiga wrote:
> On 04/13/2016 10:33 AM, Viresh Kumar wrote:
>>> +void gpstate_timer_handler(unsigned long data)
>>> +{
>>> +	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
>>
>> no need to cast.
>
> Maybe I need a cast here, because data is an unsigned long (unlike
> other places, where it is a void *). On building without the cast, it
> throws a warning.

My bad, yeah :(

>>> +	if (freq_data.gpstate_id != freq_data.pstate_id)
>>> +		ret = queue_gpstate_timer(gpstates);
>>
>> ret not used.
>
> Should I make it void instead of returning int, as I cannot do much
> even if it fails, except for notifying?

Sure.

--
viresh
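For context: with the timer API of that era (before timer_setup()), the
callback receives an unsigned long rather than a pointer, so converting
it back to the pointer stored at init time needs an explicit cast. A
minimal sketch of the two sides, based on the code quoted later in this
thread:

	/* Handler side: data arrives as an unsigned long, not a void *. */
	void gpstate_timer_handler(unsigned long data)
	{
		struct cpufreq_policy *policy = (struct cpufreq_policy *)data;

		/* ... ramp the global pstate down towards the local pstate ... */
	}

	/* Matching setup side, from the patch's init path: */
	gpstates->timer.data = (unsigned long)policy;
	gpstates->timer.function = gpstate_timer_handler;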
Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
Hi Viresh,

Thanks for reviewing in detail. I will correct all the comments related
to coding standards in my next patch.

On 04/13/2016 10:33 AM, Viresh Kumar wrote:
> Comments mostly on the coding standards which you have *not* followed.
> Also, please run checkpatch --strict next time you send patches
> upstream.

Thanks for pointing out the --strict option, I was not aware of it. I
will run checkpatch --strict on the next versions.

> On 12-04-16, 23:36, Akshay Adiga wrote:
>> +
>> +/*
>> + * While resetting we don't want "timer" fields to be set to zero as we
>> + * may lose track of timer and will not be able to cleanly remove it
>> + */
>> +#define reset_gpstates(policy)	memset(policy->driver_data, 0,\
>> +				sizeof(struct global_pstate_info)-\
>> +				sizeof(struct timer_list)-\
>> +				sizeof(spinlock_t))
>
> That's super *ugly*. Why don't you create a simple routine which will
> set the 5 integer variables to 0 in a straightforward way ?

Yes, I will create a routine.

>> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>>  	unsigned long val;
>>  	unsigned long pstate_ul =
>>  		((struct powernv_smp_call_data *) freq_data)->pstate_id;
>> +	unsigned long gpstate_ul =
>> +		((struct powernv_smp_call_data *) freq_data)->gpstate_id;
>
> Remove these unnecessary casts and do:
>
> struct powernv_smp_call_data *freq_data = data; //Name func arg as data
>
> And then use freq_data->*.

OK. Will do that.

>> +/*
>> + * gpstate_timer_handler
>> + *
>> + * @data: pointer to cpufreq_policy on which timer was queued
>> + *
>> + * This handler brings down the global pstate closer to the local pstate
>> + * according quadratic equation. Queues a new timer if it is still not equal
>> + * to local pstate
>> + */
>> +void gpstate_timer_handler(unsigned long data)
>> +{
>> +	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
>
> no need to cast.

Maybe I need a cast here, because data is an unsigned long (unlike
other places, where it is a void *). On building without the cast, it
throws a warning.

>> +	struct global_pstate_info *gpstates =
>> +		(struct global_pstate_info *)policy->driver_data;
>> +	struct powernv_smp_call_data freq_data;
>> +	int ret;
>> +
>> +	ret = spin_trylock(&gpstates->gpstate_lock);
>
> no need of 'ret' for just this, simply do:
>
> if (!spin_trylock())...

Sure, will do that.

>> +	if (!ret)
>> +		return;
>> +
>> +	gpstates->last_sampled_time += time_diff;
>> +	gpstates->elapsed_time += time_diff;
>> +	freq_data.pstate_id = gpstates->last_lpstate;
>> +	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
>> +	    (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
>> +		freq_data.gpstate_id = freq_data.pstate_id;
>> +		reset_gpstates(policy);
>> +		gpstates->highest_lpstate = freq_data.pstate_id;
>> +	} else {
>> +		freq_data.gpstate_id = calculate_global_pstate(
>
> You can't break a line after ( of a function call :) Let it go beyond
> 80 columns if it has to.

Maybe I will try to keep it inside 80 columns with a temporary variable
instead of freq_data.gpstate_id.

>> +			gpstates->elapsed_time, gpstates->highest_lpstate,
>> +			freq_data.pstate_id);
>> +	}
>> +
>> +	/* If local pstate is equal to global pstate, rampdown is over
>> +	 * So timer is not required to be queued.
>> +	 */
>
> Bad style again.

>> +	if (freq_data.gpstate_id != freq_data.pstate_id)
>> +		ret = queue_gpstate_timer(gpstates);
>
> ret not used.

Should I make it void instead of returning int, as I cannot do much
even if it fails, except for notifying?

>> +gpstates_done:
>> +	gpstates->last_sampled_time = cur_msec;
>> +	gpstates->last_gpstate = freq_data.gpstate_id;
>> +	gpstates->last_lpstate = freq_data.pstate_id;
>> +
>>  	/*
>>  	 * Use smp_call_function to send IPI and execute the
>>  	 * mtspr on target CPU.
>>  	 * We could do that without IPI
>>  	 * if current CPU is within policy->cpus (core)
>>  	 */
>>  	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
>> +	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
>> +
>>  	return 0;
>>  }
>>
>> +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>
> Add this after the init() routine.

OK, will do it.

>> +	policy->driver_data = gpstates;
>> +
>> +	/* initialize timer */
>> +	init_timer_deferrable(&gpstates->timer);
>> +	gpstates->timer.data = (unsigned long) policy;
>> +	gpstates->timer.function = gpstate_timer_handler;
>> +	gpstates->timer.expires = jiffies +
>> +		msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
>> +
>> +	pr_info("Added global_pstate_info & timer for %d cpu\n", base);
>>
>>  	return cpufreq_table_validate_and_show(policy, powernv_freqs);
>
> Who will free gpstates if this fails ?

Thanks for pointing out. Will fix in v2.

Regards
Akshay Adiga
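For reference, the "simple routine" Viresh is asking for could look
something like this: a sketch assuming the struct global_pstate_info
fields quoted in his review, with the timer and lock fields deliberately
left untouched so the timer can still be removed cleanly:

	static inline void reset_gpstates(struct cpufreq_policy *policy)
	{
		struct global_pstate_info *gpstates = policy->driver_data;

		/* Reset only the 5 bookkeeping integers, not timer/lock. */
		gpstates->highest_lpstate = 0;
		gpstates->elapsed_time = 0;
		gpstates->last_sampled_time = 0;
		gpstates->last_lpstate = 0;
		gpstates->last_gpstate = 0;
	}

Unlike the memset() macro, this stays correct even if the struct layout
is later reordered.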
Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
Comments mostly on the coding standards which you have *not* followed.
Also, please run checkpatch --strict next time you send patches
upstream.

On 12-04-16, 23:36, Akshay Adiga wrote:
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c

> +#define MAX_RAMP_DOWN_TIME	5120
> +#define ramp_down_percent(time)	((time * time)>>18)

You need spaces around >>

> +
> +/*Interval after which the timer is queued to bring down global pstate*/

Bad comment style, should be /* ... */

> +#define GPSTATE_TIMER_INTERVAL	2000
> +
> +/*
> + * global_pstate_info :

This is bad as well.. See how doc style comments are written at
separate places.

> + * per policy data structure to maintain history of global pstates
> + *
> + * @highest_lpstate : the local pstate from which we are ramping down

No space required before :

> + * @elapsed_time : time in ms spent in ramping down from highest_lpstate
> + * @last_sampled_time : time from boot in ms when global pstates were last set
> + * @last_lpstate , last_gpstate : last set values for local and global pstates
> + * @timer : is used for ramping down if cpu goes idle for a long time with
> + *	global pstate held high
> + * @gpstate_lock : a spinlock to maintain synchronization between routines
> + *	called by the timer handler and governer's target_index calls
> + */
> +struct global_pstate_info {
> +	int highest_lpstate;
> +	unsigned int elapsed_time;
> +	unsigned int last_sampled_time;
> +	int last_lpstate;
> +	int last_gpstate;
> +	spinlock_t gpstate_lock;
> +	struct timer_list timer;
> +};
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)	memset(policy->driver_data, 0,\
> +				sizeof(struct global_pstate_info)-\
> +				sizeof(struct timer_list)-\
> +				sizeof(spinlock_t))

That's super *ugly*. Why don't you create a simple routine which will
set the 5 integer variables to 0 in a straightforward way ?

> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>  	unsigned long val;
>  	unsigned long pstate_ul =
>  		((struct powernv_smp_call_data *) freq_data)->pstate_id;
> +	unsigned long gpstate_ul =
> +		((struct powernv_smp_call_data *) freq_data)->gpstate_id;

Remove these unnecessary casts and do:

struct powernv_smp_call_data *freq_data = data; //Name func arg as data

And then use freq_data->*.

>
>  	val = get_pmspr(SPRN_PMCR);
>  	val = val & 0x0000FFFFFFFFFFFFULL;
>
>  	pstate_ul = pstate_ul & 0xFF;
> +	gpstate_ul = gpstate_ul & 0xFF;
>
>  	/* Set both global(bits 56..63) and local(bits 48..55) PStates */
> -	val = val | (pstate_ul << 56) | (pstate_ul << 48);
> +	val = val | (gpstate_ul << 56) | (pstate_ul << 48);

> /*

You need /** here. See comments everywhere please, they are mostly done
in a wrong way.

> + * calculate_global_pstate:
> + *
> + * @elapsed_time : elapsed time in milliseconds
> + * @local_pstate : new local pstate
> + * @highest_lpstate : pstate from which its ramping down
> + *
> + * Finds the appropriate global pstate based on the pstate from which its
> + * ramping down and the time elapsed in ramping down. It follows a quadratic
> + * equation which ensures that it reaches ramping down to pmin in 5sec.
> + */
> +inline int calculate_global_pstate(unsigned int elapsed_time,
> +		int highest_lpstate, int local_pstate)

Not aligned properly, checkpatch --strict will tell you what to do.
> +{
> +	int pstate_diff;
> +
> +	/*
> +	 * Using ramp_down_percent we get the percentage of rampdown
> +	 * that we are expecting to be dropping. Difference between
> +	 * highest_lpstate and powernv_pstate_info.min will give a absolute
> +	 * number of how many pstates we will drop eventually by the end of
> +	 * 5 seconds, then just scale it get the number pstates to be dropped.
> +	 */
> +	pstate_diff = ((int)ramp_down_percent(elapsed_time) *
> +			(highest_lpstate - powernv_pstate_info.min))/100;
> +
> +	/* Ensure that global pstate is >= to local pstate */
> +	if (highest_lpstate - pstate_diff < local_pstate)
> +		return local_pstate;
> +	else
> +		return (highest_lpstate - pstate_diff);

Unnecessary ().

> +}
> +
> +inline int queue_gpstate_timer(struct global_pstate_info *gpstates)

Looks like many of the functions you wrote now should be marked
'static' as well.

> +{
> +	unsigned int timer_interval;
> +
> +	/* Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But

Bad style again:

/*
 * ...
 * ...
 */

> +	 * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
> +	 *
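The quoted queue_gpstate_timer() is cut off above, but the intent of its
comment reads as a clamp: fire after GPSTATE_TIMER_INTERVAL ms unless
that would run past MAX_RAMP_DOWN_TIME. A sketch of that logic (my
reconstruction of the truncated body, not the patch's exact code):

	static int queue_gpstate_timer(struct global_pstate_info *gpstates)
	{
		unsigned int timer_interval;

		/*
		 * Fire after GPSTATE_TIMER_INTERVAL ms, but clamp the interval
		 * so the total ramp-down never runs past MAX_RAMP_DOWN_TIME ms.
		 */
		if (gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL >
				MAX_RAMP_DOWN_TIME)
			timer_interval = MAX_RAMP_DOWN_TIME -
					gpstates->elapsed_time;
		else
			timer_interval = GPSTATE_TIMER_INTERVAL;

		mod_timer(&gpstates->timer,
			  jiffies + msecs_to_jiffies(timer_interval));

		return 0;
	}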
[PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
This patch brings down the global pstate at a slower rate than the
local pstate. The frequency transition latency from pmin to pmax is
observed to be of a few milliseconds' granularity, which incurs a
performance penalty during sudden frequency ramp-ups. Holding the
global pstate higher than the local pstate therefore makes subsequent
ramp-ups faster.

A global per-policy structure is maintained to keep track of the global
and local pstate changes. The global pstate is brought down using a
parabolic equation. The ramp-down time to pmin is set to 6 seconds. To
make sure that the global pstate is dropped at regular intervals, a
timer is queued every 2 seconds, which eventually brings the pstate
down to the local pstate.

Iozone results show a fairly consistent performance boost.
YCSB on redis shows improved max latencies in most cases.

Iozone write/rewrite tests were made with file sizes 200704 kB and
401408 kB with different record sizes. The following table shows
IO operations/sec with and without the patch.

Iozone results (in op/sec) (mean over 3 iterations)

filesize-recordsize-IOtype   with patch   without patch   % change
------------------------------------------------------------------
200704-1-SeqWrite               1616532         1615425       0.06
200704-1-Rewrite                2423195         2303130       5.21
200704-2-SeqWrite               1628577         1602620       1.61
200704-2-Rewrite                2428264         2312154       5.02
200704-4-SeqWrite               1617605         1617182       0.02
200704-4-Rewrite                2430524         2351238       3.37
200704-8-SeqWrite               1629478         1600436       1.81
200704-8-Rewrite                2415308         2298136       5.09
200704-16-SeqWrite              1619632         1618250       0.08
200704-16-Rewrite               2396650         2352591       1.87
200704-32-SeqWrite              1632544         1598083       2.15
200704-32-Rewrite               2425119         2329743       4.09
200704-64-SeqWrite              1617812         1617235       0.03
200704-64-Rewrite               2402021         2321080       3.48
200704-128-SeqWrite             1631998         1600256       1.98
200704-128-Rewrite              2422389         2304954       5.09
200704-256-SeqWrite             1617065         1616962       0.00
200704-256-Rewrite              2432539         2301980       5.67
200704-512-SeqWrite             1632599         1598656       2.12
200704-512-Rewrite              2429270         2323676       4.54
200704-1024-SeqWrite            1618758         1616156       0.16
200704-1024-Rewrite             2431631         2315889       4.99
401408-1-SeqWrite               1631479         1608132       1.45
401408-1-Rewrite                2501550         2459409       1.71
401408-2-SeqWrite               1617095         1626069      -0.55
401408-2-Rewrite                2507557         2443621       2.61
401408-4-SeqWrite               1629601         1611869       1.10
401408-4-Rewrite                2505909         2462098       1.77
401408-8-SeqWrite               1617110         1626968      -0.60
401408-8-Rewrite                2512244         2456827       2.25
401408-16-SeqWrite              1632609         1609603       1.42
401408-16-Rewrite               2500792         2451405       2.01
401408-32-SeqWrite              1619294         1628167      -0.54
401408-32-Rewrite               2510115         2451292       2.39
401408-64-SeqWrite              1632709         1603746       1.80
401408-64-Rewrite               2506692         2433186       3.02
401408-128-SeqWrite             1619284         1627461      -0.50
401408-128-Rewrite              2518698         2453361       2.66
401408-256-SeqWrite             1634022         1610681       1.44
401408-256-Rewrite              2509987         2446328       2.60
401408-512-SeqWrite             1617524         1628016      -0.64
401408-512-Rewrite              2504409         2442899       2.51
401408-1024-SeqWrite            1629812         1611566       1.13
401408-1024-Rewrite             2507620         2442968       2.64

Tested with YCSB workloada over redis for 1 million records and
1 million operations. Each test was carried out with target operations
per second and persistence disabled.

Max-latency (in us) (mean over 5 iterations)
--------------------------------------------
op/s    Operation    with patch    without patch    % change
15000   Read            61480.6          50261.4       22.32
15000   cleanup           215.2            293.6      -26.70
15000   update          25666.2          25163.8        2.00
25000   Read            32626.2          89525.4      -63.56
25000
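To make the parabolic ramp-down concrete, a worked example (the pstate
numbers here are illustrative, not taken from the patch): suppose
highest_lpstate sits 40 pstates above pmin. With
ramp_down_percent(t) = t^2 >> 18 and integer scaling by pstate_diff =
percent * 40 / 100:

	t = 1024 ms ->   4%  -> drop  1 pstate
	t = 2048 ms ->  16%  -> drop  6 pstates
	t = 4096 ms ->  64%  -> drop 25 pstates
	t = 5120 ms -> 100%  -> global pstate reaches the local pstate

So the global pstate lingers near its peak early on, keeping ramp-ups
cheap, and falls off steeply toward the end of the MAX_RAMP_DOWN_TIME
(5120 ms) window.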