Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-14 Thread Akshay Adiga

Hi Balbir,

On 04/14/2016 11:10 AM, Balbir Singh wrote:


On 13/04/16 04:06, Akshay Adiga wrote:

This patch brings down the global pstate at a slower rate than the local
pstate. The frequency transition latency from pmin to pmax is observed
to be of a few milliseconds, so a performance penalty is incurred during
a sudden frequency ramp-up. Holding the global pstate higher than the
local pstate makes subsequent ramp-ups faster.

What domains do local and global refer to?


The *global pstate* is a chip-level entity, so the global quantity
(voltage) is managed across several cores. With a DCM (Dual Chip Module)
it is more of a socket-level entity, and the voltage is therefore managed
across both chips.

The *local pstate* is a core-level entity, so the local quantity
(frequency) is managed across the threads of a core.

I will include this in the commit message. Thanks.
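For reference, both values end up in the PMCR SPR, with the global pstate in
bits 56..63 and the local pstate in bits 48..55 (as in the set_pstate() hunk
below). A rough illustrative sketch of that encoding (pmcr_encode() is just a
name made up here for illustration, not a function in the patch):

/*
 * Illustrative only: pack the local and global pstate ids into the PMCR
 * value, keeping the lower 48 bits untouched.
 */
static inline unsigned long pmcr_encode(unsigned long pmcr,
					unsigned long lpstate_id,
					unsigned long gpstate_id)
{
	pmcr &= 0x0000FFFFFFFFFFFFULL;		/* clear bits 48..63 */
	pmcr |= (gpstate_id & 0xFF) << 56;	/* global pstate */
	pmcr |= (lpstate_id & 0xFF) << 48;	/* local pstate */
	return pmcr;
}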




+/*
+ * Quadratic equation which gives the percentage rampdown for time elapsed in
+ * milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME ( milliseconds )
+ * This equation approximates to y = -4e-6 x^2

Thanks for documenting this, but I think it would also be good to explain why
we use y = -4e-6*x^2 as opposed to any other magic numbers.


Well, this is what empirically worked out best.

On an idle system we want the global pstate to ramp down from the maximum
value to the minimum over a span of ~5 seconds. We also want to ramp down
slowly at first and more rapidly later on, hence the quadratic equation.

I will try to make the comment more informative.
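For illustration, a tiny standalone user-space check of the same curve (this
is not part of the patch, it just evaluates the ramp_down_percent() macro
over time):

#include <stdio.h>

#define MAX_RAMP_DOWN_TIME	5120
#define ramp_down_percent(time)	((time) * (time) >> 18)

int main(void)
{
	unsigned int t;

	/* (t * t) >> 18 ~= t^2 / 262144 ~= 4e-6 * t^2 */
	for (t = 0; t <= MAX_RAMP_DOWN_TIME; t += 1024)
		printf("t = %4u ms -> %3u%% ramp-down\n",
		       t, ramp_down_percent(t));

	return 0;
}

This prints 0, 4, 16, 36, 64 and 100 percent, i.e. a slow start that reaches
a full ramp-down exactly at MAX_RAMP_DOWN_TIME (5120 ms, ~5 seconds).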

Regards

Akshay Adiga


Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-13 Thread Balbir Singh


On 13/04/16 04:06, Akshay Adiga wrote:
> This patch brings down the global pstate at a slower rate than the local
> pstate. The frequency transition latency from pmin to pmax is observed
> to be of a few milliseconds, so a performance penalty is incurred during
> a sudden frequency ramp-up. Holding the global pstate higher than the
> local pstate makes subsequent ramp-ups faster.

What domains do local and global refer to?

> 
> A per-policy structure is maintained to keep track of the global and
> local pstate changes. The global pstate is brought down using a
> parabolic equation. The ramp-down time to pmin is set to 6 seconds. To
> make sure that the global pstate is dropped at regular intervals, a
> timer is queued every 2 seconds, which eventually brings the pstate
> down to the local pstate.
> 
> Iozone results show a fairly consistent performance boost.
> YCSB on redis shows improved max latencies in most cases.
> 
> Iozone write/rewrite tests were run with file sizes of 200704 kB and
> 401408 kB and different record sizes. The following table shows IO
> operations/sec with and without the patch.
> 
> Iozone Results (in op/sec) (mean over 3 iterations)
> 
> file size-              with        without
> recordsize-IOtype       patch       patch       % change
> ---------------------------------------------------------
> 200704-1-SeqWrite 1616532 1615425 0.06
> 200704-1-Rewrite  2423195 2303130 5.21
> 200704-2-SeqWrite 1628577 1602620 1.61
> 200704-2-Rewrite  2428264 2312154 5.02
> 200704-4-SeqWrite 1617605 1617182 0.02
> 200704-4-Rewrite  2430524 2351238 3.37
> 200704-8-SeqWrite 1629478 1600436 1.81
> 200704-8-Rewrite  2415308 2298136 5.09
> 200704-16-SeqWrite1619632 1618250 0.08
> 200704-16-Rewrite 2396650 2352591 1.87
> 200704-32-SeqWrite1632544 1598083 2.15
> 200704-32-Rewrite 2425119 2329743 4.09
> 200704-64-SeqWrite1617812 1617235 0.03
> 200704-64-Rewrite 2402021 2321080 3.48
> 200704-128-SeqWrite   1631998 1600256 1.98
> 200704-128-Rewrite2422389 2304954 5.09
> 200704-256-SeqWrite   1617065 1616962 0.00
> 200704-256-Rewrite2432539 2301980 5.67
> 200704-512-SeqWrite   1632599 1598656 2.12
> 200704-512-Rewrite2429270 2323676 4.54
> 200704-1024-SeqWrite  1618758 1616156 0.16
> 200704-1024-Rewrite   2431631 2315889 4.99
> 401408-1-SeqWrite 1631479 1608132 1.45
> 401408-1-Rewrite  2501550 2459409 1.71
> 401408-2-SeqWrite 1617095 1626069 -0.55
> 401408-2-Rewrite  2507557 2443621 2.61
> 401408-4-SeqWrite 1629601 1611869 1.10
> 401408-4-Rewrite  2505909 2462098 1.77
> 401408-8-SeqWrite 1617110 1626968 -0.60
> 401408-8-Rewrite  2512244 2456827 2.25
> 401408-16-SeqWrite1632609 1609603 1.42
> 401408-16-Rewrite 2500792 2451405 2.01
> 401408-32-SeqWrite1619294 1628167 -0.54
> 401408-32-Rewrite 2510115 2451292 2.39
> 401408-64-SeqWrite1632709 1603746 1.80
> 401408-64-Rewrite 2506692 2433186 3.02
> 401408-128-SeqWrite   1619284 1627461 -0.50
> 401408-128-Rewrite2518698 2453361 2.66
> 401408-256-SeqWrite   1634022 1610681 1.44
> 401408-256-Rewrite2509987 2446328 2.60
> 401408-512-SeqWrite   1617524 1628016 -0.64
> 401408-512-Rewrite2504409 2442899 2.51
> 401408-1024-SeqWrite  1629812 1611566 1.13
> 401408-1024-Rewrite   2507620 2442968 2.64
> 
> Tested with the YCSB workloada over redis with 1 million records and
> 1 million operations. Each test was carried out with a target
> operations-per-second rate and persistence disabled.
> 
> Max latency (in us) (mean over 5 iterations)
> ---------------------------------------------------------
> op/s    Operation   with patch   without patch   %change
> ---------------------------------------------------------
> 15000 Read    61480.6 50261.4 22.32
> 15000 cleanup 215.2   

Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-13 Thread Viresh Kumar
On 13-04-16, 23:27, Akshay Adiga wrote:
> On 04/13/2016 10:33 AM, Viresh Kumar wrote:
> >>+void gpstate_timer_handler(unsigned long data)
> >>+{
> >>+   struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
> >no need to cast.
> 
> I may need a cast here, because data is unsigned long (unlike other
> places where it is void *).
> Building without the cast throws a warning.

My bad, yeah :(

> >>+   if (freq_data.gpstate_id != freq_data.pstate_id)
> >>+   ret = queue_gpstate_timer(gpstates);
> >ret not used.
> 
> Should I make it void instead of returning int? I cannot do much even
> if it fails, except notify.

Sure.

-- 
viresh

Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-13 Thread Akshay Adiga

Hi Viresh,

Thanks for reviewing in detail.
I will address all the coding-standard comments in my next patch.

On 04/13/2016 10:33 AM, Viresh Kumar wrote:


Comments mostly on the coding standards which you have *not* followed.

Also, please run checkpatch --strict next time you send patches
upstream.


Thanks for pointing out the --strict option, I was not aware of it. I will
run checkpatch --strict on the next versions.


On 12-04-16, 23:36, Akshay Adiga wrote:
+
+/*
+ * While resetting we don't want "timer" fields to be set to zero as we
+ * may lose track of timer and will not be able to cleanly remove it
+ */
+#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
+   sizeof(struct global_pstate_info)-\
+   sizeof(struct timer_list)-\
+   sizeof(spinlock_t))
That's super *ugly*. Why don't you create a simple routine which will
set the 5 integer variables to 0 in a straightforward way?


Yes, will create a routine.
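Something along these lines, perhaps (a minimal sketch, using the field names
from global_pstate_info in this patch):

/* Reset only the ramp-down bookkeeping; leave the timer and lock alone. */
static inline void reset_gpstates(struct cpufreq_policy *policy)
{
	struct global_pstate_info *gpstates = policy->driver_data;

	gpstates->highest_lpstate = 0;
	gpstates->elapsed_time = 0;
	gpstates->last_sampled_time = 0;
	gpstates->last_lpstate = 0;
	gpstates->last_gpstate = 0;
}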


@@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
unsigned long val;
unsigned long pstate_ul =
((struct powernv_smp_call_data *) freq_data)->pstate_id;
+   unsigned long gpstate_ul =
+   ((struct powernv_smp_call_data *) freq_data)->gpstate_id;

Remove these unnecessary casts and do:

struct powernv_smp_call_data *freq_data = data; //Name func arg as data

And then use freq_data->*.


Ok. Will do that.
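The prologue would then look roughly like this (a sketch, rest of the
function unchanged):

static void set_pstate(void *data)
{
	struct powernv_smp_call_data *freq_data = data;
	unsigned long val;
	unsigned long pstate_ul = freq_data->pstate_id;
	unsigned long gpstate_ul = freq_data->gpstate_id;

	/* ... rest unchanged ... */
}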


+/*
+ * gpstate_timer_handler
+ *
+ * @data: pointer to cpufreq_policy on which timer was queued
+ *
+ * This handler brings down the global pstate closer to the local pstate
+ * according quadratic equation. Queues a new timer if it is still not equal
+ * to local pstate
+ */
+void gpstate_timer_handler(unsigned long data)
+{
+   struct cpufreq_policy *policy = (struct cpufreq_policy *) data;

no need to cast.


I may need a cast here, because data is unsigned long (unlike other places
where it is void *).
Building without the cast throws a warning.
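For reference, the timer here carries its payload in timer->data, which is an
unsigned long, and the callback also takes an unsigned long, so the pointer
has to be recovered with an explicit cast (sketch, mirroring the init path of
this patch):

	/* setup side */
	gpstates->timer.data = (unsigned long) policy;
	gpstates->timer.function = gpstate_timer_handler;

	/* handler side: integer back to pointer needs the cast */
	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;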


+   struct global_pstate_info *gpstates = (struct global_pstate_info *)
+   policy->driver_data;
+   struct powernv_smp_call_data freq_data;
+   int ret;
+
+   ret = spin_trylock(&gpstates->gpstate_lock);

no need of 'ret' for just this, simply do: if (!spin_trylock())...


Sure, will do that.
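i.e. something like (sketch):

	if (!spin_trylock(&gpstates->gpstate_lock))
		return;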



+   if (!ret)
+   return;
+
+   gpstates->last_sampled_time += time_diff;
+   gpstates->elapsed_time += time_diff;
+   freq_data.pstate_id = gpstates->last_lpstate;
+   if ((gpstates->last_gpstate == freq_data.pstate_id) ||
+   (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
+   freq_data.gpstate_id = freq_data.pstate_id;
+   reset_gpstates(policy);
+   gpstates->highest_lpstate = freq_data.pstate_id;
+   } else {
+   freq_data.gpstate_id = calculate_global_pstate(

You can't break a line after ( of a function call :)

Let it go beyond 80 columns if it has to.


Maybe I will try to get it within 80 columns by using a temporary variable
instead of freq_data.gpstate_id.
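Roughly like this (a sketch of what I have in mind):

	int gpstate_id;	/* declared at the top of the handler */

	gpstate_id = calculate_global_pstate(gpstates->elapsed_time,
					     gpstates->highest_lpstate,
					     freq_data.pstate_id);
	freq_data.gpstate_id = gpstate_id;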


+   gpstates->elapsed_time, gpstates->highest_lpstate,
+   freq_data.pstate_id);
+   }
+
+   /* If local pstate is equal to global pstate, rampdown is over

Bad style again.


+* So timer is not required to be queued.
+*/
+   if (freq_data.gpstate_id != freq_data.pstate_id)
+   ret = queue_gpstate_timer(gpstates);

ret not used.


Should I make it void instead of returning int? I cannot do much even if it
fails, except notify.
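With a void queue_gpstate_timer() the call site would then simply be
(sketch):

	/* Re-queue only while the global pstate still trails the local one. */
	if (freq_data.gpstate_id != freq_data.pstate_id)
		queue_gpstate_timer(gpstates);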


+gpstates_done:
+   gpstates->last_sampled_time = cur_msec;
+   gpstates->last_gpstate = freq_data.gpstate_id;
+   gpstates->last_lpstate = freq_data.pstate_id;
+
/*
 * Use smp_call_function to send IPI and execute the
 * mtspr on target CPU.  We could do that without IPI
 * if current CPU is within policy->cpus (core)
 */
smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
+   spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
+   return 0;
+}
  
+static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)

Add this after the init() routine.


Ok will do it.


+   policy->driver_data = gpstates;
+
+   /* initialize timer */
+   init_timer_deferrable(>timer);
+   gpstates->timer.data = (unsigned long) policy;
+   gpstates->timer.function = gpstate_timer_handler;
+   gpstates->timer.expires = jiffies +
+   msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
+
+   pr_info("Added global_pstate_info & timer for %d cpu\n", base);
return cpufreq_table_validate_and_show(policy, powernv_freqs);

Who will free gpstates if this fails?


Thanks for pointing out. Will fix in v2.
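Something like this for the error path, I think (a sketch, assuming the
surrounding init code from this patch):

	ret = cpufreq_table_validate_and_show(policy, powernv_freqs);
	if (ret < 0) {
		kfree(gpstates);
		policy->driver_data = NULL;
	}

	return ret;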

Regards
Akshay Adiga


Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-12 Thread Viresh Kumar
Comments mostly on the coding standards which you have *not* followed.

Also, please run checkpatch --strict next time you send patches
upstream.

On 12-04-16, 23:36, Akshay Adiga wrote:
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> +#define MAX_RAMP_DOWN_TIME   5120
> +#define ramp_down_percent(time)  ((time * time)>>18)

You need spaces around >>

> +
> +/*Interval after which the timer is queued to bring down global pstate*/

Bad comment style, should be /* ... */

> +#define GPSTATE_TIMER_INTERVAL   2000
> +/*
> + * global_pstate_info :

This is bad as well. See how doc-style comments are written elsewhere.

> + *   per policy data structure to maintain history of global pstates
> + *
> + * @highest_lpstate : the local pstate from which we are ramping down

No space required before :

> + * @elapsed_time : time in ms spent in ramping down from highest_lpstate
> + * @last_sampled_time : time from boot in ms when global pstates were last 
> set
> + * @last_lpstate , last_gpstate : last set values for local and global 
> pstates
> + * @timer : is used for ramping down if cpu goes idle for a long time with
> + *   global pstate held high
> + * @gpstate_lock : a spinlock to maintain synchronization between routines
> + *   called by the timer handler and governer's target_index calls
> + */
> +struct global_pstate_info {
> + int highest_lpstate;
> + unsigned int elapsed_time;
> + unsigned int last_sampled_time;
> + int last_lpstate;
> + int last_gpstate;
> + spinlock_t gpstate_lock;
> + struct timer_list timer;
> +};
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
> + sizeof(struct global_pstate_info)-\
> + sizeof(struct timer_list)-\
> + sizeof(spinlock_t))

That's super *ugly*. Why don't you create a simple routine which will
set the 5 integer variables to 0 in a straightforward way?

> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>   unsigned long val;
>   unsigned long pstate_ul =
>   ((struct powernv_smp_call_data *) freq_data)->pstate_id;
> + unsigned long gpstate_ul =
> + ((struct powernv_smp_call_data *) freq_data)->gpstate_id;

Remove these unnecessary casts and do:

struct powernv_smp_call_data *freq_data = data; //Name func arg as data

And then use freq_data->*.

>  
>   val = get_pmspr(SPRN_PMCR);
>   val = val & 0x0000FFFFFFFFFFFFULL;
>  
>   pstate_ul = pstate_ul & 0xFF;
> + gpstate_ul = gpstate_ul & 0xFF;
>  
>   /* Set both global(bits 56..63) and local(bits 48..55) PStates */
> - val = val | (pstate_ul << 56) | (pstate_ul << 48);
> + val = val | (gpstate_ul << 56) | (pstate_ul << 48);

>  /*

You need /** here. See comments everywhere please, they are mostly
done in a wrong way.

> + * calcuate_global_pstate:
> + *
> + * @elapsed_time : elapsed time in milliseconds
> + * @local_pstate : new local pstate
> + * @highest_lpstate : pstate from which its ramping down
> + *
> + * Finds the appropriate global pstate based on the pstate from which its
> + * ramping down and the time elapsed in ramping down. It follows a quadratic
> + * equation which ensures that it reaches ramping down to pmin in 5sec.
> + */
> +inline int calculate_global_pstate(unsigned int elapsed_time,
> + int highest_lpstate, int local_pstate)

Not aligned properly, checkpatch --strict will tell you what to do.

> +{
> + int pstate_diff;
> +
> + /*
> +  * Using ramp_down_percent we get the percentage of rampdown
> +  * that we are expecting to be dropping. Difference between
> +  * highest_lpstate and powernv_pstate_info.min will give a absolute
> +  * number of how many pstates we will drop eventually by the end of
> +  * 5 seconds, then just scale it get the number pstates to be dropped.
> +  */
> + pstate_diff =  ((int)ramp_down_percent(elapsed_time) *
> + (highest_lpstate - powernv_pstate_info.min))/100;
> +
> + /* Ensure that global pstate is >= to local pstate */
> + if (highest_lpstate - pstate_diff < local_pstate)
> + return local_pstate;
> + else
> + return (highest_lpstate - pstate_diff);

Unnecessary ().

> +}
> +
> +inline int queue_gpstate_timer(struct global_pstate_info *gpstates)

Looks like many of the functions you added should be marked
'static' as well.

> +{
> + unsigned int timer_interval;
> +
> + /* Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But

Bad style again:

/*
 * ...
 * ...
 */

> +  * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
> +  * 

[PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-12 Thread Akshay Adiga
This patch brings down the global pstate at a slower rate than the local
pstate. The frequency transition latency from pmin to pmax is observed
to be of a few milliseconds, so a performance penalty is incurred during
a sudden frequency ramp-up. Holding the global pstate higher than the
local pstate makes subsequent ramp-ups faster.

A per-policy structure is maintained to keep track of the global and
local pstate changes. The global pstate is brought down using a
parabolic equation. The ramp-down time to pmin is set to 6 seconds. To
make sure that the global pstate is dropped at regular intervals, a
timer is queued every 2 seconds, which eventually brings the pstate
down to the local pstate.

Iozone results show a fairly consistent performance boost.
YCSB on redis shows improved max latencies in most cases.

Iozone write/rewrite tests were run with file sizes of 200704 kB and
401408 kB and different record sizes. The following table shows IO
operations/sec with and without the patch.

Iozone Results (in op/sec) (mean over 3 iterations)

file size-            with        without
recordsize-IOtype     patch       patch       % change
-------------------------------------------------------
200704-1-SeqWrite   1616532 1615425 0.06
200704-1-Rewrite2423195 2303130 5.21
200704-2-SeqWrite   1628577 1602620 1.61
200704-2-Rewrite2428264 2312154 5.02
200704-4-SeqWrite   1617605 1617182 0.02
200704-4-Rewrite2430524 2351238 3.37
200704-8-SeqWrite   1629478 1600436 1.81
200704-8-Rewrite2415308 2298136 5.09
200704-16-SeqWrite  1619632 1618250 0.08
200704-16-Rewrite   2396650 2352591 1.87
200704-32-SeqWrite  1632544 1598083 2.15
200704-32-Rewrite   2425119 2329743 4.09
200704-64-SeqWrite  1617812 1617235 0.03
200704-64-Rewrite   2402021 2321080 3.48
200704-128-SeqWrite 1631998 1600256 1.98
200704-128-Rewrite  2422389 2304954 5.09
200704-256-SeqWrite 1617065 1616962 0.00
200704-256-Rewrite  2432539 2301980 5.67
200704-512-SeqWrite 1632599 1598656 2.12
200704-512-Rewrite  2429270 2323676 4.54
200704-1024-SeqWrite1618758 1616156 0.16
200704-1024-Rewrite 2431631 2315889 4.99
401408-1-SeqWrite   1631479 1608132 1.45
401408-1-Rewrite2501550 2459409 1.71
401408-2-SeqWrite   1617095 1626069 -0.55
401408-2-Rewrite2507557 2443621 2.61
401408-4-SeqWrite   1629601 1611869 1.10
401408-4-Rewrite2505909 2462098 1.77
401408-8-SeqWrite   1617110 1626968 -0.60
401408-8-Rewrite2512244 2456827 2.25
401408-16-SeqWrite  1632609 1609603 1.42
401408-16-Rewrite   2500792 2451405 2.01
401408-32-SeqWrite  1619294 1628167 -0.54
401408-32-Rewrite   2510115 2451292 2.39
401408-64-SeqWrite  1632709 1603746 1.80
401408-64-Rewrite   2506692 2433186 3.02
401408-128-SeqWrite 1619284 1627461 -0.50
401408-128-Rewrite  2518698 2453361 2.66
401408-256-SeqWrite 1634022 1610681 1.44
401408-256-Rewrite  2509987 2446328 2.60
401408-512-SeqWrite 1617524 1628016 -0.64
401408-512-Rewrite  2504409 2442899 2.51
401408-1024-SeqWrite1629812 1611566 1.13
401408-1024-Rewrite 2507620 2442968 2.64

Tested with the YCSB workloada over redis with 1 million records and
1 million operations. Each test was carried out with a target
operations-per-second rate and persistence disabled.

Max latency (in us) (mean over 5 iterations)
--------------------------------------------------------
op/s    Operation   with patch   without patch   %change
--------------------------------------------------------
15000   Read    61480.6 50261.4 22.32
15000   cleanup 215.2   293.6   -26.70
15000   update  25666.2 25163.8 2.00

25000   Read    32626.2 89525.4 -63.56
25000