Linux cpuidle framework(4)

Linux cpuidle framework(4)_menu governor

作者：wowo 发布于：2015-1-18 23:14 分类：电源管理子系统

1. 前言

本文以menu governor为例，进一步理解cpuidle framework中governor的概念，并学习governor的实现方法。

在当前的kernel中，有2个governor，分别为ladder和menu（蜗蜗试图理解和查找，为什么会叫这两个名字，暂时还没有答案）。ladder在periodic timer tick system中使用，menu在tickless system中使用。

现在主流的系统，出于电源管理的考量，大多都是tickless system。另外，menu governor会利用pm qos framework（蜗蜗会在后续的文章中分析），在选择策略中加入延迟容忍度（Latency tolerance）的考量。因此本文选取menu governor作为分析对象，至于ladder，就不再分析了。

注：有关periodic timer tick和tickless的知识，可参考本站时间子系统的系列文章。

2. 背后的思考

本节的内容，主要来源于drivers/cpuidle/governors/menu.c中的注释。

governor的主要职责，是根据系统的运行情况，选择一个合适idle state（在kernel的标准术语中，也称作C state）。具体的算法，需要基于下面两点考虑：

1）切换的代价

进入C state的目的，是节省功耗，但CPU在C state和normal state之间切换，是要付出功耗上面的代价的。这最终会体现在idle state的target_residency字段上。

idle driver在注册idle state时，要非常明确state切换的代价，基于该代价，CPU必须在idle state中停留超过一定的时间（target_residency）才是划算的。

因此governor在选择C state时，需要预测出CPU将要在C state中的停留时间，并和备选idle state的target_residency字段比较，选取满足“停留时间 > target_residency”的state。

2）系统的延迟容忍程度

备选的的C state中，功耗和退出延迟是一对不可调和的矛盾，电源管理的目标，是在保证延迟在系统可接受的范围内的情况下，尽可能的节省功耗。

idle driver在注册idle state时，会提供两个信息：CPU在某个state下的功耗（power_usage）和退出该state的延迟（exit_latency）。那么如果知道系统当前所能容忍的延迟（简称latency_req），就可以在所有exit_latency小于latency_req的state中，选取功耗最小的那个。

因此，governor算法就转换为获取系统当前的latency_req，而这正是pm qos的特长。

基于上面的考量，menu governor的主要任务就转化为两个：1. 根据系统的运行情况，预测CPU将在C state中停留的时间（简称predicted_us）；2. 借助pm qos framework，获取系统当前的延迟容忍度（简称latency_req）。

任务1，menu governor从如下几个方面去达成：

前面讲过，menu governor用于tickless system，简化处理，menu将“距离下一个tick来临的时间（由next timer event测量，简称next_timer_us）”作为基础的predicted_us。

当然，这个基础的predicted_us是不准确的，因为在这段时间内，随时都可能产生除next timer event之外的其它wakeup event。为了使预测更准确，有必要加入一个校正因子（correction factor），该校正因子基于过去的实际predicted_us和next_timer_us之间的比率，例如，如果wakeup event都是在预测的next timer event时间的一半时产生，则factor为0.5。另外，为了更精确，menu使用动态平均的factor。

另外，对不同范围的next_timer_us，correction factor的影响程度是不一样的。例如期望50ms和500ms的next timer event时，都是在10ms时产生了wakeup event，显然对500ms的影响比较大。如果计算平均值时将它们混在一起，就会对预测的准确性产生影响，所以计算correction factor的数据时，需要区分不同级别的next_timer_us。同时，系统是否存在io wait，对factor的敏感度也不同。基于这些考虑，menu使用了一组factor（12个），分别用于不同next_timer_us、不同io wait的场景下的的校正。

最后，在有些场合下，next_timer_us的预测是完全不正确的，如存在固定周期的中断时（音频等）。这时menu采用另一种不同的预测方式：统计过去8次停留时间的标准差（stand deviation），如果小于一定的门限值，则使用这8个停留时间的平均值，作为预测值。

任务2，延迟容忍度（latency_req）的估算，menu综合考虑了两种因素，如下：

1）由pm qos获得的，系统期望的，CPU和DMA的延迟需求。这是一个硬性指标。

2）基于这样一个经验法则：越忙的系统，对系统延迟的要求越高，结合任务1中预测到的停留时间（predicted_us），以及当前系统的CPU平均负荷和iowaiters的个数（get_iowait_load函数获得），算出另一个延迟容忍度，计算公式（这是一个经验公式）为：
predicted_us / (1 + 2 * loadavg +10 * iowaiters)
这个公式反映的是退出延迟和预期停留时间之间的比例，loadavg和iowaiters越大，对退出延迟的要求就越高奥。

最后，latency_req的值取上面两个估值的最小值。

3. 代码分析

理解menu governor背后的思考之后，再去看代码，就比较简单了。

3.1 初始化

首先，在init代码中，调用cpuidle_register_governor，注册menu_governor，如下：

        1: static struct cpuidle_governor menu_governor = {

       2:         .name =         "menu",

       3:         .rating =       20,

       4:         .enable =       menu_enable_device,

       5:         .select =       menu_select,

       6:         .reflect =      menu_reflect,

       7:         .owner =        THIS_MODULE,

       8: };

       9:  

      10: /**

      11:  * init_menu - initializes the governor

      12:  */

      13: static int __init init_menu(void)

      14: {

      15:         return cpuidle_register_governor(&menu_governor);

      16: }

      17:  

      18: postcore_initcall(init_menu);

由menu_governor变量可知，该governor的名字为“menu”，rating为20，共提供了enable、select、reflect三个API。

3.2 enable API

enable API负责governor运行前的准备动作，由menu_enable_device实现：

       1: static int menu_enable_device(struct cpuidle_driver *drv,

       2:                                 struct cpuidle_device *dev)

       3: {

       4:         struct menu_device *data = &per_cpu(menu_devices, dev->cpu);

       5:         int i;

       6:  

       7:         memset(data, 0, sizeof(struct menu_device));

       8:  

       9:         /*

      10:          * if the correction factor is 0 (eg first time init or cpu hotplug

      11:          * etc), we actually want to start out with a unity factor.

      12:          */

      13:         for(i = 0; i < BUCKETS; i++)

      14:                 data->correction_factor[i] = RESOLUTION * DECAY;

      15:  

      16:         return 0;

      17: }

由代码可知，主要任务是初始化在私有数据结构（struct menu_device）中保存的correction_factor。struct menu_device的定义如下：

       1: struct menu_device {

       2:         int             last_state_idx;

       3:         int             needs_update;

       4:  

       5:         unsigned int    next_timer_us;

       6:         unsigned int    predicted_us;

       7:         unsigned int    bucket;

       8:         unsigned int    correction_factor[BUCKETS];

       9:         unsigned int    intervals[INTERVALS];

      10:         int             interval_ptr;

      11: };

last_state_idx，记录了上一次进入的C state；

needs_update，每次从C state返回时，kernel（kernel\sched\idle.c）会调用governor的reflect接口，以便有机会让governor考虑这一次state切换的结果（如更新统计信息）。对menu而言，它的reflect接口会设置needs_update标志，并在下一次select时，更新状态，具体行为可参考后面的描述；

next_timer_us、predicted_us，可参考第2章中的有关说明；

correction_factor，保存校正因子的数组，因子的个数为BUCKETS（当前代码为12）；

bucket，指明select state时所使用的因子（当前的校正因子）；

intervals、interval_ptr，可参考第2章中的描述，用于计算停留时间的标准差，当前代码使用了8个停留时间（INTERVALS）。

3.2 select接口

governor的核心API，根据系统的运行情况，选择一个合适的C state。由menu_select接口实现，逻辑如下：

       1: /**

       2:  * menu_select - selects the next idle state to enter

       3:  * @drv: cpuidle driver containing state data

       4:  * @dev: the CPU

       5:  */

       6: static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)

       7: {

       8:     struct menu_device *data = this_cpu_ptr(&menu_devices);

       9:     int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);

    int i;

    unsigned int interactivity_req;

    unsigned long nr_iowaiters, cpu_load;

  

    if (data->needs_update) {

        menu_update(drv, dev);

        data->needs_update = 0;

    }

  

    data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;

  

    /* Special case when user has set very strict latency requirement */

    if (unlikely(latency_req == 0))

        return 0;

  

    /* determine the expected residency time, round up */

    data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());

  

    get_iowait_load(&nr_iowaiters, &cpu_load);

    data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);

  

    /*

     * Force the result of multiplication to be 64 bits even if both

     * operands are 32 bits.

     * Make sure to round up for half microseconds.

     */

    data->predicted_us = div_round64((uint64_t)data->next_timer_us *

                     data->correction_factor[data->bucket],

                     RESOLUTION * DECAY);

  

    get_typical_interval(data);

  

    /*

     * Performance multiplier defines a minimum predicted idle

     * duration / latency ratio. Adjust the latency limit if

     * necessary.

     */

    interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);

    if (latency_req > interactivity_req)

        latency_req = interactivity_req;

  

    /*

     * We want to default to C1 (hlt), not to busy polling

     * unless the timer is happening really really soon.

     */

    if (data->next_timer_us > 5 &&

        !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&

        dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)

        data->last_state_idx = CPUIDLE_DRIVER_STATE_START;

  

    /*

     * Find the idle state with the lowest power while satisfying

     * our constraints.

     */

    for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {

        struct cpuidle_state *s = &drv->states[i];

        struct cpuidle_state_usage *su = &dev->states_usage[i];

  

        if (s->disabled || su->disable)

            continue;

        if (s->target_residency > data->predicted_us)

            continue;

        if (s->exit_latency > latency_req)

            continue;

  

        data->last_state_idx = i;

    }

  

    return data->last_state_idx;

}

8行，取出per cpu的struct menu_device指针；

9行，调用pm_qos_request接口，获取系统CPU和DMA所能容忍的延迟。因为cpuidle状态下，运行任何的中断事件唤醒，因此这里只考虑了CPU和DMA；

14~17行，根据needs_update标志，调用menu_update，更新统计信息，具体可参考代码；

19行，last_state_idx会在menu_reflect中设置，并在menu_update中使用，此时已经没有用处了，初始化为无效值；

22~23行，如果pm qos要求的latency为0，则当前系统是一个比较苛刻的状态，不能进入idle状态，直接返回零。由此可以看出，software可以通过pm qos，控制系统是否可以进入idle状态，后续分析pm qos时，会再说明；

26~29行，调用timer子系统的接口，获取next_timer_us，调用sched提供de接口，获取iowaiter的个数以及CPU load信息，并利用next_timer_us和iowaiters信息，计算出需要使用哪一类校正因子。计算逻辑比较简单，详见代码；

36~39行，将next_timer_us乘以校正因子，得到predicted_us。计算时考虑了溢出、精度等情况；

40行，调用get_typical_interval接口，检查是否存在固定周期的情况，检查的逻辑就是计算8次停留时间的标准差，如果存在，则利用平均值更新predicted_us；

42~48，根据predicted_us和系统负荷情况（cpu load、iowaiters），估算另一个延迟容忍值，并和latency_req，取最小值；

51~78行，根据上面的信息，查找cpuidle device的所有state，选出一个符合条件的state，并返回该state在cpuidle state数组中的index。

3.3 reflect接口

menu的reflect接口比较简单，更新data->last_state_idx后，置位data->needs_update标志。可以多思考一下：为什么不直接在reflect中更新状态，而是到下一次select时再更新？这个问题留给读者吧。

原创文章，转发请注明出处。蜗窝科技，www.wowotech.net。

标签: Linux cpuidle menu governor pm_qos

« “极致”神话和产品观念 | Linux cpuidle framework(3)_ARM64 generic CPU idle driver»

评论：

ctwillson
2022-07-01 15:19

一直没理解在抓到的 systrace 中有个
<idle>-0 (-----) [002] .n.1 39203.759393: cpu_idle: state=4294967295 cpu_id=2
这种 c-state 代表的是什么含义？

pdzhu
2025-02-07 11:07

@ctwillson：state=4294967295为-1，退出idle态的意思

yellow
2017-06-21 16:14

@wowo: 为什么我打印log出来，select 函数一直在调用的（一直在操作手机）？不是应该在suspend的时候才通过select选择 state level吗？
望指教，谢谢

wowo
2017-06-21 16:45

@yellow：cpu的idle和suspend不是一回事，调用频度也不是一个数量级的。
（idle可能会一直在进进出出～）

yellow
2017-06-22 09:09

@wowo：但奇怪的是我一直在操作手机app，还是能打印出来。这个时候idle还有机会被调度到的吗？

wowo
2017-06-22 17:15

@yellow：你可以看看idle的统计信息，操作APP的粒度不够小，cpu进idle很正常。再说了，系统中还不只有一个CPU，所有cpu一直run的可能性很小。

lj
2017-05-23 19:40

next_timer_us的值是最近一个timer定时的值，arch timer也算吧？在SMP中，next_timer_us并不是只针对自己cpu上的线程或进程来考虑的？比如当前cpu 150 us后回有arch timer的中断，另一个cpu上 10 us后会有中断，那么next_timer_us的值应该是10us?

但是 next_timer_us的值是通过tick_nohz_get_sleep_length来获取的，该函数里面是调用 this_cpu_ptr(&tick_cpu_sched)来获取的，这样会影响到前面所说的next_timer_us的值是最近一个timer定时的值，这个结论吗？

wowo
2017-05-25 10:12

@lj：next_timer_us在cpufreq的上下文里面，并不关心timer的绝对值，而是关心在下一个timer到期的时候，自己（当前cpu）会sleep多久。
因此，计算这个时间的时候，会考虑所有的timer（global timer或者arch timer）。

tigger
2015-11-17 11:16

有个问题
1：如果exit_latency 设计的不合理，会出现什么样的后果，
2：或者说exit_latency 是怎么得到的呢？
3：exit_latency 是从c state到完全退出的时间吗？

wowo
2015-11-17 13:16

@tigger：exit_latency是由CPU的特性决定的，软件无法干涉。一般从平台的设计spec里面能拿到，例如“arch/arm/mach-omap2/cpuidle34xx.c”中的那个数组。
当然，从设计的角度看，这个值越小越好，最好是0，但不可能。软件和硬件的实际情况最好能匹配。
“exit_latency 是从c state到完全退出的时间吗？”，从定义上，应该是，但需要根据实际情况取一个近似值。

szlhappy
2015-11-06 16:27

有个疑问，”menu将“距离下一个tick来临的时间（由next timer event测量，简称next_timer_us）”作为基础的predicted_us“
1、系统的tick来临时间不应该是固定的么？
2、在启用noHz之后是怎么来预测这个来临时间呢？

另外，最近在研究cpuidle的退出原因，是否在idle之后schedule的进程就是退出的原因呢？

wowo
2015-11-06 17:23

@szlhappy：1. 对于周期性tick的系统，tick的来临时间是固定的，文中有提到：“menu governor用于tickless system”，因此这个上下文里指的是tickless的系统，也即是no Hz。

2. 对no Hz系统来说，下一次tick的来临，取决于下一次timer中断，这一般是由系统中最近的一个定时器所决定。而系统不可能只由timer唤醒，因此需要加入校正因子。

idle退出的原因很简单：有中断发生了，检查一下idle前后的中断数量，看看是哪一个。引发进程调度的原因很多，idle退出后被调度的那个不一定是罪魁祸首。

szlhappy
2015-11-09 08:30

@wowo：“对no Hz系统来说，下一次tick的来临，取决于下一次timer中断，这一般是由系统中最近的一个定时器所决定。”

下一次timer中断的时间不也是不可确定的么？那么用下个timer中断来的时间next_timer_us作为基础，本身也是不确定的？

wowo
2015-11-09 08:53

@szlhappy："下一次timer中断的时间"是可以确定的，只要看最近的一个定时器是什么时候expire就行了。

ccczh
2021-04-23 17:31

@wowo：在新版的内核代码，在la在新版的内核代码，在ladder.c 有针对nohz处理，这似乎和“menu governor用于tickless system”有点出入.

https://elixir.bootlin.com/linux/latest/source/drivers/cpuidle/governors/ladder.c#L191

shoujixiaodao
2015-07-21 18:48

透彻。不过我看了下虽然governor费力选择出来合适的idle state，但是到具体平台，比如三星，nvdia等并没有支持到这么多的idle state。
同时看了下MTK的实现方式。根本没有用kernel现有的这些cpu idle机制。而是根据自己的平台制定出多重cpu idle状态。进入各种idle状态的条件除了考虑CPUidle state中停留超过一定的时间（没有考虑校正因子），还考虑了CLK tree， power domain以及当前cpus online情形。
所以感觉内核自带的这个gonver结合具体平台不是很亲密。

wowo
2015-07-21 21:37

@shoujixiaodao：个人感觉，ARM和linux kernel的目标是一致的，尽量做到标准化。其它厂商，特别是那些有实力的厂商，就不这么想了。
从技术的角度看，标准化是好的选择，可以更优雅，更高效。但从商业的角度看，就不一定了。

coray
2015-03-20 13:44

看了wowo一系列都文章，总的来说，文章质量都很高。但是，有点意见必须提出的：文章很少用示意图表达模块的框架，例如用UML顺序图等。现在文章内容很多，看了几篇了觉得只见树木，不见树林。希望这方面能改进啊wowo

wowo
2015-03-20 18:57

@coray：感谢coray的建议，您说的非常对，很多时候，一张图比一堆文字所表达的东西都多。但画图很耗时间，以后有机会整理这些文章的时候，我一定遵照您的建议。多谢~~

haichunzhao
2015-01-19 19:49

ladder 梯子必须一级一级的向上。所以ladder实现的就是状态必须逐级变化
menu 菜单想要哪个直接选中。所以menu实现的就是目标状态直接选中

最后的问题，我想就是cpu从退出状态到开始工作，要尽可能的快。

wowo
2015-01-19 22:21

@haichunzhao：多谢haichun解惑，这样一解释，还是很贴切、形象的，佩服老外的想象力。
最后的问题，我和你的想法是一样的，优雅的设计啊！

radar
2015-05-21 10:00

@haichunzhao：赞，感谢！！

发表评论：

蜗窝科技

慢下来，享受技术。