I hit this problem when running a docker test script.

#---- script begin ----

for((i=0; i<1000000; i++))
do
    echo "=================================== test $i start ================================"
    rm -rf /media/ram/docker/test/*
    docker run -d --name test -h test -v /media/docker/app/dataindex:/media/dataindex -v /media/ram/docker/test:/run -v /media/docker/app/test:/media/container -v /media/wave:/media/wave:ro -v /run/systemd/journal/dev-log:/dev/log --ipc=host --log-driver=journald --init arpubt:mips64el-1.2

    docker ps -s
    docker stats --no-stream

    cp /media/arp/testsvc.tar.gz /media/ram/docker/test/
    docker exec -it test app-manager install testsvc
    sleep 3

    docker ps -s
    docker stats --no-stream
    docker exec -it test monit status
    sleep 3

    docker exec -it test app-manager disable testsvc
    sleep 1

    docker exec -it test monit status
    sleep 3

    docker exec -it test app-manager enable testsvc
    sleep 1

    docker exec -it test app-manager disable testsvc
    sleep 1

    docker exec -it test app-manager enable testsvc
    sleep 1

    docker exec -it test app-manager disable testsvc
    sleep 1

    docker exec -it test app-manager enable testsvc
    sleep 1

    docker ps -s
    docker stats --no-stream
    sleep 1

    docker exec -it test app-manager uninstall testsvc
    sleep 1

    docker rm -f test

    sleep 1
    docker ps -s
    docker stats --no-stream

    echo "================================== test $i end =================================="
done

#---- script end ----

Sometimes, docker exec stalls for a long time.

testsvc is a busy-loop application:

int i;

int main()
{
	while (1)
		i++;
}

Using kgdb, I see there are two tasks on the run queue. The docker exec task's vruntime is very large, and that value is almost the same as the min_vruntime of the cgroup it came from. Like below:

CFS0 0x98000002de50f6e0, enter cgroup 0xffffffff81349900
CFS1 0x98000002dba65700, enter cgroup 0x98000002dbc4a700
  SE: 0x98000002dab95618 vruntime:49388638792 14348 14344 runc:[2:INIT] ra=0xffffffff80e8e110 sp=0x98000002d2b33d40 epc=0x12006b0f0 usp=0xffeafc6810 task=0x98000002dab955b0 tinfo=0x98000002d2b30000 uregs=0x98000002d2b33eb0 stat=0x1
CFS2 0x98000002dba57700, enter cgroup 0x98000000014bf100
  se: 0x98000002daa92a78 vruntime:49391638792 (always run) 14266 13758 testsvc ra=0xffffffff80e8e110 sp=0x98000002d2cbfe30 epc=0x1200017f0 usp=0xffffc8bc10 task=0x98000002daa92a10 tinfo=0x98000002d2cbc000 uregs=0x98000002d2cbfeb0 stat=0x0
  se: 0x98000002dab5b488 vruntime:242669408647 (large) 14349 14344 runc:[2:INIT] ra=0xffffffff80e8e110 sp=0x98000002d2b37ba0 epc=0x12006b3d8 usp=0xffea7c6798 task=0x98000002dab5b420 tinfo=0x98000002d2b34000 uregs=0x98000002d2b37eb0 stat=0x0

I added check code in sched_move_task() and found that the task has run and is now stopped (on_rq == 0), yet its vruntime is not adjusted by sched_move_task(), because se->sum_exec_runtime == 0.
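(The actual check code is not shown above; the following is only a minimal sketch of what it might have looked like. The function name check_stale_vruntime and the message text are my own assumptions; only the on_rq / sum_exec_runtime / nvcsw+nivcsw logic comes from the description.)

    /*
     * Sketch of the check described above (hypothetical; called from
     * sched_move_task()). It flags tasks that have context-switched
     * before (nvcsw + nivcsw != 0) but still report
     * sum_exec_runtime == 0 while dequeued (on_rq == 0).
     */
    static void check_stale_vruntime(struct task_struct *tsk)
    {
            if (!tsk->on_rq && !tsk->se.sum_exec_runtime &&
                (tsk->nvcsw + tsk->nivcsw) != 0)
                    printk(KERN_WARNING "%s/%d switched %lu+%lu times but sum_exec_runtime == 0\n",
                           tsk->comm, tsk->pid, tsk->nvcsw, tsk->nivcsw);
    }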

I added debug code in update_curr():

    if (unlikely((s64)delta_exec <= 0)) {
            if (!curr->sum_exec_runtime) {
                    struct task_struct *p;

                    p = container_of(curr, struct task_struct, se);
                    /* The entity is a task that has already been
                       context-switched, yet its runtime is still 0. */
                    if (entity_is_task(curr) && (p->nvcsw + p->nivcsw) != 0) {
                            curr->sum_exec_runtime += 1;
                            dump_stack();
                    }
            }
            return;
    }
dump_stack() was indeed called from time to time.

Now, after reading your mail, I understand the reason:

    unsigned long long __weak sched_clock(void)
    {
            return (unsigned long long)(jiffies - INITIAL_JIFFIES)
                                            * (NSEC_PER_SEC / HZ);
    }

My kernel uses this default sched_clock(), which uses jiffies as the scheduler clock.

I should use a high-resolution scheduler clock instead; then delta_exec in update_curr() will not be 0.
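To make this concrete, here is a small user-space model of the jiffies-based clock (my own illustration, not kernel code; HZ=250 is an assumed value): two scheduling events that land inside the same tick read identical clock values, so delta_exec is 0 and sum_exec_runtime never advances.

    #include <stdio.h>

    #define HZ 250
    #define NSEC_PER_SEC 1000000000ULL

    /* Model of the weak sched_clock(): it only advances once per tick. */
    static unsigned long long jiffies_clock(unsigned long jiffies)
    {
            return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
    }

    int main(void)
    {
            /* Two scheduling events inside the same 4 ms tick... */
            unsigned long long exec_start = jiffies_clock(100);
            unsigned long long now = jiffies_clock(100);

            /* ...see a zero delta, so update_curr() would return early and
               curr->sum_exec_runtime would stay 0 however often the task ran. */
            printf("delta_exec = %lld ns\n", (long long)(now - exec_start));
            return 0;
    }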

So thanks a lot. But I still think the curr->sum_exec_runtime check is not accurate enough.
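One direction a stricter check could take (just a sketch of the idea, not a tested or accepted fix, and the helper name is my own) is to also consult the context-switch counters, which are incremented on every switch even when delta_exec rounds down to 0:

    /*
     * Sketch: a stricter "has this task ever run?" test for
     * vruntime_normalized(). sum_exec_runtime can stay 0 under a
     * jiffies-resolution clock, but nvcsw/nivcsw are bumped on every
     * context switch, so together they can tell a freshly forked task
     * from one that has already run.
     */
    static inline bool task_never_ran(struct task_struct *p)
    {
            return !p->se.sum_exec_runtime && !(p->nvcsw + p->nivcsw);
    }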


On 2019/8/27 10:36 PM, Peter Zijlstra wrote:
> On Mon, Aug 26, 2019 at 07:46:50PM +0800, QiaoChong wrote:
>> From: Chong Qiao <qiaoch...@loongson.cn>
>>
>> Such as:
>> cpu_cgroup_attach>
>>   sched_move_task>
>>    task_change_group_fair>
>>     task_move_group_fair>
>>      detach_task_cfs_rq>
>>       vruntime_normalized>
>>
>>         /*
>>          * When !on_rq, vruntime of the task has usually NOT been normalized.
>>          * But there are some cases where it has already been normalized:
>>          *
>>          * - A forked child which is waiting for being woken up by
>>          *   wake_up_new_task().
>>          * - A task which has been woken up by try_to_wake_up() and
>>          *   waiting for actually being woken up by sched_ttwu_pending().
>>          */
>>         if (!se->sum_exec_runtime ||
>>             (p->state == TASK_WAKING && p->sched_remote_wakeup))
>>                 return true;
>>
>> p->se.sum_exec_runtime being 0 does not mean the task has never run
>> (i.e. that it is a forked child waiting to be woken up by
>> wake_up_new_task()).
>>
>> The task may have been scheduled many times while p->se.sum_exec_runtime
>> stays 0, because delta_exec may be 0 in update_curr():
>>
>> static void update_curr(struct cfs_rq *cfs_rq)
>> {
>> ...
>>         delta_exec = now - curr->exec_start;
>>         if (unlikely((s64)delta_exec <= 0))
>>                 return;
>> ...
>>
>>         curr->sum_exec_runtime += delta_exec;
>> ...
>> }
>>
>> A task can have run and be stopped (on_rq == 0) with its vruntime not
>> normalized, yet se->sum_exec_runtime == 0.
>> This makes vruntime_normalized() return true, so the vruntime is not
>> normalized. The task then keeps its old vruntime from the old cgroup,
>> which may be much larger than min_vruntime in the new cgroup, and the
>> task may not be scheduled from the run queue for a long time after
>> being woken up.
>>
>> Now I change sum_exec_runtime to 1 when sum_exec_runtime == 0 in
>> update_curr() to make sum_exec_runtime non-zero.
>
> Have you actually observed this? It is very hard to have a 0 delta
> between two scheduling events.


--


乔崇 qiaoch...@loongson.cn

Wednesday, 28 August 2019, 11:31:43 CST
