Re: [PATCH] drm/scheduler: fix race condition in load balancer
I think I know why it happens. At init every entity's rq gets assigned to sched_list[0]. I put some prints in to check what we compare in drm_sched_entity_get_free_sched. It turns out that most of the time it compares zero values (num_jobs(0) < min_jobs(0)), so most of the time the 1st rq (sdma0, comp_1.0.0) was picked by drm_sched_entity_get_free_sched. Well, that is expected, because the unit tests always do submission, wait, submission, wait, submission, wait, so the number of jobs in the scheduler becomes zero in between. Even with multiple parallel instances of amdgpu_test I haven't seen any improvement in the load balance.

This patch was not correct; it had an extra atomic_inc(num_jobs) in drm_sched_job_init. This probably added a bit of randomness, which helped to distribute jobs better.

> Mhm, that might not be a bad idea after all. We could rename num_jobs into something like score and do a +1 in drm_sched_rq_add_entity() and a -1 in drm_sched_rq_remove_entity(). That should have pretty much the effect we want to have.

That sounds good as well. I will create a patch.

I've updated my previous RFC patch, which uses the time consumed by each sched for load balancing, with a twist of ignoring the previously scheduled sched/rq. Let me know what you think.

> I didn't have time yet to wrap my head around that in detail, but at least off hand Luben is right that the locking looks really awkward.

I was unable to find a better way to do the locking part. My mail client might've missed Luben's review; I can't find it :/

Regards,
Nirmoy

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/scheduler: fix race condition in load balancer
Hi Nirmoy,

On 15.01.20 at 12:04, Nirmoy wrote:
> Hi Christian,
>
> On 1/14/20 5:01 PM, Christian König wrote:
>> Before this patch:
>>
>> sched_name   number of times it got scheduled
>> ==========   ================================
>> sdma0        314
>> sdma1        32
>> comp_1.0.0   56
>> comp_1.1.0   0
>> comp_1.1.1   0
>> comp_1.2.0   0
>> comp_1.2.1   0
>> comp_1.3.0   0
>> comp_1.3.1   0
>>
>> After this patch:
>>
>> sched_name   number of times it got scheduled
>> ==========   ================================
>> sdma1        243
>> sdma0        164
>> comp_1.0.1   14
>> comp_1.1.0   11
>> comp_1.1.1   10
>> comp_1.2.0   15
>> comp_1.2.1   14
>> comp_1.3.0   10
>> comp_1.3.1   10
>>
>> Well that is still rather nice to have, why does that happen?
>
> I think I know why it happens. At init every entity's rq gets assigned to sched_list[0]. I put some prints in to check what we compare in drm_sched_entity_get_free_sched. It turns out that most of the time it compares zero values (num_jobs(0) < min_jobs(0)), so most of the time the 1st rq (sdma0, comp_1.0.0) was picked by drm_sched_entity_get_free_sched. Well, that is expected, because the unit tests always do submission, wait, submission, wait, submission, wait, so the number of jobs in the scheduler becomes zero in between.
>
> This patch was not correct; it had an extra atomic_inc(num_jobs) in drm_sched_job_init. This probably added a bit of randomness, which helped to distribute jobs better.

Mhm, that might not be a bad idea after all. We could rename num_jobs into something like score and do a +1 in drm_sched_rq_add_entity() and a -1 in drm_sched_rq_remove_entity(). That should have pretty much the effect we want to have.

> I've updated my previous RFC patch, which uses the time consumed by each sched for load balancing, with a twist of ignoring the previously scheduled sched/rq. Let me know what you think.

I didn't have time yet to wrap my head around that in detail, but at least off hand Luben is right that the locking looks really awkward. And I would rather like to avoid a larger change like this for a nice-to-have testing feature.

Regards,
Christian.
Re: [PATCH] drm/scheduler: fix race condition in load balancer
Hi Christian,

On 1/14/20 5:01 PM, Christian König wrote:
> Before this patch:
>
> sched_name   number of times it got scheduled
> ==========   ================================
> sdma0        314
> sdma1        32
> comp_1.0.0   56
> comp_1.1.0   0
> comp_1.1.1   0
> comp_1.2.0   0
> comp_1.2.1   0
> comp_1.3.0   0
> comp_1.3.1   0
>
> After this patch:
>
> sched_name   number of times it got scheduled
> ==========   ================================
> sdma1        243
> sdma0        164
> comp_1.0.1   14
> comp_1.1.0   11
> comp_1.1.1   10
> comp_1.2.0   15
> comp_1.2.1   14
> comp_1.3.0   10
> comp_1.3.1   10
>
> Well that is still rather nice to have, why does that happen?

I think I know why it happens. At init every entity's rq gets assigned to sched_list[0]. I put some prints in to check what we compare in drm_sched_entity_get_free_sched. It turns out that most of the time it compares zero values (num_jobs(0) < min_jobs(0)), so most of the time the 1st rq (sdma0, comp_1.0.0) was picked by drm_sched_entity_get_free_sched.

This patch was not correct; it had an extra atomic_inc(num_jobs) in drm_sched_job_init. This probably added a bit of randomness, which helped to distribute jobs better.

I've updated my previous RFC patch, which uses the time consumed by each sched for load balancing, with a twist of ignoring the previously scheduled sched/rq. Let me know what you think.

Regards,
Nirmoy
Re: [PATCH] drm/scheduler: fix race condition in load balancer
On 14.01.20 at 17:13, Nirmoy wrote:
> On 1/14/20 5:01 PM, Christian König wrote:
>> On 14.01.20 at 16:43, Nirmoy Das wrote:
>>> Jobs submitted in an entity should execute in the order those jobs are submitted. We make sure of that by checking entity->job_queue in drm_sched_entity_select_rq() so that we don't load-balance jobs within an entity. But because we update entity->job_queue later, in drm_sched_entity_push_job(), there remains an open window in which entity->rq might get updated by drm_sched_entity_select_rq(), which should not be allowed.
>> NAK, concurrent calls to drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in the first place, or otherwise we mess up the fence sequence order and risk memory corruption.
>>> Changes in this part also improve job distribution. Below are test results after running amdgpu_test from mesa drm:
>>>
>>> Before this patch:
>>>
>>> sched_name   number of times it got scheduled
>>> ==========   ================================
>>> sdma0        314
>>> sdma1        32
>>> comp_1.0.0   56
>>> comp_1.1.0   0
>>> comp_1.1.1   0
>>> comp_1.2.0   0
>>> comp_1.2.1   0
>>> comp_1.3.0   0
>>> comp_1.3.1   0
>>>
>>> After this patch:
>>>
>>> sched_name   number of times it got scheduled
>>> ==========   ================================
>>> sdma1        243
>>> sdma0        164
>>> comp_1.0.1   14
>>> comp_1.1.0   11
>>> comp_1.1.1   10
>>> comp_1.2.0   15
>>> comp_1.2.1   14
>>> comp_1.3.0   10
>>> comp_1.3.1   10
>> Well that is still rather nice to have, why does that happen?
> I think it is because we are updating num_jobs immediately after selecting a new rq. Previously we did that only later, after drm_sched_job_init(), in drm_sched_entity_push_job(). The problem is, if I just do
>
> @@ -562,6 +562,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>  		return -ENOENT;
>  
>  	sched = entity->rq->sched;
> +	atomic_inc(&entity->rq->sched->num_jobs);
>
> @@ -498,7 +504,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>  	bool first;
>  
>  	trace_drm_sched_job(sched_job, entity);
> -	atomic_inc(&entity->rq->sched->num_jobs);
>
> then num_jobs gets negative somewhere down the line.
> I am guessing it's hitting the race condition as I explained in the commit message.

The race condition you explain in the commit message should be impossible to hit, or we have much, much larger problems than just an incorrect job count. Incrementing num_jobs so early is not possible either, since the job might not get pushed to the entity because of an error.

Christian.

> Regards,
> Nirmoy
>
>>> Fixes: 35e160e781a048 (drm/scheduler: change entities rq even earlier)
>>> Signed-off-by: Nirmoy Das
>>> Reported-by: Pierre-Eric Pelloux-Prayer
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_entity.c | 9 +++--
>>>  drivers/gpu/drm/scheduler/sched_main.c   | 1 +
>>>  include/drm/gpu_scheduler.h              | 1 +
>>>  3 files changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 2e3a058fc239..8414e084b6ac 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -67,6 +67,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>>  	entity->priority = priority;
>>>  	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
>>>  	entity->last_scheduled = NULL;
>>> +	entity->loadbalance_on = true;
>>>  
>>>  	if(num_sched_list)
>>>  		entity->rq = &sched_list[0]->sched_rq[entity->priority];
>>> @@ -447,6 +448,9 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>  	entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>>>  
>>>  	spsc_queue_pop(&entity->job_queue);
>>> +	if (!spsc_queue_count(&entity->job_queue))
>>> +		entity->loadbalance_on = true;
>>> +
>>>  	return sched_job;
>>>  }
>>> @@ -463,7 +467,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>>  	struct dma_fence *fence;
>>>  	struct drm_sched_rq *rq;
>>>  
>>> -	if (spsc_queue_count(&entity->job_queue) || entity->num_sched_list <= 1)
>>> +	atomic_inc(&entity->rq->sched->num_jobs);
>>> +	if ((entity->num_sched_list <= 1) || !entity->loadbalance_on)
>>>  		return;
>>>  
>>>  	fence = READ_ONCE(entity->last_scheduled);
>>> @@ -477,6 +482,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>>  		entity->rq = rq;
>>>  	}
>>>  
>>> +	entity->loadbalance_on = false;
>>>  	spin_unlock(&entity->rq_lock);
>>>  }
>>> @@ -498,7 +504,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>>>  	bool first;
>>>  
>>>  	trace_drm_sched_job(sched_job, entity);
>>> -	atomic_inc(&entity->rq->sched->num_jobs);
>>>  	WRITE_ONCE(entity->last_user, current->group_leader);
>>>  
>>>  	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 3fad5876a13f..00fdc350134e 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -562,6 +562,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
Re: [PATCH] drm/scheduler: fix race condition in load balancer
On 1/14/20 5:23 PM, Christian König wrote:
> On 14.01.20 at 17:20, Nirmoy wrote:
>> On 1/14/20 5:01 PM, Christian König wrote:
>>> On 14.01.20 at 16:43, Nirmoy Das wrote:
>>>> Jobs submitted in an entity should execute in the order those jobs are submitted. We make sure of that by checking entity->job_queue in drm_sched_entity_select_rq() so that we don't load-balance jobs within an entity. But because we update entity->job_queue later, in drm_sched_entity_push_job(), there remains an open window in which entity->rq might get updated by drm_sched_entity_select_rq(), which should not be allowed.
>>> NAK, concurrent calls to drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in the first place, or otherwise we mess up the fence sequence order and risk memory corruption.
>> If I am not missing something, I don't see any lock securing the drm_sched_job_init()/drm_sched_entity_push_job() calls in amdgpu_cs_submit().
> See one step up in the call chain, in function amdgpu_cs_ioctl(). This is locking the page tables, which also makes access to the context and entities mutually exclusive:
>
>     r = amdgpu_cs_parser_bos(&parser, data);
>     ...
>     r = amdgpu_cs_submit(&parser, cs);
> out:
>
> And here the page tables are unlocked again:
>
>     amdgpu_cs_parser_fini(&parser, r, reserved_buffers);

Okay. Then something else is going on. Let me dig more.

Thanks,
Nirmoy
Re: [PATCH] drm/scheduler: fix race condition in load balancer
On 14.01.20 at 17:20, Nirmoy wrote:
> On 1/14/20 5:01 PM, Christian König wrote:
>> On 14.01.20 at 16:43, Nirmoy Das wrote:
>>> Jobs submitted in an entity should execute in the order those jobs are submitted. We make sure of that by checking entity->job_queue in drm_sched_entity_select_rq() so that we don't load-balance jobs within an entity. But because we update entity->job_queue later, in drm_sched_entity_push_job(), there remains an open window in which entity->rq might get updated by drm_sched_entity_select_rq(), which should not be allowed.
>> NAK, concurrent calls to drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in the first place, or otherwise we mess up the fence sequence order and risk memory corruption.
> If I am not missing something, I don't see any lock securing the drm_sched_job_init()/drm_sched_entity_push_job() calls in amdgpu_cs_submit().

See one step up in the call chain, in function amdgpu_cs_ioctl(). This is locking the page tables, which also makes access to the context and entities mutually exclusive:

    r = amdgpu_cs_parser_bos(&parser, data);
    ...
    r = amdgpu_cs_submit(&parser, cs);
out:

And here the page tables are unlocked again:

    amdgpu_cs_parser_fini(&parser, r, reserved_buffers);

Regards,
Christian.
Re: [PATCH] drm/scheduler: fix race condition in load balancer
On 1/14/20 5:01 PM, Christian König wrote:
> On 14.01.20 at 16:43, Nirmoy Das wrote:
>> Jobs submitted in an entity should execute in the order those jobs are submitted. We make sure of that by checking entity->job_queue in drm_sched_entity_select_rq() so that we don't load-balance jobs within an entity. But because we update entity->job_queue later, in drm_sched_entity_push_job(), there remains an open window in which entity->rq might get updated by drm_sched_entity_select_rq(), which should not be allowed.
> NAK, concurrent calls to drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in the first place, or otherwise we mess up the fence sequence order and risk memory corruption.

If I am not missing something, I don't see any lock securing the drm_sched_job_init()/drm_sched_entity_push_job() calls in amdgpu_cs_submit().

Regards,
Nirmoy
Re: [PATCH] drm/scheduler: fix race condition in load balancer
On 1/14/20 5:01 PM, Christian König wrote:
> On 14.01.20 at 16:43, Nirmoy Das wrote:
>> Jobs submitted in an entity should execute in the order those jobs are submitted. We make sure of that by checking entity->job_queue in drm_sched_entity_select_rq() so that we don't load-balance jobs within an entity. But because we update entity->job_queue later, in drm_sched_entity_push_job(), there remains an open window in which entity->rq might get updated by drm_sched_entity_select_rq(), which should not be allowed.
> NAK, concurrent calls to drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in the first place, or otherwise we mess up the fence sequence order and risk memory corruption.
>> Changes in this part also improve job distribution. Below are test results after running amdgpu_test from mesa drm:
>>
>> Before this patch:
>>
>> sched_name   number of times it got scheduled
>> ==========   ================================
>> sdma0        314
>> sdma1        32
>> comp_1.0.0   56
>> comp_1.1.0   0
>> comp_1.1.1   0
>> comp_1.2.0   0
>> comp_1.2.1   0
>> comp_1.3.0   0
>> comp_1.3.1   0
>>
>> After this patch:
>>
>> sched_name   number of times it got scheduled
>> ==========   ================================
>> sdma1        243
>> sdma0        164
>> comp_1.0.1   14
>> comp_1.1.0   11
>> comp_1.1.1   10
>> comp_1.2.0   15
>> comp_1.2.1   14
>> comp_1.3.0   10
>> comp_1.3.1   10
> Well that is still rather nice to have, why does that happen?

I think it is because we are updating num_jobs immediately after selecting a new rq. Previously we did that only later, after drm_sched_job_init(), in drm_sched_entity_push_job(). The problem is, if I just do

@@ -562,6 +562,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
 		return -ENOENT;
 
 	sched = entity->rq->sched;
+	atomic_inc(&entity->rq->sched->num_jobs);

@@ -498,7 +504,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	bool first;
 
 	trace_drm_sched_job(sched_job, entity);
-	atomic_inc(&entity->rq->sched->num_jobs);

then num_jobs gets negative somewhere down the line. I am guessing it's hitting the race condition as I explained in the commit message.

Regards,
Nirmoy

>> Fixes: 35e160e781a048 (drm/scheduler: change entities rq even earlier)
>> Signed-off-by: Nirmoy Das
>> Reported-by: Pierre-Eric Pelloux-Prayer
>> ---
>>  drivers/gpu/drm/scheduler/sched_entity.c | 9 +++--
>>  drivers/gpu/drm/scheduler/sched_main.c   | 1 +
>>  include/drm/gpu_scheduler.h              | 1 +
>>  3 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 2e3a058fc239..8414e084b6ac 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -67,6 +67,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>  	entity->priority = priority;
>>  	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
>>  	entity->last_scheduled = NULL;
>> +	entity->loadbalance_on = true;
>>  
>>  	if(num_sched_list)
>>  		entity->rq = &sched_list[0]->sched_rq[entity->priority];
>> @@ -447,6 +448,9 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>  	entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>>  
>>  	spsc_queue_pop(&entity->job_queue);
>> +	if (!spsc_queue_count(&entity->job_queue))
>> +		entity->loadbalance_on = true;
>> +
>>  	return sched_job;
>>  }
>> @@ -463,7 +467,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>  	struct dma_fence *fence;
>>  	struct drm_sched_rq *rq;
>>  
>> -	if (spsc_queue_count(&entity->job_queue) || entity->num_sched_list <= 1)
>> +	atomic_inc(&entity->rq->sched->num_jobs);
>> +	if ((entity->num_sched_list <= 1) || !entity->loadbalance_on)
>>  		return;
>>  
>>  	fence = READ_ONCE(entity->last_scheduled);
>> @@ -477,6 +482,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>  		entity->rq = rq;
>>  	}
>>  
>> +	entity->loadbalance_on = false;
>>  	spin_unlock(&entity->rq_lock);
>>  }
>> @@ -498,7 +504,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>>  	bool first;
>>  
>>  	trace_drm_sched_job(sched_job, entity);
>> -	atomic_inc(&entity->rq->sched->num_jobs);
>>  	WRITE_ONCE(entity->last_user, current->group_leader);
>>  
>>  	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 3fad5876a13f..00fdc350134e 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -562,6 +562,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>  		return -ENOENT;
>>  
>>  	sched = entity->rq->sched;
>> +	atomic_inc(&entity->rq->sched->num_jobs);
>>  	job->sched = sched;
>>  	job->entity = entity;
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 96a1a1b7526e..a5190869d323 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
Re: [PATCH] drm/scheduler: fix race condition in load balancer
On 14.01.20 at 16:43, Nirmoy Das wrote:
> Jobs submitted in an entity should execute in the order those jobs are submitted. We make sure of that by checking entity->job_queue in drm_sched_entity_select_rq() so that we don't load-balance jobs within an entity. But because we update entity->job_queue later, in drm_sched_entity_push_job(), there remains an open window in which entity->rq might get updated by drm_sched_entity_select_rq(), which should not be allowed.

NAK, concurrent calls to drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in the first place, or otherwise we mess up the fence sequence order and risk memory corruption.

> Changes in this part also improve job distribution. Below are test results after running amdgpu_test from mesa drm:
>
> Before this patch:
>
> sched_name   number of times it got scheduled
> ==========   ================================
> sdma0        314
> sdma1        32
> comp_1.0.0   56
> comp_1.1.0   0
> comp_1.1.1   0
> comp_1.2.0   0
> comp_1.2.1   0
> comp_1.3.0   0
> comp_1.3.1   0
>
> After this patch:
>
> sched_name   number of times it got scheduled
> ==========   ================================
> sdma1        243
> sdma0        164
> comp_1.0.1   14
> comp_1.1.0   11
> comp_1.1.1   10
> comp_1.2.0   15
> comp_1.2.1   14
> comp_1.3.0   10
> comp_1.3.1   10

Well that is still rather nice to have, why does that happen?

Christian.

> Fixes: 35e160e781a048 (drm/scheduler: change entities rq even earlier)
> Signed-off-by: Nirmoy Das
> Reported-by: Pierre-Eric Pelloux-Prayer
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 9 +++--
>  drivers/gpu/drm/scheduler/sched_main.c   | 1 +
>  include/drm/gpu_scheduler.h              | 1 +
>  3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 2e3a058fc239..8414e084b6ac 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -67,6 +67,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  	entity->priority = priority;
>  	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
>  	entity->last_scheduled = NULL;
> +	entity->loadbalance_on = true;
>  
>  	if(num_sched_list)
>  		entity->rq = &sched_list[0]->sched_rq[entity->priority];
> @@ -447,6 +448,9 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>  	entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>  
>  	spsc_queue_pop(&entity->job_queue);
> +	if (!spsc_queue_count(&entity->job_queue))
> +		entity->loadbalance_on = true;
> +
>  	return sched_job;
>  }
> @@ -463,7 +467,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>  	struct dma_fence *fence;
>  	struct drm_sched_rq *rq;
>  
> -	if (spsc_queue_count(&entity->job_queue) || entity->num_sched_list <= 1)
> +	atomic_inc(&entity->rq->sched->num_jobs);
> +	if ((entity->num_sched_list <= 1) || !entity->loadbalance_on)
>  		return;
>  
>  	fence = READ_ONCE(entity->last_scheduled);
> @@ -477,6 +482,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>  		entity->rq = rq;
>  	}
>  
> +	entity->loadbalance_on = false;
>  	spin_unlock(&entity->rq_lock);
>  }
> @@ -498,7 +504,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>  	bool first;
>  
>  	trace_drm_sched_job(sched_job, entity);
> -	atomic_inc(&entity->rq->sched->num_jobs);
>  	WRITE_ONCE(entity->last_user, current->group_leader);
>  
>  	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 3fad5876a13f..00fdc350134e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -562,6 +562,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>  		return -ENOENT;
>  
>  	sched = entity->rq->sched;
> +	atomic_inc(&entity->rq->sched->num_jobs);
>  	job->sched = sched;
>  	job->entity = entity;
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 96a1a1b7526e..a5190869d323 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -97,6 +97,7 @@ struct drm_sched_entity {
>  	struct dma_fence		*last_scheduled;
>  	struct task_struct		*last_user;
>  	bool				stopped;
> +	bool				loadbalance_on;
>  	struct completion		entity_idle;
>  };