Taras Ledkov created IGNITE-3558:
------------------------------------

             Summary: Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
                 Key: IGNITE-3558
                 URL: https://issues.apache.org/jira/browse/IGNITE-3558
             Project: Ignite
          Issue Type: Bug
          Components: compute
            Reporter: Taras Ledkov
            Assignee: Taras Ledkov

The test to reproduce: IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest#testJobFinishing

*Root cause*
GridJobExecuteResponse isn't sent from the target node because GridJobWorker instances get mixed up in the CollisionContext.

*Suggestion*
The method GridJobProcessor.CollisionJobContext.cancel() uses passiveJobs.remove(jobWorker.getJobId(), jobWorker). *passiveJobs* is a ConcurrentHashMap, and GridJobWorker.equals() is implemented as a comparison of jobId. So, when two threads try to cancel two workers with *the same jobId*, we have the following case:
- thread0: removes jobWorker0 and cancels jobWorker0;
- thread0: puts jobWorker1 (because jobWorker0 has already been removed);
- thread1: still holds jobWorker0 and tries to cancel it;
- thread1: removes jobWorker1 instead of jobWorker0 (because jobId is used to identify the worker);
- thread1: doesn't send ExecuteResponse because jobWorker0 has already been canceled.

*Proposal*
Try to use the system default equals() for GridJobWorker.
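
The interleaving described in the suggestion can be replayed sequentially in a small standalone sketch. It is not Ignite code: IdEqualsWorker, IdentityWorker and staleRemoveEvictsNewWorker are hypothetical stand-ins, and the only real API used is ConcurrentHashMap.remove(key, value), which removes the entry only if the mapped value equals() the given one. Under these assumptions, a worker whose equals() compares only jobId lets a stale reference evict the newer worker, while the default identity equals() does not.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PassiveJobsRemoveSketch {
    /** Hypothetical stand-in for a worker whose equals() compares only the job id. */
    static class IdEqualsWorker {
        final String jobId;

        IdEqualsWorker(String jobId) { this.jobId = jobId; }

        @Override public boolean equals(Object o) {
            return o instanceof IdEqualsWorker && jobId.equals(((IdEqualsWorker) o).jobId);
        }

        @Override public int hashCode() { return jobId.hashCode(); }
    }

    /** Hypothetical stand-in that keeps Object's default (identity) equals(). */
    static class IdentityWorker {
        final String jobId;

        IdentityWorker(String jobId) { this.jobId = jobId; }
    }

    /**
     * Replays the steps in one thread: worker0 is removed, worker1 is put under
     * the same job id, then a stale reference to worker0 is passed to
     * remove(key, value). Returns true if that stale remove evicted worker1.
     */
    static <W> boolean staleRemoveEvictsNewWorker(W worker0, W worker1, String jobId) {
        ConcurrentMap<String, W> passiveJobs = new ConcurrentHashMap<>();

        passiveJobs.put(jobId, worker0);
        passiveJobs.remove(jobId, worker0);        // "thread0" removes worker0
        passiveJobs.put(jobId, worker1);           // worker1 arrives with the same id

        return passiveJobs.remove(jobId, worker0); // "thread1" uses the stale worker0
    }

    public static void main(String[] args) {
        // equals() by job id: the stale remove matches worker1 and evicts it.
        System.out.println(staleRemoveEvictsNewWorker(
            new IdEqualsWorker("job-1"), new IdEqualsWorker("job-1"), "job-1")); // prints true

        // Default equals(): the stale remove does not match worker1.
        System.out.println(staleRemoveEvictsNewWorker(
            new IdentityWorker("job-1"), new IdentityWorker("job-1"), "job-1")); // prints false
    }
}
```

In the second case worker1 stays in the map after the stale remove, which is the behavior the proposal aims for.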