Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
pbacsko closed pull request #837: [YUNIKORN-2543] Fix locking in RMProxy URL: https://github.com/apache/yunikorn-core/pull/837 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
pbacsko commented on PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2047240375 @chenyulin0719 thanks, that's perfectly normal, so it's good news. It just shouldn't show up in the happened before/after call stack. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2047205971 Hi @pbacsko, Just recheck the [go-deadlock doc.](https://github.com/sasha-s/go-deadlock?tab=readme-ov-file#deadlocks) I think I mixup below log categories: 1. happened before 2. happened after 3. Other goroutines holding locks After this PR, all the 'rmp.RLock' related logs are under "Other goroutines holding locks". It's merely a information for touble shooting, and doesn't indicate that there have potential deadlock. Attached the logs file: [deadlock-log-after-YUNIKORN-2543.txt](https://github.com/apache/yunikorn-core/files/14931175/deadlock-log-after-YUNIKORN-2543.txt) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
pbacsko commented on code in PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1559177908 ## pkg/entrypoint/entrypoint.go: ## @@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) *ServiceContext { // start services log.Log(log.Entrypoint).Info("ServiceContext start scheduling services") sched.StartService(eventHandler, opts.manualScheduleFlag) - proxy.StartService(eventHandler) + proxy.StartService() Review Comment: Yeah that's correct. I think we can fix this in a follow-up JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
pbacsko commented on PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2047095393 @chenyulin0719 IMO there shouldn't be any RMProxy related potential deadlocks. Could you share your `deadlock-log.txt` with us to see what's going on? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on code in PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584 ## pkg/entrypoint/entrypoint.go: ## @@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) *ServiceContext { // start services log.Log(log.Entrypoint).Info("ServiceContext start scheduling services") sched.StartService(eventHandler, opts.manualScheduleFlag) - proxy.StartService(eventHandler) + proxy.StartService() Review Comment: I'm thinking about the necessity of having handler.EventHandlers struct. Not sure why we need it... IMO, I think it would also work if we remove handler.EventHandlers and change the startService() functions to: ``` sched.StartService(proxy, opts.manualScheduleFlag) proxy.StartService() ``` I suppose the reason for having handler.EventHandlers is to group all EventHandler objects. However, if we implement this change... the handler.EventHandlers will be reduntant since it's only existing for RMProxyEventHandler . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on code in PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584 ## pkg/entrypoint/entrypoint.go: ## @@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) *ServiceContext { // start services log.Log(log.Entrypoint).Info("ServiceContext start scheduling services") sched.StartService(eventHandler, opts.manualScheduleFlag) - proxy.StartService(eventHandler) + proxy.StartService() Review Comment: I'm thinking about the necessity of having handler.EventHandlers struct. Not sure why we need it... IMO, I think it would also works if we remove handler.EventHandlers and change the startService() functions to: ``` sched.StartService(proxy, opts.manualScheduleFlag) proxy.StartService() ``` I suppose the reason for having handler.EventHandlers is to group all EventHandler objects. However, if we implement this change... the handler.EventHandlers will be reduntant since it's only existing for RMProxyEventHandler . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on code in PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584 ## pkg/entrypoint/entrypoint.go: ## @@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) *ServiceContext { // start services log.Log(log.Entrypoint).Info("ServiceContext start scheduling services") sched.StartService(eventHandler, opts.manualScheduleFlag) - proxy.StartService(eventHandler) + proxy.StartService() Review Comment: I'm thinking about the necessity of having handler.EventHandlers struct. Not sure why we need it... IMO, I think it would also works if we remove handler.EventHandlers and change the startService() function to: ``` sched.StartService(proxy, opts.manualScheduleFlag) proxy.StartService() ``` I suppose the reason for having handler.EventHandlers is to group all EventHandler objects. However, if we implement this change... the handler.EventHandlers will be reduntant since it's only existing for RMProxyEventHandler . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2046652335 LGTM for the rmp lock scope change. After this PR, I can see the number of lock-related false alert has significantly reduced. Test command: `DEADLOCK_DETECTION_ENABLED=true DEADLOCK_TIMEOUT_SECONDS=60 go test ./pkg/shim/... -race -count=100 > deadlock-log.txt ` Before: ~3441 'rmp.RLock' related logs in my deadlock-log.txt After: ~17 'rmp.RLock' related logs in my deadlock-log.txt This change has increased the readability of our lock trace when a deadlock occurs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on code in PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584 ## pkg/entrypoint/entrypoint.go: ## @@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) *ServiceContext { // start services log.Log(log.Entrypoint).Info("ServiceContext start scheduling services") sched.StartService(eventHandler, opts.manualScheduleFlag) - proxy.StartService(eventHandler) + proxy.StartService() Review Comment: I'm thinking about the necessity of having handler.EventHandlers struct. Not sure why we need it... IMO, I think it would also works if we remove handler.EventHandlers and change the startService() function to: ``` sched.StartService(proxy, opts.manualScheduleFlag) proxy.StartService() ``` I suppose the reason for having have handler.EventHandlers is to group all EventHandler objects. However, if we implement this change... the handler.EventHandlers will be reduntant since it's only existing for RMProxyEventHandler . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on code in PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584 ## pkg/entrypoint/entrypoint.go: ## @@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) *ServiceContext { // start services log.Log(log.Entrypoint).Info("ServiceContext start scheduling services") sched.StartService(eventHandler, opts.manualScheduleFlag) - proxy.StartService(eventHandler) + proxy.StartService() Review Comment: I'm thinking about the necessity of having handler.EventHandlers struct. Not sure why we need it... To me, I think it would also works if we remove handler.EventHandlers and change the startService() function to: ``` sched.StartService(proxy, opts.manualScheduleFlag) proxy.StartService() ``` I suppose the reason for having have handler.EventHandlers is to group all EventHandler objects. However, if we implement this change... the handler.EventHandlers will be reduntant since it's only existing for RMProxyEventHandler . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
pbacsko commented on PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2045382952 @chenyulin0719 yes, enable the deadlock detection and run a smoke test. If the tests in the core do not trigger these, run the smokes from the shim, that will definitely do the job. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
chenyulin0719 commented on PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2045377668 Hi @pbacsko, May I ask what test workload you're using to reproduce the lock detection warnings? (I've followed [Craig's guide](https://issues.apache.org/jira/browse/YUNIKORN-2521?focusedCommentId=17834392&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17834392) to enable deadlock detection and can see below logs in scheduler pod: - "=== Deadlock detection enabled (timeout: 60 seconds) ===" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]
codecov-commenter commented on PR #837: URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2042985036 ## [Codecov](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `93.75000%` with `1 lines` in your changes are missing coverage. Please review. > Project coverage is 77.02%. Comparing base [(`f0fe932`)](https://app.codecov.io/gh/apache/yunikorn-core/commit/f0fe93293079e89714e7e8ba75c90d0f505ec07a?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`b3a1b72`)](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). | [Files](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [pkg/rmproxy/rmproxy.go](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?src=pr&el=tree&filepath=pkg%2Frmproxy%2Frmproxy.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGtnL3JtcHJveHkvcm1wcm94eS5nbw==) | 92.85% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@Coverage Diff @@ ## master #837 +/- ## == + Coverage 77.01% 77.02% +0.01% == Files 97 97 Lines 1204312031 -12 == - Hits 9275 9267 -8 + Misses 2432 2427 -5 - Partials 336 337 +1 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org