Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-10 Thread via GitHub


pbacsko closed pull request #837: [YUNIKORN-2543] Fix locking in RMProxy
URL: https://github.com/apache/yunikorn-core/pull/837


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-10 Thread via GitHub


pbacsko commented on PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2047240375

   @chenyulin0719 thanks, that's perfectly normal, so it's good news. It just 
shouldn't show up in the happened before/after call stack.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-10 Thread via GitHub


chenyulin0719 commented on PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2047205971

   Hi @pbacsko, 
   
   Just recheck the [go-deadlock 
doc.](https://github.com/sasha-s/go-deadlock?tab=readme-ov-file#deadlocks) 
   I think I mixup below log categories:
   1. happened before
   2. happened after
   3. Other goroutines holding locks
   
   After this PR, all the 'rmp.RLock' related logs are under "Other goroutines 
holding locks".
   It's merely a information for touble shooting, and doesn't indicate that 
there have potential deadlock.
   
   Attached the logs file: 
[deadlock-log-after-YUNIKORN-2543.txt](https://github.com/apache/yunikorn-core/files/14931175/deadlock-log-after-YUNIKORN-2543.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-10 Thread via GitHub


pbacsko commented on code in PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1559177908


##
pkg/entrypoint/entrypoint.go:
##
@@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) 
*ServiceContext {
// start services
log.Log(log.Entrypoint).Info("ServiceContext start scheduling services")
sched.StartService(eventHandler, opts.manualScheduleFlag)
-   proxy.StartService(eventHandler)
+   proxy.StartService()

Review Comment:
   Yeah that's correct. I think we can fix this in a follow-up JIRA. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-10 Thread via GitHub


pbacsko commented on PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2047095393

   @chenyulin0719 IMO there shouldn't be any RMProxy related potential 
deadlocks. Could you share your `deadlock-log.txt` with us to see what's going 
on?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-10 Thread via GitHub


chenyulin0719 commented on code in PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584


##
pkg/entrypoint/entrypoint.go:
##
@@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) 
*ServiceContext {
// start services
log.Log(log.Entrypoint).Info("ServiceContext start scheduling services")
sched.StartService(eventHandler, opts.manualScheduleFlag)
-   proxy.StartService(eventHandler)
+   proxy.StartService()

Review Comment:
   I'm thinking about the necessity of having handler.EventHandlers struct.
   
   Not sure why we need it...  IMO, I think it would also work if we remove  
handler.EventHandlers and change the startService() functions to:
   ```
sched.StartService(proxy, opts.manualScheduleFlag)
proxy.StartService()
   ```
   
   I suppose the reason for having handler.EventHandlers is to group all 
EventHandler objects.  However, if we implement this change... the 
handler.EventHandlers will be reduntant since it's only existing for 
RMProxyEventHandler .
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-10 Thread via GitHub


chenyulin0719 commented on code in PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584


##
pkg/entrypoint/entrypoint.go:
##
@@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) 
*ServiceContext {
// start services
log.Log(log.Entrypoint).Info("ServiceContext start scheduling services")
sched.StartService(eventHandler, opts.manualScheduleFlag)
-   proxy.StartService(eventHandler)
+   proxy.StartService()

Review Comment:
   I'm thinking about the necessity of having handler.EventHandlers struct.
   
   Not sure why we need it...  IMO, I think it would also works if we remove  
handler.EventHandlers and change the startService() functions to:
   ```
sched.StartService(proxy, opts.manualScheduleFlag)
proxy.StartService()
   ```
   
   I suppose the reason for having handler.EventHandlers is to group all 
EventHandler objects.  However, if we implement this change... the 
handler.EventHandlers will be reduntant since it's only existing for 
RMProxyEventHandler .
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-09 Thread via GitHub


chenyulin0719 commented on code in PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584


##
pkg/entrypoint/entrypoint.go:
##
@@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) 
*ServiceContext {
// start services
log.Log(log.Entrypoint).Info("ServiceContext start scheduling services")
sched.StartService(eventHandler, opts.manualScheduleFlag)
-   proxy.StartService(eventHandler)
+   proxy.StartService()

Review Comment:
   I'm thinking about the necessity of having handler.EventHandlers struct.
   
   Not sure why we need it...  IMO, I think it would also works if we remove  
handler.EventHandlers and change the startService() function to:
   ```
sched.StartService(proxy, opts.manualScheduleFlag)
proxy.StartService()
   ```
   
   I suppose the reason for having handler.EventHandlers is to group all 
EventHandler objects.  However, if we implement this change... the 
handler.EventHandlers will be reduntant since it's only existing for 
RMProxyEventHandler .
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-09 Thread via GitHub


chenyulin0719 commented on PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2046652335

   LGTM for the rmp lock scope change.  After this PR, I can see the number of 
lock-related false alert has significantly reduced.
   
   Test command:
   `DEADLOCK_DETECTION_ENABLED=true DEADLOCK_TIMEOUT_SECONDS=60 go test 
./pkg/shim/... -race -count=100 > deadlock-log.txt `
   
   Before: ~3441 'rmp.RLock' related logs in my deadlock-log.txt
   After: ~17  'rmp.RLock' related logs in my deadlock-log.txt
   
   This change has increased the readability of our lock trace when a deadlock 
occurs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-09 Thread via GitHub


chenyulin0719 commented on code in PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584


##
pkg/entrypoint/entrypoint.go:
##
@@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) 
*ServiceContext {
// start services
log.Log(log.Entrypoint).Info("ServiceContext start scheduling services")
sched.StartService(eventHandler, opts.manualScheduleFlag)
-   proxy.StartService(eventHandler)
+   proxy.StartService()

Review Comment:
   I'm thinking about the necessity of having handler.EventHandlers struct.
   
   Not sure why we need it...  IMO, I think it would also works if we remove  
handler.EventHandlers and change the startService() function to:
   ```
sched.StartService(proxy, opts.manualScheduleFlag)
proxy.StartService()
   ```
   
   I suppose the reason for having have handler.EventHandlers is to group all 
EventHandler objects.  However, if we implement this change... the 
handler.EventHandlers will be reduntant since it's only existing for 
RMProxyEventHandler .
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-09 Thread via GitHub


chenyulin0719 commented on code in PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#discussion_r1558922584


##
pkg/entrypoint/entrypoint.go:
##
@@ -90,7 +89,7 @@ func startAllServicesWithParameters(opts startupOptions) 
*ServiceContext {
// start services
log.Log(log.Entrypoint).Info("ServiceContext start scheduling services")
sched.StartService(eventHandler, opts.manualScheduleFlag)
-   proxy.StartService(eventHandler)
+   proxy.StartService()

Review Comment:
   I'm thinking about the necessity of having handler.EventHandlers struct.
   
   Not sure why we need it...  To me, I think it would also works if we remove  
handler.EventHandlers and change the startService() function to:
   ```
sched.StartService(proxy, opts.manualScheduleFlag)
proxy.StartService()
   ```
   
   I suppose the reason for having have handler.EventHandlers is to group all 
EventHandler objects.  However, if we implement this change... the 
handler.EventHandlers will be reduntant since it's only existing for 
RMProxyEventHandler .
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-09 Thread via GitHub


pbacsko commented on PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2045382952

   @chenyulin0719 yes, enable the deadlock detection and run a smoke test. If 
the tests in the core do not trigger these, run the smokes from the shim, that 
will definitely do the job.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-09 Thread via GitHub


chenyulin0719 commented on PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2045377668

   Hi @pbacsko, 
   May I ask what test workload you're using to reproduce the lock detection 
warnings?
   (I've followed [Craig's 
guide](https://issues.apache.org/jira/browse/YUNIKORN-2521?focusedCommentId=17834392&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17834392)
 to enable deadlock detection and can see below logs in scheduler pod: 
   - "=== Deadlock detection enabled (timeout: 60 seconds) ==="
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [YUNIKORN-2543] Fix locking in RMProxy [yunikorn-core]

2024-04-08 Thread via GitHub


codecov-commenter commented on PR #837:
URL: https://github.com/apache/yunikorn-core/pull/837#issuecomment-2042985036

   ## 
[Codecov](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `93.75000%` with `1 lines` in your changes are 
missing coverage. Please review.
   > Project coverage is 77.02%. Comparing base 
[(`f0fe932`)](https://app.codecov.io/gh/apache/yunikorn-core/commit/f0fe93293079e89714e7e8ba75c90d0f505ec07a?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`b3a1b72`)](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   | 
[Files](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[pkg/rmproxy/rmproxy.go](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?src=pr&el=tree&filepath=pkg%2Frmproxy%2Frmproxy.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGtnL3JtcHJveHkvcm1wcm94eS5nbw==)
 | 92.85% | [1 Missing :warning: 
](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@Coverage Diff @@
   ##   master #837  +/-   ##
   ==
   + Coverage   77.01%   77.02%   +0.01% 
   ==
 Files  97   97  
 Lines   1204312031  -12 
   ==
   - Hits 9275 9267   -8 
   + Misses   2432 2427   -5 
   - Partials  336  337   +1 
   ```
   
   
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/yunikorn-core/pull/837?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org