[jira] [Assigned] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YUNIKORN-2562:
--

Assignee: Peter Bacsko

> Nil pointer in Application.ReplaceAllocation()
> --
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Kiran Arangale (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838568#comment-17838568
 ] 

Kiran Arangale commented on YUNIKORN-2562:
--

This happens when we kill the YuniKorn scheduler pod: the new one then 
launches with this error. It also results in 503 Service Unavailable for 
the Web UI.

> Nil pointer in Application.ReplaceAllocation()
> --
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2562:
---
Target Version: 1.6.0, 1.5.1  (was: 1.5.1)

> Nil pointer in Application.ReplaceAllocation()
> --
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2562) Nil pointer panic in Application.ReplaceAllocation()

2024-04-18 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2562:
---
Summary: Nil pointer panic in Application.ReplaceAllocation()  (was: Nil 
pointer in Application.ReplaceAllocation())

> Nil pointer panic in Application.ReplaceAllocation()
> 
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838707#comment-17838707
 ] 

Xi Chen commented on YUNIKORN-2521:
---

Hey [~pbacsko], I built the scheduler from branch-1.5 and deployed it in 
our environment. Our setup includes multiple namespaces running Spark K8s 
jobs. The scheduler was still reporting POTENTIAL DEADLOCK. Please check the 
log [^deadlock_2024-04-18.log], thanks!
 

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838512#comment-17838512
 ] 

Peter Bacsko edited comment on YUNIKORN-2562 at 4/18/24 3:25 PM:
-

{noformat}
 41d77", "placeholder": false, "pendingDelta": "map[memory:4723834880 pods:1 
vcore:1000]"}
2024-04-18T06:49:34.944Z INFO core.scheduler.queue objects/queue.go:1408 
allocation found on queue {"queueName": "root.xxx-spark", "appID": 
"application-spark-4rrgafat101r", "allocation": 
"applicationID=application-spark-4rrgafat101r, 
allocationID=e2d99aaf-6889-4d48-ac70-69e286c41d77-0, 
allocationKey=e2d99aaf-6889-4d48-ac70-69e286c41d77, 
Node=aks-obemuatnew-34197442-vmss08, result=Replaced"}
2024-04-18T06:49:34.944Z INFO core.scheduler.partition 
scheduler/partition.go:867 scheduler replace placeholder processed {"appID": 
"application-spark-4rrgafat101r", "allocationKey": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77", "allocationID": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77-0", "placeholder released allocationID": 
"9f0e05fa-3d83-4dda-b993-b696af298420-0"}
2024-04-18T06:49:34.945Z INFO shim.cache.application cache/application.go:602 
try to release pod from application {"appID": "application-spark-4rrgafat101r", 
"allocationID": "9f0e05fa-3d83-4dda-b993-b696af298420-0", "terminationType": 
"PLACEHOLDER_REPLACED"}
2024-04-18T06:49:35.017Z INFO core.scheduler scheduler/scheduler.go:101 Found 
outstanding requests that will trigger autoscaling {"number of requests": 1, 
"total resources": "map[memory:11811160064 pods:1 vcore:2000]"}
2024-04-18T06:49:35.077Z INFO shim.context cache/context.go:1123 task added 
{"appID": "application-spark-34b5vjdbgeb4", "taskID": 
"5ca32f14-df38-48b3-b420-e17f557dfa33", "taskState": "New"}
2024-04-18T06:49:35.139Z INFO shim.cache.task cache/task.go:542 releasing 
allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
2024-04-18T06:49:35.139Z INFO shim.fsm cache/task_state.go:380 Task state 
transition {"app": "application-spark-x2bwqi3mjr5q", "task": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "taskAlias": 
"obem-spark/tg-application-spark-x2bwqi3mjr-spark-driver-llg4emobvz", "source": 
"Bound", "destination": "Completed", "event": "CompleteTask"}
2024-04-18T06:49:35.139Z INFO core.scheduler.application 
objects/application.go:616 ask removed successfully from application {"appID": 
"application-spark-x2bwqi3mjr5q", "ask": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "pendingDelta": "map[]"}
2024-04-18T06:49:35.139Z INFO core.scheduler.partition 
scheduler/partition.go:1281 replacing placeholder allocation {"appID": 
"application-spark-x2bwqi3mjr5q", "allocationID": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
 
goroutine 129 [running]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc007dcfc00,
 {0xc00630a390, 0x24})
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
 +0x615
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?,
 0xc007f19b00)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc004562ba0?,
 {0xc0098172a0, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc003a43f58?,
 0xc003a43f10?)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000428390)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
created by 
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
goroutine 1
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 
+0x9c{noformat}
 


was (Author: JIRAUSER305116):
41d77", "placeholder": false, "pendingDelta": "map[memory:4723834880 pods:1 
vcore:1000]"}

2024-04-18T06:49:34.944Z INFO core.scheduler.queue objects/queue.go:1408 
allocation found on queue \{"queueName": "root.xxx-spark", "appID": 
"application-spark-4rrgafat101r", "allocation": 
"applicationID=application-spark-4rrgafat101r, 
allocationID=e2d99aaf-6889-4d48-ac70-69e286c41d77-0, 
allocationKey=e2d99aaf-6889-4d48-ac70-69e286c41d77, 
Node=aks-obemuatnew-34197442-vmss08, result=Replaced"}

2024-04-18T06:49:34.944Z INFO core.scheduler.partition 
scheduler/partition.go:867 scheduler replace placeholder processed \{"appID": 
"application-spark-4rrgafat101r", "allocationKey": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77", "allocationID": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77-0", "placeholder released 

[jira] [Comment Edited] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838510#comment-17838510
 ] 

Peter Bacsko edited comment on YUNIKORN-2562 at 4/18/24 3:24 PM:
-

Adding more comments - actually, queue capacity gradually degrades even though 
we have capacity available. For example, say my max allocation is 1.5 TB: 
initially it works well, but after a few days (2+ days) utilisation comes down 
to 60% of max capacity, and in spite of available resources the queue gets 
limited to roughly 55-65% of max. Upon restart, YuniKorn keeps crashing for a 
long time; eventually (after 15-20 minutes to an hour) it starts working 
again. Adding a few logs here:

 
{noformat}
41d77", "placeholder": false, "pendingDelta": "map[memory:4723834880 pods:1 
vcore:1000]"}

2024-04-18T06:49:34.944Z INFO core.scheduler.queue objects/queue.go:1408 
allocation found on queue \{"queueName": "root.xxx-spark", "appID": 
"application-spark-4rrgafat101r", "allocation": 
"applicationID=application-spark-4rrgafat101r, 
allocationID=e2d99aaf-6889-4d48-ac70-69e286c41d77-0, 
allocationKey=e2d99aaf-6889-4d48-ac70-69e286c41d77, 
Node=aks-obemuatnew-34197442-vmss08, result=Replaced"}
2024-04-18T06:49:34.944Z INFO core.scheduler.partition 
scheduler/partition.go:867 scheduler replace placeholder processed \{"appID": 
"application-spark-4rrgafat101r", "allocationKey": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77", "allocationID": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77-0", "placeholder released allocationID": 
"9f0e05fa-3d83-4dda-b993-b696af298420-0"}
2024-04-18T06:49:34.945Z INFO shim.cache.application cache/application.go:602 
try to release pod from application \{"appID": 
"application-spark-4rrgafat101r", "allocationID": 
"9f0e05fa-3d83-4dda-b993-b696af298420-0", "terminationType": 
"PLACEHOLDER_REPLACED"}
2024-04-18T06:49:35.017Z INFO core.scheduler scheduler/scheduler.go:101 Found 
outstanding requests that will trigger autoscaling \{"number of requests": 1, 
"total resources": "map[memory:11811160064 pods:1 vcore:2000]"}
2024-04-18T06:49:35.077Z INFO shim.context cache/context.go:1123 task added 
\{"appID": "application-spark-34b5vjdbgeb4", "taskID": 
"5ca32f14-df38-48b3-b420-e17f557dfa33", "taskState": "New"}
2024-04-18T06:49:35.139Z INFO shim.cache.task cache/task.go:542 releasing 
allocations \{"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
2024-04-18T06:49:35.139Z INFO shim.fsm cache/task_state.go:380 Task state 
transition \{"app": "application-spark-x2bwqi3mjr5q", "task": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "taskAlias": 
"obem-spark/tg-application-spark-x2bwqi3mjr-spark-driver-llg4emobvz", "source": 
"Bound", "destination": "Completed", "event": "CompleteTask"}
2024-04-18T06:49:35.139Z INFO core.scheduler.application 
objects/application.go:616 ask removed successfully from application \{"appID": 
"application-spark-x2bwqi3mjr5q", "ask": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "pendingDelta": "map[]"}
2024-04-18T06:49:35.139Z INFO core.scheduler.partition 
scheduler/partition.go:1281 replacing placeholder allocation \{"appID": 
"application-spark-x2bwqi3mjr5q", "allocationID": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3"}
panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]


goroutine 129 [running]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc007dcfc00,
 \{0xc00630a390, 0x24})
 
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
 +0x615
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?,
 0xc007f19b00)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc004562ba0?,
 \{0xc0098172a0, 0x1, 0x40a0fa?}, \{0x1e0d902, 0x9})
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc003a43f58?,
 0xc003a43f10?)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000428390)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
created by 
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
goroutine 1
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
{noformat}
 

 


was (Author: JIRAUSER305116):
Adding more comments - Actually queue capacity gradually degrades even though 
we have capacity available [ example - Lets say my Max allocation is 1.5 TB so 
initially it works well but post few days [2+ days ]this utilisation come down 
to 60% of max capacity where 

[jira] [Updated] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2521:
--
Attachment: deadlock_2024-04-18.log

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838718#comment-17838718
 ] 

Peter Bacsko commented on YUNIKORN-2521:


Many thanks [~jshmchenxi]! That's indeed a new finding, and it involves preemption.
We need a unit test that triggers this, then we can fix it.


> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838703#comment-17838703
 ] 

Peter Bacsko edited comment on YUNIKORN-2562 at 4/18/24 3:27 PM:
-

[~kiranarangale] thanks, as we discussed on Slack, here's the root cause: the 
panic occurs when we have a number of placeholder allocations after a restart, 
but there is also a new placeholder ask with a different task group. In this 
case we create the {{placeholderData}} map, so the lone "nil" check by itself 
does not defend against the panic:
{noformat}
func (sa *Application) ReplaceAllocation(uuid string) *Allocation {
	sa.Lock()
	defer sa.Unlock()

	[ ... removed for clarity ... ]

	if sa.placeholderData != nil {
		sa.placeholderData[ph.GetTaskGroup()].Replaced++ // <- map exists, but no entry for ph.GetTaskGroup()
	}
	return ph
}
{noformat}
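
The failure mode can be reproduced in isolation. Here's a minimal standalone sketch (a hypothetical {{placeholderData}} shape in plain Go, not YuniKorn code):
{noformat}
package main

// placeholderData stands in for the per-task-group record; only the
// Replaced counter matters for this repro.
type placeholderData struct{ Replaced int }

func main() {
	// The map exists and has an entry for "tg-1" only.
	m := map[string]*placeholderData{"tg-1": {}}
	if m != nil { // the lone nil check on the map passes...
		m["tg-2"].Replaced++ // ...but "tg-2" has no entry: nil pointer dereference
	}
}
{noformat}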


was (Author: pbacsko):
[~kiranarangale] thanks, as we discussed on Slack, here's the root cause: this 
panic occurs when we have "n" number of placeholder allocations after restart, 
HOWEVER, there's a new placeholder ask with a different task group. In this 
case, we'll create the {{placeholderData}} map and the lone "nil" check itself 
will not defend against the panic:

{noformat}
func (sa *Application) ReplaceAllocation(uuid string) *Allocation {
	sa.Lock()
	defer sa.Unlock()

	[ ... removed for clarity ... ]

	if sa.placeholderData != nil {
		sa.placeholderData[ph.GetTaskGroup()].Replaced++ // <- map exists, but no entry for a certain taskgroup
	}
	return ph
}
{noformat}



> Nil pointer in Application.ReplaceAllocation()
> --
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org

[jira] [Commented] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838703#comment-17838703
 ] 

Peter Bacsko commented on YUNIKORN-2562:


[~kiranarangale] thanks, as we discussed on Slack, here's the root cause: the 
panic occurs when we have a number of placeholder allocations after a restart, 
but there is also a new placeholder ask with a different task group. In this 
case we create the {{placeholderData}} map, so the lone "nil" check by itself 
does not defend against the panic:

{noformat}
func (sa *Application) ReplaceAllocation(uuid string) *Allocation {
	sa.Lock()
	defer sa.Unlock()

	[ ... removed for clarity ... ]

	if sa.placeholderData != nil {
		sa.placeholderData[ph.GetTaskGroup()].Replaced++ // <- map exists, but no entry for a certain taskgroup
	}
	return ph
}
{noformat}
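
One way to guard against this, sketched directly against the snippet above (an illustration of the missing per-task-group check, not necessarily the fix that will land):
{noformat}
	// Guard the per-task-group entry as well as the map itself: look the
	// entry up first and only update it when it exists, instead of
	// dereferencing a possibly-nil map value.
	if sa.placeholderData != nil {
		if phData, ok := sa.placeholderData[ph.GetTaskGroup()]; ok {
			phData.Replaced++
		}
		// no entry for this task group yet: nothing to update, no panic
	}
{noformat}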



> Nil pointer in Application.ReplaceAllocation()
> --
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838732#comment-17838732
 ] 

Craig Condit commented on YUNIKORN-2521:


[~jshmchenxi], [~pbacsko] I think this latest log (2024-04-18) might be a false 
positive. TryAllocate() is in all the mentioned code paths (and takes an app 
lock), and preemption requires walking other apps which takes their locks as 
well. So the detector can see App A locked, then App B, and in another run, App 
B first, then App A. However, there's no way for TryAllocate() to be active in 
multiple goroutines (it's only ever called from the main scheduler loop). So I 
think this isn't actually an issue, just something that trips up the detector.
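
For illustration, here is a self-contained sketch (hypothetical types in plain Go, not YuniKorn code) of the pattern that trips lock-order detectors:
{noformat}
package main

import "sync"

// app stands in for a scheduler object guarded by its own lock.
type app struct{ sync.Mutex }

// walk locks two apps in the order given, like an allocation pass
// iterating over applications.
func walk(first, second *app) {
	first.Lock()
	defer first.Unlock()
	second.Lock() // acquisition order depends on iteration order
	defer second.Unlock()
}

func main() {
	a, b := &app{}, &app{}
	walk(a, b) // detector records lock order a -> b
	walk(b, a) // a later pass records b -> a: the inversion is flagged as
	           // POTENTIAL DEADLOCK, yet one goroutine running these
	           // calls sequentially can never actually deadlock
}
{noformat}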

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838821#comment-17838821
 ] 

Xi Chen commented on YUNIKORN-2521:
---

[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying 
on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe 
using `/ws/v1/fullstatedump` to see if this really hangs the scheduler. WDYT?
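
For reference, such a probe on the scheduler pod might look roughly like this (a sketch only; the port and thresholds are assumptions to adapt, not recommendations):
{noformat}
livenessProbe:
  httpGet:
    path: /ws/v1/fullstatedump   # hangs if the scheduler is truly deadlocked
    port: 9080                   # default YuniKorn REST port
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 30             # a failed probe restarts the pod via kubelet
  failureThreshold: 3
{noformat}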

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2562) Nil pointer panic in Application.ReplaceAllocation()

2024-04-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2562:
-
Labels: pull-request-available  (was: )

> Nil pointer panic in Application.ReplaceAllocation()
> 
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
> 	github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2570) Add test cases to break the current preemption flow

2024-04-18 Thread Manikandan R (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manikandan R updated YUNIKORN-2570:
---
Description: Add various test cases to break the current preemption flow. 
These tests would fail now. Follow-up jiras 
[https://issues.apache.org/jira/browse/YUNIKORN-2500] should fix the problems 
in the current preemption flow so that these test cases pass.  (was: Add 
various test cases to break the current preemption flow. These test would fail 
now. Follow up jira's should fix the problems in current preemption flow so 
that these test cases should pass.)

> Add test cases to break the current preemption flow
> ---
>
> Key: YUNIKORN-2570
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2570
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Fix For: 1.6.0
>
>
> Add various test cases to break the current preemption flow. These tests would 
> fail now. Follow-up jiras 
> [https://issues.apache.org/jira/browse/YUNIKORN-2500] should fix the problems 
> in the current preemption flow so that these test cases pass.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



(yunikorn-site) branch master updated: [YUNIKORN-2522] Move e2e test doc from k8shim to website (#420)

2024-04-18 Thread chenyulin0719
This is an automated email from the ASF dual-hosted git repository.

chenyulin0719 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
 new c1a31ec63e [YUNIKORN-2522] Move e2e test doc from k8shim to website 
(#420)
c1a31ec63e is described below

commit c1a31ec63e9094a8e7c10cddf5fd47f02059e183
Author: targetoee 
AuthorDate: Fri Apr 19 11:45:34 2024 +0800

[YUNIKORN-2522] Move e2e test doc from k8shim to website (#420)

Closes: #420

Signed-off-by: Yu-Lin Chen 
---
 docs/developer_guide/e2e_test.md | 97 
 sidebars.js  |  1 +
 2 files changed, 98 insertions(+)

diff --git a/docs/developer_guide/e2e_test.md b/docs/developer_guide/e2e_test.md
new file mode 100644
index 00..99d5c6fdd6
--- /dev/null
+++ b/docs/developer_guide/e2e_test.md
@@ -0,0 +1,97 @@
+---
+id: e2e_test
+title: End-to-End Testing
+---
+
+
+
+End-to-end (e2e) tests for YuniKorn-K8shim provide a mechanism to test end-to-end behavior of the system, and are the last signal to ensure end user operations match developer specifications.
+
+The primary objectives of the e2e tests are to ensure consistent and reliable behavior of the YuniKorn code base, and to catch hard-to-test bugs before users do, when unit and integration tests are insufficient.
+
+The e2e tests are built atop [Ginkgo](https://onsi.github.io/ginkgo/) and [Gomega](https://github.com/onsi/gomega). These Behavior-Driven Development (BDD) testing frameworks provide a host of features, and it is recommended that the developer read their documentation before diving into the tests.
+
+Below is the structure of the e2e tests, all contained within the [yunikorn-k8shim](https://github.com/apache/yunikorn-k8shim) repository.
+* `test/e2e/` contains tests for YuniKorn features like scheduling, predicates, etc.
+* `test/e2e/framework/configManager` manages & maintains the test and cluster configuration
+* `test/e2e/framework/helpers` contains utility modules for the k8s client, (de)serializers, the REST API client and other common libraries
+* `test/e2e/testdata` contains all the test-related data like configmaps, pod specs, etc.
+
+## Pre-requisites
+This project requires Go to be installed. On OS X with Homebrew you can just run `brew install go`, or follow the installation guide at https://golang.org/doc/install.
+
+## Understanding the Command Line Arguments
+* `yk-namespace` - namespace under which YuniKorn is deployed. [Required]
+* `kube-config` - path to the kube config file, needed for the k8s client. [Required]
+* `yk-host` - hostname of the YuniKorn REST server, defaults to localhost.
+* `yk-port` - port number of the YuniKorn REST server, defaults to 9080.
+* `yk-scheme` - scheme of the YuniKorn REST server, defaults to http.
+* `timeout` - timeout for all tests, defaults to 24 hours.
+
+## Launching Tests
+
+### Trigger through CLI
+* Begin by installing a new cluster dedicated to testing, such as one named 'yktest'.
+```shell
+./scripts/run-e2e-tests.sh -a install -n yktest -v kindest/node:v1.28.0
+```
+
+* Launching the CI tests is as simple as below.
+```shell
+# We need to add a 'kind' prefix to the argument of the run-e2e-tests.sh -n command.
+kubectl config use-context kind-yktest
+ginkgo -r -v ci -timeout=2h -- -yk-namespace "yunikorn" -kube-config "$HOME/.kube/config"
+```
+
+* Launching all the tests.
+```shell
+ginkgo -r -v -timeout=2h -- -yk-namespace "yunikorn" -kube-config "$HOME/.kube/config"
+```
+
+* Launching all the tests in a specified e2e folder, e.g. test/e2e/user_group_limit/.
+```shell
+cd test/e2e/
+ginkgo -r user_group_limit -v -- -yk-namespace "yunikorn" -kube-config "$HOME/.kube/config"
+```
+
+* Launching a specified test file.
+```shell
+cd test/e2e/
+ginkgo run -r -v --focus-file "admission_controller_test.go" -- -yk-namespace "yunikorn" -kube-config "$HOME/.kube/config"
+```
+
+* Launching a specified test, e.g. the test with ginkgo.it() spec name "Verify_maxapplications_with_a_specific_group_limit".
+```shell
+cd test/e2e/
+ginkgo run -r -v --focus "Verify_maxapplications_with_a_specific_group_limit" -- -yk-namespace "yunikorn" -kube-config "$HOME/.kube/config"
+```
+
+* Launching all the tests except a specified test file.
+```shell
+cd test/e2e/
+ginkgo run -r -v --skip-file "admission_controller_test.go" -- -yk-namespace "yunikorn" -kube-config "$HOME/.kube/config"
+```
+
+* Delete the cluster after we finish testing (this step is optional).
+```shell
+./scripts/run-e2e-tests.sh -a cleanup -n yktest
+```
\ No newline at end of file
diff --git a/sidebars.js b/sidebars.js
index 6de952f713..018de07bd0 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -76,6 +76,7 @@ module.exports = {
 'developer_guide/deployment',
 'developer_guide/openshift_development',
 

[jira] [Created] (YUNIKORN-2570) Add test cases to break the current preemption flow

2024-04-18 Thread Manikandan R (Jira)
Manikandan R created YUNIKORN-2570:
--

 Summary: Add test cases to break the current preemption flow
 Key: YUNIKORN-2570
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2570
 Project: Apache YuniKorn
  Issue Type: Test
  Components: core - scheduler
Reporter: Manikandan R
Assignee: Manikandan R
 Fix For: 1.6.0


Add various test cases to break the current preemption flow. These tests would 
fail now. Follow-up jiras should fix the problems in the current preemption 
flow so that these test cases pass.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2570) Add test cases to break the current preemption flow

2024-04-18 Thread Manikandan R (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manikandan R updated YUNIKORN-2570:
---
Parent: YUNIKORN-2493
Issue Type: Sub-task  (was: Test)

> Add test cases to break the current preemption flow
> ---
>
> Key: YUNIKORN-2570
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2570
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Fix For: 1.6.0
>
>
> Add various test cases to break the current preemption flow. These tests would 
> fail now. Follow-up jiras should fix the problems in the current preemption 
> flow so that these test cases pass.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838821#comment-17838821
 ] 

Xi Chen edited comment on YUNIKORN-2521 at 4/19/24 4:35 AM:


[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying 
on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe 
using `/ws/v1/fullstatedump` to see if this really hangs the scheduler.

*Update:*

I switched from the deadlock-triggered exit to a livenessProbe and the 
scheduler keeps working without restarts. But the POTENTIAL DEADLOCK logs are 
still there, so this is likely a false positive.
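
A minimal sketch of such a check, assuming an exec-style probe and the default 
REST port 9080 (illustrative only, not the exact probe configuration used):

{noformat}
# Hypothetical liveness command: succeed only if the full state dump
# endpoint responds within 10 seconds.
curl --silent --fail --max-time 10 http://localhost:9080/ws/v1/fullstatedump > /dev/null
{noformat}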
 


was (Author: jshmchenxi):
[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying 
on the deadlock detection to restart YuniKorn. I can switch to livenessProbe 
using `/ws/v1/fullstatedump` to see if this really hangs the scheduler. WDYT?

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2562) Nil pointer panic in Application.ReplaceAllocation()

2024-04-18 Thread Kiran Arangale (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838864#comment-17838864
 ] 

Kiran Arangale commented on YUNIKORN-2562:
--

[~pbacsko] - Thanks, we appreciate your efforts.

> Nil pointer panic in Application.ReplaceAllocation()
> 
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z  INFOshim.cache.task cache/task.go:542   
> releasing allocations   {"numOfAsksToRelease": 1, 
> "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z  INFOshim.fsmcache/task_state.go:380 
> Task state transition   {"app": "application-spark-abrdrsmo8no2", "task": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": 
> "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", 
> "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z  INFOcore.scheduler.application  
> objects/application.go:616  ask removed successfully from application 
>   {"appID": "application-spark-abrdrsmo8no2", "ask": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z  INFOcore.scheduler.partition
> scheduler/partition.go:1281 replacing placeholder allocation
> {"appID": "application-spark-abrdrsmo8no2", "allocationID": 
> "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600,
>  {0xc007710cf0, 0x24})
>   
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
>  +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?,
>  0xc009786700)
>   
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 
> +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?,
>  {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 
> +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?,
>  0xc0071a3f10?)
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 
> +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 
> +0x1c5
> created by 
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
> goroutine 1
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 
> +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-18 Thread Kiran Arangale (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838510#comment-17838510
 ] 

Kiran Arangale commented on YUNIKORN-2562:
--

Adding more detail: queue capacity gradually degrades even though capacity is 
available. For example, say my max allocation is 1.5 TB; initially it works 
well, but after a few days (2+) utilisation drops to about 60% of the max 
capacity. Despite available resources, the queue gets limited to roughly 55-65% 
of its max, and upon restart YuniKorn keeps crashing for a long time; 
eventually, after 15-20 minutes to an hour, it starts working again. Adding a 
few logs here:

 

41d77", "placeholder": false, "pendingDelta": "map[memory:4723834880 pods:1 
vcore:1000]"}

2024-04-18T06:49:34.944Z INFO core.scheduler.queue objects/queue.go:1408 
allocation found on queue \{"queueName": "root.xxx-spark", "appID": 
"application-spark-4rrgafat101r", "allocation": 
"applicationID=application-spark-4rrgafat101r, 
allocationID=e2d99aaf-6889-4d48-ac70-69e286c41d77-0, 
allocationKey=e2d99aaf-6889-4d48-ac70-69e286c41d77, 
Node=aks-obemuatnew-34197442-vmss08, result=Replaced"}

2024-04-18T06:49:34.944Z INFO core.scheduler.partition 
scheduler/partition.go:867 scheduler replace placeholder processed \{"appID": 
"application-spark-4rrgafat101r", "allocationKey": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77", "allocationID": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77-0", "placeholder released allocationID": 
"9f0e05fa-3d83-4dda-b993-b696af298420-0"}

2024-04-18T06:49:34.945Z INFO shim.cache.application cache/application.go:602 
try to release pod from application \{"appID": 
"application-spark-4rrgafat101r", "allocationID": 
"9f0e05fa-3d83-4dda-b993-b696af298420-0", "terminationType": 
"PLACEHOLDER_REPLACED"}

2024-04-18T06:49:35.017Z INFO core.scheduler scheduler/scheduler.go:101 Found 
outstanding requests that will trigger autoscaling \{"number of requests": 1, 
"total resources": "map[memory:11811160064 pods:1 vcore:2000]"}

2024-04-18T06:49:35.077Z INFO shim.context cache/context.go:1123 task added 
\{"appID": "application-spark-34b5vjdbgeb4", "taskID": 
"5ca32f14-df38-48b3-b420-e17f557dfa33", "taskState": "New"}

2024-04-18T06:49:35.139Z INFO shim.cache.task cache/task.go:542 releasing 
allocations \{"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}

2024-04-18T06:49:35.139Z INFO shim.fsm cache/task_state.go:380 Task state 
transition \{"app": "application-spark-x2bwqi3mjr5q", "task": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "taskAlias": 
"obem-spark/tg-application-spark-x2bwqi3mjr-spark-driver-llg4emobvz", "source": 
"Bound", "destination": "Completed", "event": "CompleteTask"}

2024-04-18T06:49:35.139Z INFO core.scheduler.application 
objects/application.go:616 ask removed successfully from application \{"appID": 
"application-spark-x2bwqi3mjr5q", "ask": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "pendingDelta": "map[]"}

2024-04-18T06:49:35.139Z INFO core.scheduler.partition 
scheduler/partition.go:1281 replacing placeholder allocation \{"appID": 
"application-spark-x2bwqi3mjr5q", "allocationID": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3"}

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]

 

goroutine 129 [running]:

github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc007dcfc00,
 \{0xc00630a390, 0x24})

 
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
 +0x615

github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?,
 0xc007f19b00)

 github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b

github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc004562ba0?,
 \{0xc0098172a0, 0x1, 0x40a0fa?}, \{0x1e0d902, 0x9})

 github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e

github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc003a43f58?,
 0xc003a43f10?)

 github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5

github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000428390)

 github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5

created by 
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
goroutine 1

 github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c

 

 

> Nil pointer in Application.ReplaceAllocation()
> --
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>


[jira] [Created] (YUNIKORN-2569) Helm upgrade behaviour

2024-04-18 Thread Manikandan R (Jira)
Manikandan R created YUNIKORN-2569:
--

 Summary: Helm upgrade behaviour
 Key: YUNIKORN-2569
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2569
 Project: Apache YuniKorn
  Issue Type: Test
Reporter: Manikandan R


Need to test the YuniKorn upgrade behaviour through Helm.

For example, 

1. Create a cluster using kind create.
2. Deploy an old version of YuniKorn (say, 1.2, 1.3 or 1.4) using helm install.
3. Run sanity checks to ensure the deployed version works as expected.
4. Upgrade the YuniKorn version to the latest master (1.6) using helm upgrade.
5. Document the behaviour, especially any issues.

Repeat for each old version (1.2, 1.3 and 1.4); a sketch of these steps follows.
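
A minimal sketch of steps 1-4, assuming the published yunikorn-release Helm 
repo and an illustrative cluster name:

{noformat}
# 1. Create a test cluster (the name is illustrative).
kind create cluster --name yk-upgrade-test

# 2. Deploy an old YuniKorn version, e.g. 1.4.0, from the release repo.
helm repo add yunikorn https://apache.github.io/yunikorn-release
helm repo update
helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace --version 1.4.0

# 3. Sanity check: the scheduler pods should be Running.
kubectl -n yunikorn get pods

# 4. Upgrade in place; a latest-master (1.6) build would need a locally
#    built chart instead of a published --version.
helm upgrade yunikorn yunikorn/yunikorn --namespace yunikorn
{noformat}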



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org