[jira] [Updated] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-06-06 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2637:

Target Version: 1.6.0, 1.5.2  (was: 1.6.0)

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The initialisation code is a two-step process for pods: the first step lists
> all pods and adds them to the system in registerPods(). This returns a list
> of the pods processed.
> The second step happens after event handlers are turned on and nodes have
> been cleaned up etc. During the second step, pods from the first step are
> checked and removed. However, pods that were already in a terminated state in
> step 1 get removed again. Although the step should be idempotent, this is
> unneeded. When iterating over the existing pods, any pod in a terminal state
> should be skipped.
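
A minimal sketch of the skip described in the issue, assuming a simplified pod type and assuming the second step removes pods that have disappeared since step 1. The names `pod`, `isTerminated` and `finalizePods` are illustrative stand-ins here, not the k8shim implementation. In Kubernetes the terminal pod phases are `Succeeded` and `Failed`.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type pod struct {
	name  string
	phase string // "Pending", "Running", "Succeeded" or "Failed"
}

// isTerminated reports whether the pod is in a terminal phase.
func isTerminated(p pod) bool {
	return p.phase == "Succeeded" || p.phase == "Failed"
}

// finalizePods returns the names of the pods from step 1 that still need
// to be removed: pods that no longer exist and were not already terminal.
func finalizePods(existing []pod, current map[string]bool) []string {
	var removed []string
	for _, p := range existing {
		if isTerminated(p) {
			// already handled as terminated in step 1, nothing to do
			continue
		}
		if !current[p.name] {
			removed = append(removed, p.name)
		}
	}
	return removed
}

func main() {
	existing := []pod{{"a", "Running"}, {"b", "Succeeded"}, {"c", "Pending"}}
	current := map[string]bool{"a": true}
	fmt.Println(finalizePods(existing, current))
}
```

Pod "b" is gone from the current list too, but it is skipped because it was already terminal in step 1.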



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2665) Gang app originator pod changes after restart

2024-06-05 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2665:

   Target Version: 1.6.0, 1.5.2
Affects Version/s: (was: 1.5.2)
 Priority: Critical  (was: Major)

The original originator pod will never be released when this happens, leaking
the application on the core side and multiple objects, such as the
application, pod and tasks, on the k8shim side.

Target for a backport to 1.5.2.

> Gang app originator pod changes after restart
> -
>
> Key: YUNIKORN-2665
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2665
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.5.1
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Critical
>
> A gang app chooses the first pod (the one that created the app) as the
> originator pod, which later becomes the real driver pod. While processing a
> gang app, specifically after placeholder creation and during replacement, a
> restart can lead to the incorrect behaviour described below:
> During restore, there is no guarantee on the ordering of pods coming from the
> K8s lister, especially when all the pods were created with the same timestamp.
> K8s uses a seconds-based timestamp, which means all pods created within the
> same second have the same timestamp. In this situation, YK designates
> whichever pod comes first from the lister as the originator pod. So any
> placeholder could become the originator pod and the actual originator pod is
> lost. This change could cause rippling effects leading to weird behaviour and
> needs to be fixed.






[jira] [Updated] (YUNIKORN-2652) Expand getApplication() endpoint handler to optionally return resource usage

2024-05-31 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2652:

Component/s: core - common
 (was: scheduler-interface)

> Expand getApplication() endpoint handler to optionally return resource usage
> 
>
> Key: YUNIKORN-2652
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2652
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Rich Scott
>Priority: Major
>
> Some users would like to be able to see resource usage (preempted,
> placeholder resource, etc) for applications that have been completed. The
> `getApplication()` endpoint handler should be enhanced to take an optional
> parameter specifying that the user would like details about resources
> included in the response. A new `ApplicationXXXDAOInfo` object that is a
> slight superset of `ApplicationDAOInfo` should be introduced and can be used
> in the response.






[jira] [Commented] (YUNIKORN-2581) Expose running placement rules in REST

2024-05-31 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850945#comment-17850945
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2581:
-

Documentation PR opened

> Expose running placement rules in REST
> --
>
> Key: YUNIKORN-2581
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> Since introducing the always-on use of placement rules and the recovery rule,
> the queue config does not correctly show the running rules.
> Also, if a config update has been rejected, for any reason, the rules would
> not be correct.
> Exposing the configured rules from the placement manager works around all
> these issues.






[jira] [Created] (YUNIKORN-2655) Cleanup REST API documentation

2024-05-31 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2655:
---

 Summary: Cleanup REST API documentation
 Key: YUNIKORN-2655
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2655
 Project: Apache YuniKorn
  Issue Type: Task
  Components: documentation
Reporter: Wilfred Spiegelenburg


The REST API documentation is not up to date with the current behaviour as it 
does not show any 400 or 404 errors returned by a number of API calls.

The error response only shows a 500 code with the same message for each call.

We should move to a simple list for each call showing the applicable errors 
like this:
{code:java}
### Error responses

**Code** : `400 Bad Request` (URL query is invalid, missing partition name)

**Code** : `404 Not Found` (Partition not found)

**Code** : `500 Internal Server Error` {code}
Remove the error examples as they do not add any required detail.






[jira] [Created] (YUNIKORN-2654) Remove unused code in k8shim context

2024-05-30 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2654:
---

 Summary: Remove unused code in k8shim context
 Key: YUNIKORN-2654
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2654
 Project: Apache YuniKorn
  Issue Type: Task
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


The NotifyApplicationComplete and NotifyApplicationFail functions are not
called by anything and are unused code.

The K8shim does not trigger the application completion or failure. This is 
triggered by the core when the application no longer has any activity 
registered.






[jira] [Created] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance

2024-05-30 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2653:
---

 Summary: Gang scheduling K8s event formatting compliance
 Key: YUNIKORN-2653
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The K8s events provide definitions and rules around the content of the fields 
within the event. Adjust the content of gang scheduling related events to 
comply with the rules.
Focussed on the reason and action fields only.
 * 'reason' is the reason this event is generated. 'reason' should be short
and unique; it should be in UpperCamelCase format (starting with a capital
letter).
 * 'action' explains what happened with the regarding object, i.e. what action
the ReportingController took in the object's name; it should be in
UpperCamelCase format (starting with a capital letter).

No space or long text.
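
A minimal sketch of a check for these rules, assuming "UpperCamelCase" means a single word of letters and digits starting with a capital letter. The helper `validEventField` and the sample values are hypothetical and not part of the shim.

```go
package main

import (
	"fmt"
	"regexp"
)

// upperCamel matches a single word that starts with a capital letter and
// contains only letters and digits: no spaces, no punctuation.
var upperCamel = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// validEventField is a hypothetical helper that checks a 'reason' or
// 'action' value against the format rule.
func validEventField(s string) bool {
	return upperCamel.MatchString(s)
}

func main() {
	// sample values, made up for this sketch
	for _, reason := range []string{"GangScheduling", "gang scheduling", "TaskGroupNotFound"} {
		fmt.Printf("%q valid=%v\n", reason, validEventField(reason))
	}
}
```

A check like this could run in a unit test over all gang scheduling event constants, so a value with a space or a long sentence fails early.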






[jira] [Commented] (YUNIKORN-182) fix lint issues

2024-05-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850896#comment-17850896
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-182:


File a new Jira for this. It needs to be fixed in all the http servers we
create in our code. Those are spread over multiple repositories and all need
to be checked:
{code:java}
pkg/cmd/admissioncontroller/main.go:143:15: G112: Potential Slowloris Attack 
because ReadHeaderTimeout is not configured in the http.Server (gosec) {code}
This one should get an ignore from the lint side, we do not need crypto-quality
randomness here:
{code:java}
test/e2e/framework/helpers/common/utils.go:105:18: G404: Use of weak random 
number generator (math/rand instead of crypto/rand) (gosec)
b[i] = letters[rand.Intn(len(letters))]{code}
All the ineffective assigns and shadowing remarks can and should be fixed.

Formatting issues can and should be fixed.

The function length ones are dubious and we probably should just add the
{{//nolint:funlen}} remark on them, especially since they are almost all test
functions.
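
For the G112 remark, a minimal sketch of the kind of change needed in each place that creates a server, using the standard `net/http` package. The helper name `newServer`, the address and the 10 second value are assumptions for illustration, not values taken from the YuniKorn code.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer builds an http.Server with ReadHeaderTimeout set, which is what
// gosec G112 asks for. The 10 second value is an assumption for illustration.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 10 * time.Second,
	}
}

func main() {
	srv := newServer(":8080", http.NewServeMux())
	fmt.Println("read header timeout:", srv.ReadHeaderTimeout)
}
```

With the timeout set, a client that never finishes sending its request headers gets disconnected instead of holding the connection open.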

> fix lint issues
> ---
>
> Key: YUNIKORN-182
> URL: https://issues.apache.org/jira/browse/YUNIKORN-182
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: build
>Reporter: Wilfred Spiegelenburg
>Assignee: Yun Sun
>Priority: Minor
>  Labels: pull-request-available
>
> When we added the lint test most major issues were fixed. There are still a
> lot of issues, especially in tests, that need to be fixed.
> This is a container Jira to track that work on both the k8shim and the core
> repos.
> Work should be split into multiple parts (per linter?)






[jira] [Commented] (YUNIKORN-2581) Expose running placement rules in REST

2024-05-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850889#comment-17850889
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2581:
-

code change committed, working on documentation before closing







[jira] [Commented] (YUNIKORN-2645) parent queue exceeds maximum resource

2024-05-29 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850562#comment-17850562
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2645:
-

The side effect of that broken node is that every single pod that we allocate
will select that broken node. Based on the node sorting, that node stays as the
first node in the list to try. Every single pod gets placed but then fails to
start. The node usage does not change and thus the node does not get pushed
back in the list of available nodes. Because of that the scheduler does not
make any real progress.

I would consider that a hung scheduler but there is nothing that I think we can
do about that without some major changes.

A possible solution would be, for instance, to rate limit the number of pods
we put on a node: never schedule more than 10 pods per second on a node,
including or ignoring failures, and when that limit is hit we skip the node.
That would make sure we try a couple of times and then try the next node. It
could cause a slight delay when a cluster is almost full. It will also cause
some delay in an auto scaling cluster as the scheduler skips a node while the
auto scaler does not...

> parent queue exceeds maximum resource
> -
>
> Key: YUNIKORN-2645
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.5.1
>Reporter: Dmitry
>Priority: Major
> Attachments: yunikorn-logs.txt.gz
>
>
> We had a node broken in the cluster - kubernetes was creating pods which were 
> immediately failing with "OutOfGPU" state. The node had 1000+ pods on it.
> The scheduler panicked with the log attached and was not scheduling any other 
> pods.
> The config:
> {code:yaml}
> apiVersion: v1
> data:
>   admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: fixed
>             value: root.scavenging.osg
>             create: true
>             filter:
>               type: allow
>               users:
>               - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
>               - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
>               - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>                name: tag
>                value: namespace.parentqueue
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>                name: fixed
>                value: general
>         nodesortpolicy:
>           type: fair
>           resourceweights:
>             vcore: 1.0
>             memory: 1.0
>             nvidia.com/gpu: 4.0
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fair
>             queues:
>             - name: system
>               parent: true
>               properties:
>                 preemption.policy: disabled
>             - name: general
>               parent: true
>               childtemplate:
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 100
>                     memory: 1Ti
>                     nvidia.com/gpu: 8
>                   max:
>                     vcore: 4000
>                     memory: 15Ti
>                     nvidia.com/gpu: 200
>             - name: scavenging
>               parent: true
>               childtemplate:
>                 resources:
>                   guaranteed:
>                     vcore: 1
>                     memory: 1G
>                     nvidia.com/gpu: 1
>                 properties:
>                   priority.offset: "-10"
>             - name: interactive
>               parent: true
>               childtemplate:
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 10T
>                     nvidia.com/gpu: 48
>                     nvidia.com/a100: 4
>                 properties:
>                   priority.offset: "10"
>                   preemption.policy: disabled
>             - name: clemson
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 256
>                   memory: 2T
>                   nvidia.com/gpu: 24
>

[jira] [Created] (YUNIKORN-2648) Add deadlock detection config to the configmap

2024-05-29 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2648:
---

 Summary: Add deadlock detection config to the configmap
 Key: YUNIKORN-2648
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2648
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common
Reporter: Wilfred Spiegelenburg


The current deadlock detection is configured using environment variables. That 
requires a change of the image and a restart of the scheduler to take effect 
and is not easy to maintain.

We should be using yunikorn-defaults config map for the settings. We want a 
default set, turned off, for production use cases. However making the configs 
loadable from the config map makes turning it on easier.

Update the configmap and restart the scheduler to turn the detection on or off.






[jira] [Created] (YUNIKORN-2647) Flaky test TestUpdateNodeCapacity

2024-05-29 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2647:
---

 Summary: Flaky test TestUpdateNodeCapacity
 Key: YUNIKORN-2647
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2647
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: test - unit
Reporter: Wilfred Spiegelenburg


As we saw in YUNIKORN-2573, the single node update test might fail:
{code:java}
--- FAIL: TestUpdateNodeCapacity (0.03s)
    operation_test.go:446: Expected partition resource map[memory:1 
vcore:2], doesn't match with actual partition resource map[memory:1 
vcore:2]{code}
We calculate the delta resources when updating node capacity. With that delta
we update the resources in the partition.

The test fails with the following order, same as for multiple nodes:

node.SetCapacity() -> waitForAvailableNodeResource() ->  
partitionInfo.GetTotalPartitionResource()  -> 
partition.updatePartitionResource()






[jira] [Commented] (YUNIKORN-2646) Deadlock detected during preemption

2024-05-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850241#comment-17850241
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2646:
-

I have found a way to turn this specific detection off. I will create a PR for
it a little later. It would be good to backport it to 1.5.2. We still need to
think about the default we would use for this.

> Deadlock detected during preemption
> ---
>
> Key: YUNIKORN-2646
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2646
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Dmitry
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: yunikorn-logs-lock.txt.gz
>
>
> Hitting deadlocks in 1.5.1
> The log is attached






[jira] [Commented] (YUNIKORN-2646) Deadlock detected during preemption

2024-05-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850240#comment-17850240
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2646:
-

For the analysis of the stack trace: [~pbacsko] is correct, we have seen this
before and it is a false positive.

This lock order check points towards something like the following case:
 * app A -> Allocate, trigger preemption -> check if app B can be a victim
 * app B -> Allocate, trigger preemption -> check if app A can be a victim

Two points:
 # scheduling cycle is single threaded.
 # the application triggering preemption is never a victim

So how does that relate to the stack trace: the
{{PartitionContext.tryAllocate}} calls shown in the logs are never running at
the same time. Scheduling also does not run multiple goroutines. The last
point is that when leaving {{Application.tryAllocate}} for the next cycle, all
locks that were held have been released. The next cycle could look at the same
application again or might use a completely different one.

When building the victim list via {{Queue.FindEligiblePreemptionVictims}}
and the recursive version of that call, the queue from the application that
triggered the preemption is filtered out. The lock held in
{{Application.tryAllocate}} is on an application that cannot be selected later
as a victim. If that did occur, scheduling would immediately stop at that
point. We would never see a second instance of this stack trace in the deadlock
logging. The lock taken on the application for scheduling is a write lock.
Getting a read lock on the same application would block.

We need to investigate how we can exclude this from the potential deadlock
detection. The only option I can find at the moment is setting
{{Opts.DisableLockOrderDetection}} for the detection code if you want to run
this with preemption turned on.







[jira] [Commented] (YUNIKORN-2646) Deadlock detected during preemption

2024-05-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850227#comment-17850227
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2646:
-

A flag in the queue config at the partition level: 
https://yunikorn.apache.org/docs/user_guide/queue_config#partitions




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2645) parent queue exceeds maximum resource

2024-05-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850226#comment-17850226
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2645:
-

Thank you [~dimm] for the logs, that helped.

The scheduler did not panic as it would have shown a restart of the scheduler. 
It did log a message that should get your attention. If this happens your 
cluster and the scheduler are in a really bad state. We can only detect this 
and revert the changes but not fix it from the scheduler side. We keep on 
scheduling.

A panic would be caused by the logger and is expected when the logger runs in
development mode. This is all linked to the DPANIC level. We use
[DPANIC|https://pkg.go.dev/go.uber.org/zap#pkg-constants] in a couple of
places. That level logs the error and then causes a panic if running in
development mode. If not running in development mode you just see the message.
The logger should never be running in development mode unless running as part
of unit tests etc.

If you see these messages with a DPANIC level in production you have a serious 
issue.
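
The behaviour of that level can be mimicked with the standard library only. The sketch below is not zap code: `dpanic` and `panics` are hypothetical helpers and the message text is made up.

```go
package main

import (
	"fmt"
	"log"
)

// dpanic mimics the behaviour described above with the standard library
// only: always log, panic only in development mode. It is not zap code.
func dpanic(development bool, msg string) {
	log.Printf("DPANIC: %s", msg)
	if development {
		panic(msg)
	}
}

// panics reports whether calling f panics.
func panics(f func()) (caught bool) {
	defer func() {
		if recover() != nil {
			caught = true
		}
	}()
	f()
	return false
}

func main() {
	fmt.Println("production panics:", panics(func() { dpanic(false, "example message") }))
	fmt.Println("development panics:", panics(func() { dpanic(true, "example message") }))
}
```

In both modes the message is logged, so a production scheduler shows the DPANIC line and keeps running.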

Some background on the {{OutOfCpu}} message from the node: there has been a
change in the K8s 1.22 kubelet to fix some resource issues. That introduced an
increased possibility of a race condition in the kubelet when scheduling short
lived pods or pods that did not pass the node admission checks. A mitigation
for that race condition was added in 1.22.4 but there are still complaints
about it
[regularly happening|https://github.com/kubernetes/kubernetes/issues/115325]
even in the latest K8s versions with the default K8s scheduler. High pod churn,
node scaling and deployment scaling all seem to be related triggers. The
sig_node team has said that it is as good as it will get without causing the
original issue to come back. They assessed that the original issue was far
worse than this one.


[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-27 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849878#comment-17849878
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2629:
-

I have walked through the draft PR. As a side effect it fixes the recording of
an event on a node that the core has rejected from being added. That is a good
change to have.

What I do not understand yet is why we need to have a lock on the context. 
Registering a node does not make any changes in the context. The only two 
things that make changes to the context are the applications and config map 
changes. It feels like the context lock is used to synchronise changes in the 
schedulerCache: i.e. make sure that subsequent calls from the context into the 
cache see the same scheduler cache. If that really is the reason we should make 
sure that it is handled in the cache.

Example: in the context the ForgetPod method calls GetPod on the cache and then 
ForgetPod on the cache. That should be one single call to ForgetPod in the 
cache removing the lookup, and the need for the context lock.

The fact that we now unlock the context while waiting for a response to come 
back makes me wonder if we need the context lock at all during that call stack.
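
A minimal sketch of the single call suggested in the ForgetPod example, assuming a simplified cache keyed by pod UID. The type `podCache` and its `forgetPod` method are stand-ins for illustration and do not mirror the real schedulerCache API.

```go
package main

import (
	"fmt"
	"sync"
)

// podCache is a hypothetical, simplified stand-in for the scheduler cache.
type podCache struct {
	mu   sync.Mutex
	pods map[string]string // pod UID -> pod name
}

// forgetPod looks up and removes the pod under the cache's own lock, so the
// caller needs neither a separate lookup call nor an outer lock to keep the
// two steps consistent. It reports whether the pod was present.
func (c *podCache) forgetPod(uid string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.pods[uid]; !ok {
		return false
	}
	delete(c.pods, uid)
	return true
}

func main() {
	c := &podCache{pods: map[string]string{"uid-1": "pod-1"}}
	fmt.Println(c.forgetPod("uid-1"), c.forgetPod("uid-1"))
}
```

Because lookup and removal happen under one lock inside the cache, two callers can never both see the pod between the steps, which is the consistency the context lock appears to be providing now.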

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
> dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>     nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>     if !ok {
>         return
>     }
>     [...] removed for clarity
>     wg.Done()
> })
> defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
> if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>     Nodes: nodesToRegister,
>     RmID:  schedulerconf.GetSchedulerConf().ClusterID,
> }); err != nil {
>     log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>     return nil, err
> }
> // wait for all responses to accumulate
> wg.Wait()  <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Commented] (YUNIKORN-2640) Consider removing config from Clients

2024-05-27 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849857#comment-17849857
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2640:
-

After the changes from YUNIKORN-2630 there is only one place left and we should 
really clean that up. Setting target for 1.6.0

> Consider removing config from Clients
> 
>
> Key: YUNIKORN-2640
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2640
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Chia-Ping Tsai
>Assignee: Chenchen Lai
>Priority: Minor
>
> The config (`conf.SchedulerConf`) [0] refers to a global singleton object
> [1][2]. Also, in the code base `clients#GetConf()` is used 3 times [3] and
> `conf.GetSchedulerConf()` is used 61 times [4].
> It seems to me `clients#conf` should be removed to avoid confusion.
> [0]
> https://github.com/apache/yunikorn-k8shim/blob/master/pkg/client/clients.go#L42C8-L42C26
> [1]
> https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/plugin/scheduler_plugin.go#L291
> [2]
> https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/cmd/shim/main.go#L53
> [3]
> https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+GetConf%28%29&type=code
> [4]
> https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+conf.GetSchedulerConf%28%29&type=code






[jira] [Updated] (YUNIKORN-2640) Conside removing config from Clients

2024-05-27 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2640:

Target Version: 1.6.0

> Conside removing config from Clients
> 
>
> Key: YUNIKORN-2640
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2640
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Chia-Ping Tsai
>Assignee: Chenchen Lai
>Priority: Minor
>
> The config (`conf.SchedulerConf`) [0] refers to a global singleton object 
> [1][2]. Also, in the code base `clients#GetConf()` is used 3 times [3] and 
> `conf.GetSchedulerConf()` is used 61 times [4]
> It seems to me `clients#conf` should be removed to avoid confusion.
> [0] 
> https://github.com/apache/yunikorn-k8shim/blob/master/pkg/client/clients.go#L42C8-L42C26
> [1] 
> https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/plugin/scheduler_plugin.go#L291
> [2] 
> https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/cmd/shim/main.go#L53
> [3] 
> https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+GetConf%28%29=code
> [4] 
> https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+conf.GetSchedulerConf%28%29=code






[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848863#comment-17848863
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2637:
-

The case we are solving here is the correct removal of a pod that was 
registered and then stopped. In this case, if the pod was assigned to a node it 
gets removed, including from the core. If it was not assigned to a node the 
request gets removed. I think both core and k8shim are affected by this after 
looking at the details in YUNIKORN-2526.

So I am not sure if that is the root cause of the difference...

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The initialisation code is a two step process for pods: first list all pods 
> and add them to the system in registerPods(). This returns a list of pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up etc. During the second step pods from the first step are 
> checked and removed. However pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent this is 
> unneeded. When iterating over the existing pods any pod in a terminal state 
> should be skipped.






[jira] [Updated] (YUNIKORN-2631) Support canonical labels for queue/applicationId in Admission Controller

2024-05-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2631:

Labels: pull-request-available release-notes  (was: pull-request-available)

> Support canonical labels for queue/applicationId in Admission Controller
> 
>
> Key: YUNIKORN-2631
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2631
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Yu-Lin Chen
>Assignee: Yu-Lin Chen
>Priority: Major
>  Labels: pull-request-available, release-notes
>
> Admission controller adds applicationID and label to Pod if they are not 
> already set in the Pod.
> According to the new policy defined in YUNIKORN-1351, the Admission 
> Controller will change to patch the canonical label/annotation in future 
> releases.
>  * yunikorn.apache.org/app-id (Canonical Label)
>  * yunikorn.apache.org/queue  (Canonical Label)
> To avoid an upgrade problem where the admission controller gets started 
> first, AM needs to generate both canonical/non-canonical labels in 1.6.0. 
> (This ensures that the 1.5.0 scheduler could understand labels generated in 
> the 1.6.0 admission controller)  In 1.7.0, we can switch to generating only 
> the canonical label in AM.
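
A minimal sketch of the dual label generation described above, not the actual 
admission controller code. The canonical keys come from the description. The 
non-canonical keys ({{applicationId}} and {{queue}}) and the helper names 
{{firstSet}} and {{patchLabels}} are assumptions made for illustration only.

```go
package main

import "fmt"

const (
	canonicalAppID = "yunikorn.apache.org/app-id"
	canonicalQueue = "yunikorn.apache.org/queue"
	legacyAppID    = "applicationId" // assumed non-canonical key
	legacyQueue    = "queue"         // assumed non-canonical key
)

// firstSet returns the value of the first key present in labels, or the
// fallback if none of the keys is set.
func firstSet(labels map[string]string, fallback string, keys ...string) string {
	for _, key := range keys {
		if value, ok := labels[key]; ok {
			return value
		}
	}
	return fallback
}

// patchLabels returns a copy of labels that carries both the canonical and
// the non-canonical form. Keys that are already set are never overwritten.
func patchLabels(labels map[string]string, generatedAppID, defaultQueue string) map[string]string {
	appID := firstSet(labels, generatedAppID, canonicalAppID, legacyAppID)
	queue := firstSet(labels, defaultQueue, canonicalQueue, legacyQueue)
	patched := make(map[string]string, len(labels)+4)
	for key, value := range labels {
		patched[key] = value
	}
	for key, value := range map[string]string{
		canonicalAppID: appID, legacyAppID: appID,
		canonicalQueue: queue, legacyQueue: queue,
	} {
		if _, ok := patched[key]; !ok {
			patched[key] = value
		}
	}
	return patched
}

func main() {
	patched := patchLabels(map[string]string{"queue": "root.batch"}, "app-0001", "root.default")
	fmt.Println(patched[canonicalQueue], patched[canonicalAppID], patched[legacyAppID])
}
```

With both forms present an older scheduler can still read the labels it 
knows. Dropping the non-canonical entries later would be a change to the 
final loop only.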






[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848809#comment-17848809
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2629:
-

Saw this in a test run locally, adding the deadlock trace that was printed for 
reference.

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-22 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2629:

Attachment: updateNode_deadlock_trace.txt

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-05-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848807#comment-17848807
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2521:
-

Collect the data and open a new Jira please. This Jira has been included in a 
release and will not be re-opened or worked on.

The logs for the scheduler should show the details around the possible deadlock.

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Peter Bacsko
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 






[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848435#comment-17848435
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2637:
-

The list of pods we iterate over in {{finalizePods()}} is the list of 
*original* pods that were processed and returned by the {{registerPods()}} call.

In {{registerPods()}} we skip pods in a terminal state. Those pods do not get 
added to the context. {{AddPod()}} is never called for them.

In {{finalizePods()}} we list the pods again and build a map. We then iterate 
over all pods returned from {{registerPods()}} and check whether they are in 
the new list we just pulled from K8s. If a pod from the registered list does 
not show up in the newly pulled list we remove the pod from the context.

That last step, removing from the context, only makes sense if the pod was added 
to the context to start with. Pods that were in a terminal state during the 
{{registerPods()}} processing are not added and thus do not need to be removed 
as they cannot exist in the context. It does not matter what state they are in 
in the newly pulled list. A pod in a terminal state cannot return to a running 
state ever.
{quote}I don't think this is safe. The pod may have moved into a terminal state 
between registerPods() and finalizePods(). In that case, we may lose the 
transition and end up with a phantom pod still in the system.
{quote}
This is not the case that needs to be optimised, as the removal does not happen 
at all with the current code. That is a separate bug in the code that I had not 
even noticed before.

The newly pulled list of pods is not status checked. A pod that was running 
during {{registerPods()}}, and now shows as terminated in {{finalizePods()}}, 
shows up in both the map and the iteration and will thus not be removed. Just 
the existence check of the register list against the finalise list is not 
enough.

For that case to work we either need:
 # filtering of the pods that we put in the finalised map (i.e. skip terminated 
pods)
or
 # a comparison of the state of the pods during the iteration

Option 1 is simplest as it just gives us a map of still running pods and 
anything that does not exist should be removed.
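
As a rough sketch of option 1, not the actual shim code: the {{Pod}} type, 
the phase strings and the helpers {{isTerminated}} and {{podsToRemove}} are 
hypothetical stand-ins for the real Kubernetes objects and context calls.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for the Kubernetes pod object.
type Pod struct {
	UID   string
	Phase string // "Pending", "Running", "Succeeded" or "Failed"
}

// isTerminated reports whether the pod is in a terminal phase.
func isTerminated(p Pod) bool {
	return p.Phase == "Succeeded" || p.Phase == "Failed"
}

// podsToRemove returns the UIDs of the registered pods that should be removed
// from the context, given the list of pods pulled again during finalisation.
func podsToRemove(registered, current []Pod) []string {
	// Option 1: only pods that are still active end up in the map.
	active := make(map[string]bool, len(current))
	for _, p := range current {
		if !isTerminated(p) {
			active[p.UID] = true
		}
	}
	var remove []string
	for _, p := range registered {
		// Terminated during registration: never added, so nothing to remove.
		if isTerminated(p) {
			continue
		}
		// Either deleted or terminated since registration.
		if !active[p.UID] {
			remove = append(remove, p.UID)
		}
	}
	return remove
}

func main() {
	registered := []Pod{{"a", "Running"}, {"b", "Running"}, {"c", "Succeeded"}, {"d", "Running"}}
	current := []Pod{{"a", "Running"}, {"b", "Failed"}, {"c", "Succeeded"}}
	fmt.Println(podsToRemove(registered, current)) // prints [b d]
}
```

Pod {{b}} terminated between the two steps and pod {{d}} was deleted, so 
both are removed. Pod {{c}} was already terminal during registration and is 
skipped.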

I'll put up a PR that fixes both issues.

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Priority: Major
>
> The initialisation code is a two step process for pods: first list all pods 
> and add them to the system in registerPods(). This returns a list of pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up etc. During the second step pods from the first step are 
> checked and removed. However pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent this is 
> unneeded. When iterating over the existing pods any pod in a terminal state 
> should be skipped.






[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848056#comment-17848056
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2637:
-

Comments in {{finalizePods()}} should be fixed at the same time, as they 
currently refer to nodes, which is incorrect.

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Priority: Major
>
> The initialisation code is a two step process for pods: first list all pods 
> and add them to the system in registerPods(). This returns a list of pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up etc. During the second step pods from the first step are 
> checked and removed. However pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent this is 
> unneeded. When iterating over the existing pods any pod in a terminal state 
> should be skipped.






[jira] [Created] (YUNIKORN-2638) Simplify finalizeNodes and finalizePods

2024-05-21 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2638:
---

 Summary: Simplify finalizeNodes and finalizePods
 Key: YUNIKORN-2638
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2638
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


In finalizeNodes and finalizePods a map is created to store the newly retrieved 
pods and nodes. The map is only used as a reference and the pod and node 
objects themselves are not used.

Instead of storing the objects, the maps could store a boolean value. This 
also simplifies the later check for the existence of the node or pod to just a 
single map lookup. 

We should also set the initial size of the map to the length of the node or pod 
list retrieved, to prevent any re-allocation while the map is filled.
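
A small sketch of the idea, assuming the lookup is keyed on a name string. 
The helper {{existingNames}} is made up for this example and is not taken 
from the shim code.

```go
package main

import "fmt"

// existingNames builds a lookup set from the names that were just retrieved.
func existingNames(names []string) map[string]bool {
	// Sized up front so that filling the map does not trigger a re-allocation.
	found := make(map[string]bool, len(names))
	for _, name := range names {
		found[name] = true
	}
	return found
}

func main() {
	found := existingNames([]string{"node-1", "node-2"})
	// A missing key returns the zero value false, so the existence check is a
	// single lookup without a separate "ok" test.
	fmt.Println(found["node-1"], found["node-3"]) // prints true false
}
```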






[jira] [Updated] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-20 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2637:

Target Version: 1.6.0

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Priority: Major
>
> The initialisation code is a two step process for pods: first list all pods 
> and add them to the system in registerPods(). This returns a list of pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up etc. During the second step pods from the first step are 
> checked and removed. However pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent this is 
> unneeded. When iterating over the existing pods any pod in a terminal state 
> should be skipped.






[jira] [Created] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-20 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2637:
---

 Summary: finalizePods should ignore pods like registerPods does
 Key: YUNIKORN-2637
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


The initialisation code is a two step process for pods: first list all pods and 
add them to the system in registerPods(). This returns a list of pods processed.

The second step happens after event handlers are turned on and nodes have been 
cleaned up etc. During the second step pods from the first step are checked and 
removed. However pods that were already in a terminated state in step 1 get 
removed again. Although the step should be idempotent this is unneeded. When 
iterating over the existing pods any pod in a terminal state should be skipped.






[jira] [Updated] (YUNIKORN-2630) Release context lock in shim when processing config in the core

2024-05-16 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2630:

Target Version: 1.6.0, 1.5.2

> Release context lock in shim when processing config in the core
> ---
>
> Key: YUNIKORN-2630
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2630
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
>  Labels: pull-request-available
>
> When a change comes in for the configmaps we process the change under a 
> context lock as we need to merge the two configmaps.
> We keep this lock even if all the work is done in the shim and processing has 
> been transferred to the core. This is unneeded as the core has its own 
> locking and serialisation of the changes.






[jira] [Created] (YUNIKORN-2630) Release context lock in shim when processing config in the core

2024-05-16 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2630:
---

 Summary: Release context lock in shim when processing config in 
the core
 Key: YUNIKORN-2630
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2630
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When a change comes in for the configmaps we process the change under a 
context lock as we need to merge the two configmaps.

We keep this lock even if all the work is done in the shim and processing has 
been transferred to the core. This is unneeded as the core has its own locking 
and serialisation of the changes.
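
A minimal sketch of the narrower lock scope, under the assumption that the 
merge is the only part that touches shim state. The {{shimContext}} type, its 
fields and the {{sendToCore}} callback are hypothetical and only illustrate 
the pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// shimContext is a hypothetical stand-in for the shim context.
type shimContext struct {
	lock       sync.RWMutex
	configMaps [2]map[string]string // for example defaults and overrides
}

// mergeLocked flattens the two maps, later entries win.
// The caller must hold the lock.
func (c *shimContext) mergeLocked() map[string]string {
	merged := make(map[string]string)
	for _, configMap := range c.configMaps {
		for key, value := range configMap {
			merged[key] = value
		}
	}
	return merged
}

// updateConfig merges under the lock, releases it, and only then hands the
// merged result to the core.
func (c *shimContext) updateConfig(index int, data map[string]string,
	sendToCore func(map[string]string) error) error {
	c.lock.Lock()
	c.configMaps[index] = data
	merged := c.mergeLocked()
	c.lock.Unlock()
	// The core does its own locking and serialisation, no context lock needed.
	return sendToCore(merged)
}

func main() {
	c := &shimContext{}
	err := c.updateConfig(0, map[string]string{"log.level": "INFO"},
		func(merged map[string]string) error {
			// The context lock is free while the core does its work.
			free := c.lock.TryLock()
			if free {
				c.lock.Unlock()
			}
			fmt.Println("entries:", len(merged), "lock free:", free)
			return nil
		})
	fmt.Println("error:", err)
}
```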






[jira] [Resolved] (YUNIKORN-2628) fix release announcement links

2024-05-16 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2628.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

links are fixed after removing the {{..}} from the path

> fix release announcement links
> --
>
> Key: YUNIKORN-2628
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2628
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> In YUNIKORN-2595 a regression snuck in breaking the links to the release 
> announcements.
> Need to reverse that path change for the release announcements.






[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-16 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847123#comment-17847123
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2629:
-

I think we need to look at the context lock in the k8shim in general.

The context lock is held while we do non-context work. There is no need to 
hold the lock if all we do is wait for a response that might or might not 
trigger post processing.
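
To illustrate the point, a simplified sketch that releases the lock before 
waiting. This is not the shim code: {{nodeContext}}, {{event}} and 
{{eventLoop}} are made-up names that mimic the shape of the problem in the 
description below.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeContext is a hypothetical stand-in for the shim context.
type nodeContext struct {
	lock  sync.RWMutex
	nodes map[string]bool
}

// event carries the node name and a callback to signal that it was handled.
type event struct {
	node string
	done func()
}

// getNode is the kind of call an event handler makes: it needs the read lock.
func (c *nodeContext) getNode(name string) bool {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.nodes[name]
}

// eventLoop handles events one by one, like the dispatcher goroutine.
func eventLoop(c *nodeContext, events <-chan event) {
	for ev := range events {
		// Would block forever if addNode still held the write lock.
		_ = c.getNode(ev.node)
		ev.done()
	}
}

// addNode updates the context under the lock, releases it, and only then
// waits for the response.
func (c *nodeContext) addNode(name string, events chan<- event) {
	c.lock.Lock()
	c.nodes[name] = true
	c.lock.Unlock() // released before any waiting starts

	var wg sync.WaitGroup
	wg.Add(1)
	events <- event{node: name, done: wg.Done}
	// wait for the response without holding the context lock
	wg.Wait()
}

func main() {
	c := &nodeContext{nodes: make(map[string]bool)}
	events := make(chan event)
	go eventLoop(c, events)
	c.addNode("node-1", events)
	fmt.Println("registered:", c.getNode("node-1")) // prints registered: true
}
```

Moving the {{Unlock()}} call to after {{wg.Wait()}} in this sketch reproduces 
the hang: the event loop blocks on the read lock and never calls {{done()}}.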

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Resolved] (YUNIKORN-2627) Add K8s 1.30 to the e2e matrix

2024-05-16 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2627.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Upgraded kind to version 0.23 and added 1.30 as a new version to test with

> Add K8s 1.30 to the e2e matrix
> --
>
> Key: YUNIKORN-2627
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2627
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Wilfred Spiegelenburg
>Assignee: Tseng Hsi-Huang
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
>
> k8s 1.30 support in kind is now available as part of the [0.23 
> release|https://github.com/kubernetes-sigs/kind/releases/tag/v0.23.0]
> Need to add 1.30 to the matrix for the next release






[jira] [Commented] (YUNIKORN-2626) Add flag to helm chart to disable web container

2024-05-16 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846827#comment-17846827
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2626:
-

I have no strong feelings either way. The default should be to have the web 
container on, but that is it.

Create a PR to make it possible: charts are 
[here|https://github.com/wilfred-s/yunikorn-release/tree/master/helm-charts/yunikorn]

> Add flag to helm chart to disable web container
> ---
>
> Key: YUNIKORN-2626
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2626
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: deployment
>Reporter: Michael
>Priority: Major
>
> For our use case we only really need the admission controller and scheduler. 
> The helm chart does currently not provide a way to disable deploying the web 
> container and it would be great if that is possible.
> Is there any reason not to disable the web container?






[jira] [Updated] (YUNIKORN-2628) fix release announcement links

2024-05-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2628:

Description: 
In YUNIKORN-2595 a regression snuck in breaking the links to the release 
announcements.

Need to reverse that path change for the release announcements.

  was:
In YUNIKORN-2596 a regression snuck in breaking the links to the release 
announcements.

Need to reverse that path change for the release announcements.


> fix release announcement links
> --
>
> Key: YUNIKORN-2628
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2628
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
>
> In YUNIKORN-2595 a regression snuck in breaking the links to the release 
> announcements.
> Need to reverse that path change for the release announcements.






[jira] [Created] (YUNIKORN-2628) fix release announcement links

2024-05-14 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2628:
---

 Summary: fix release announcement links
 Key: YUNIKORN-2628
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2628
 Project: Apache YuniKorn
  Issue Type: Task
  Components: website
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


In YUNIKORN-2596 a regression snuck in breaking the links to the release 
announcements.

Need to reverse that path change for the release announcements.






[jira] [Commented] (YUNIKORN-2627) Add K8s 1.30 to the e2e matrix

2024-05-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846470#comment-17846470
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2627:
-

We should also update from kind 0.20 to kind 0.23 as part of this change.

https://github.com/apache/yunikorn-k8shim/blob/master/Makefile#L157-L159

> Add K8s 1.30 to the e2e matrix
> --
>
> Key: YUNIKORN-2627
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2627
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
>
> k8s 1.30 support in kind is now available as part of the [0.23 
> release|https://github.com/kubernetes-sigs/kind/releases/tag/v0.23.0]
> Need to add 1.30 to the matrix for the next release






[jira] [Created] (YUNIKORN-2627) Add K8s 1.30 to the e2e matrix

2024-05-14 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2627:
---

 Summary: Add K8s 1.30 to the e2e matrix
 Key: YUNIKORN-2627
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2627
 Project: Apache YuniKorn
  Issue Type: Improvement
Reporter: Wilfred Spiegelenburg


k8s 1.30 support in kind is now available as part of the [0.23 
release|https://github.com/kubernetes-sigs/kind/releases/tag/v0.23.0]

Need to add 1.30 to the matrix for the next release






[jira] [Commented] (YUNIKORN-2609) Improve visual style of the Web UI

2024-05-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846204#comment-17846204
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2609:
-

Also, about the "Logs" link on the application page: I don't think we have 
that. Or does it point to the allocation logs? In that case we might want to 
come up with a nice pictogram for that link instead of the text.

> Improve visual style of the Web UI
> --
>
> Key: YUNIKORN-2609
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2609
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: webapp
>Reporter: Denis Coric
>Priority: Major
>  Labels: newbie
>
> Implement required CSS changes to tweak the overall look and feel of the web 
> UI.
> The full design can be previewed on this link: [ 
> [DESIGN|https://xd.adobe.com/view/1d84899f-72a8-472f-b03f-de40451b0956-48d7/] 
> ]
> This should include:
>  * Fix padding/margin values
>  * Add rounding on elements to match the design (menu selection, dropdowns, 
> etc)
>  * Fix font weight on visual elements to match the design
> _Note: Queues page can be skipped as it is being redesigned in YUNIKORN-2341_






[jira] [Commented] (YUNIKORN-2609) Improve visual style of the Web UI

2024-05-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846199#comment-17846199
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2609:
-

The design looks OK to me.

I do have a question around the resources: the resource display was recently 
expanded to show more than just memory and CPU. How does that change affect the 
design that is shown in the link? Do areas expand and collapse correctly when 
the list of resources becomes larger? This matters especially for nodes, but 
any object is affected. Most nodes will show 7+ resource types as allocatable, 
used etc.

Some detail is in https://github.com/apache/yunikorn-web/pull/146

> Improve visual style of the Web UI
> --
>
> Key: YUNIKORN-2609
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2609
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: webapp
>Reporter: Denis Coric
>Priority: Major
>  Labels: newbie
>
> Implement required CSS changes to tweak the overall look and feel of the web 
> UI.
> The full design can be previewed on this link: [ 
> [DESIGN|https://xd.adobe.com/view/1d84899f-72a8-472f-b03f-de40451b0956-48d7/] 
> ]
> This should include:
>  * Fix padding/margin values
>  * Add rounding on elements to match the design (menu selection, dropdowns, 
> etc)
>  * Fix font weight on visual elements to match the design
> _Note: Queues page can be skipped as it is being redesigned in YUNIKORN-2341_






[jira] [Resolved] (YUNIKORN-2531) Create unit tests for AsyncRMCallback

2024-05-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2531.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

new tests added to the system to improve coverage

> Create unit tests for AsyncRMCallback
> -
>
> Key: YUNIKORN-2531
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2531
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> There are no unit tests for the {{AsyncRMCallback}} type in the shim 
> (scheduler_callback.go). It's tested indirectly but we have no idea about the 
> coverage or how it behaves in rare scenarios.
> At least longer methods such as {{UpdateApplication()}}, 
> {{UpdateAllocation()}} and {{UpdateNode()}} should be covered.






[jira] [Resolved] (YUNIKORN-2615) Remove named returns from predicate_manager.go

2024-05-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2615.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

refactor committed to master for 1.6.0

> Remove named returns from predicate_manager.go
> --
>
> Key: YUNIKORN-2615
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2615
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Predicate manager has defined named returns on some functions but does not 
> use them. They should be removed as the way they are used can cause issues 
> that are hard to debug.
>  






[jira] [Created] (YUNIKORN-2618) Streamline AsyncRMCallback UpdateAllocation

2024-05-09 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2618:
---

 Summary: Streamline AsyncRMCallback UpdateAllocation
 Key: YUNIKORN-2618
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2618
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


If a task is not found, {{context.getTask}} returns nil. For {{response.New}} 
processing we should just log that fact and proceed to the next alloc. That 
simplifies the flow as we never need to check for a nil task afterwards. We 
should never have a pod in the cache that does not exist as a task on an 
application.

We retrieve the application using the application ID from the response, but 
never use the object. We only use the application ID to pass into an event. The 
context event handler then does the exact same lookup again to process the 
event on the app.

We need to become much smarter in this area: we do double or triple lookups and 
generate async events that just change the state of the app or task, or kick 
off another event.
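
The "log and proceed to the next alloc" flow could look roughly like the sketch 
below. This is only an illustration: {{Task}}, {{Allocation}}, {{Context}} and 
{{processNew}} are hypothetical stand-ins and not the real shim types or the 
real {{UpdateAllocation()}} code.

```go
package main

import "fmt"

// Task, Allocation and Context are hypothetical stand-ins for the shim types.
type Task struct{ ID string }

type Allocation struct {
	ApplicationID string
	AllocationKey string
}

type Context struct {
	tasks map[string]*Task // keyed by "<appID>/<allocationKey>"
}

// getTask returns nil when no task is registered for the given IDs.
func (c *Context) getTask(appID, key string) *Task {
	return c.tasks[appID+"/"+key]
}

// processNew handles new allocations and returns the IDs of the tasks that
// were found. Unknown tasks are logged and skipped, so the rest of the loop
// body never has to deal with a nil task.
func processNew(c *Context, allocs []Allocation) []string {
	handled := make([]string, 0, len(allocs))
	for _, alloc := range allocs {
		task := c.getTask(alloc.ApplicationID, alloc.AllocationKey)
		if task == nil {
			fmt.Printf("task not found, skipping allocation %s\n", alloc.AllocationKey)
			continue
		}
		handled = append(handled, task.ID)
	}
	return handled
}

func main() {
	ctx := &Context{tasks: map[string]*Task{"app-1/pod-a": {ID: "pod-a"}}}
	allocs := []Allocation{
		{ApplicationID: "app-1", AllocationKey: "pod-a"},
		{ApplicationID: "app-1", AllocationKey: "pod-missing"},
	}
	fmt.Println(processNew(ctx, allocs))
}
```

The early {{continue}} keeps the nil check in one place at the top of the loop.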






[jira] [Created] (YUNIKORN-2616) Remove unused bool return from PreemptionPredicates()

2024-05-08 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2616:
---

 Summary: Remove unused bool return from PreemptionPredicates()
 Key: YUNIKORN-2616
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2616
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


The predicate manager method {{PreemptionPredicates()}} returns two values: an 
int and a boolean. The boolean is false if the integer is -1 and true for 0 or 
larger. There is no need for the boolean as the -1 already indicates the same.
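
As a rough sketch of the single return value idea: the caller only needs to 
compare the index against zero. The function {{firstFit}} below is a 
hypothetical stand-in, not the real {{PreemptionPredicates()}} signature.

```go
package main

import "fmt"

// firstFit is a hypothetical stand-in: it returns the index of the first
// candidate at or after start that fits, or -1 when none does. No separate
// boolean is needed because a negative index already signals "not found".
func firstFit(fits []bool, start int) int {
	if start < 0 {
		start = 0
	}
	for i := start; i < len(fits); i++ {
		if fits[i] {
			return i
		}
	}
	return -1
}

func main() {
	idx := firstFit([]bool{false, false, true}, 0)
	if idx < 0 {
		fmt.Println("no fit found")
		return
	}
	fmt.Println("fit found at index", idx)
}
```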






[jira] [Created] (YUNIKORN-2615) Remove named returns from predicate_manager.go

2024-05-08 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2615:
---

 Summary: Remove named returns from predicate_manager.go
 Key: YUNIKORN-2615
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2615
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Predicate manager has defined named returns on some functions but does not use 
them. They should be removed as the way they are used can cause issues that are 
hard to debug.
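
A small example of the kind of issue named returns can cause, as a sketch only. 
The functions below are hypothetical and not taken from predicate_manager.go. 
In {{lookupNamed}} the inner {{ok}} shadows the named return, so the function 
reports false even when the key exists.

```go
package main

import "fmt"

// lookupNamed declares named returns. The short variable declaration in the
// if statement creates a new ok that shadows the named return, so the bare
// return at the end always hands back ok == false.
func lookupNamed(m map[string]int, key string) (value int, ok bool) {
	if v, ok := m[key]; ok {
		value = v
	}
	return
}

// lookup uses unnamed returns and returns explicit values, which leaves no
// room for the shadowing mistake above.
func lookup(m map[string]int, key string) (int, bool) {
	v, found := m[key]
	return v, found
}

func main() {
	m := map[string]int{"a": 1}
	fmt.Println(lookupNamed(m, "a"))
	fmt.Println(lookup(m, "a"))
}
```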

 






[jira] [Resolved] (YUNIKORN-2601) Update kindest/node: v1.29.1 to v1.29.2, v1.28.6 to v1.28.7, v1.27.10 to v1.27.11, v1.26.13 -> v1.26.14

2024-05-08 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2601.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Changes committed.

No kind release for 1.30 is available yet; we should log a new Jira to add it later.

> Update kindest/node:  v1.29.1 to v1.29.2, v1.28.6 to v1.28.7, v1.27.10 to 
> v1.27.11, v1.26.13 -> v1.26.14
> 
>
> Key: YUNIKORN-2601
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2601
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: test - e2e
>Reporter: Chia-Ping Tsai
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> as title






[jira] [Resolved] (YUNIKORN-2591) Document placement rules always

2024-05-06 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2591.
-
Fix Version/s: 1.5.1
   1.5.0
   1.4.0
   Resolution: Fixed

Change made to the docs for 1.4.0 and 1.5.0.

It will also be part of the 1.5.1 release.

> Document placement rules always
> ---
>
> Key: YUNIKORN-2591
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2591
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Wilfred Spiegelenburg
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.5.1, 1.5.0, 1.4.0
>
>
> The current [doc 
> says|https://yunikorn.apache.org/docs/user_guide/queue_config#placement-rules]:
> {quote}If no rules are defined the placement manager is not started and each 
> application _must_ have a queue set on submit.
> {quote}
> This is not correct: we moved to placement rules always in YUNIKORN-1793 in 
> YuniKorn 1.4. The documentation needs to be updated to reflect that.






[jira] [Resolved] (YUNIKORN-2596) Enhance layout for release announcements

2024-05-06 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2596.
-
Fix Version/s: 1.5.1
   Resolution: Fixed

Fixed and published. Changes applied to the 1.5.0 layout, before the 1.5.1 
release.

Marking as fixed in 1.5.1

> Enhance layout for release announcements
> 
>
> Key: YUNIKORN-2596
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2596
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.5.1
>
> Attachments: release_announce.png, releasee_announce_updated.png
>
>
> The current release announcements page lacks a decent layout. The page is 
> generated during the build based on the directory content.
> Some simple updates would make the page more readable.






[jira] [Resolved] (YUNIKORN-2595) Fix download page links

2024-05-06 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2595.
-
Fix Version/s: 1.5.1
   Resolution: Fixed

Download page fixed for 1.5.0, deployed before the 1.5.1 release.

Marking as fixed in 1.5.1

> Fix download page links
> ---
>
> Key: YUNIKORN-2595
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2595
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.5.1
>
>
> The download links must follow a specific set of rules as specified 
> [here|https://infra.apache.org/release-download-pages.html].
> We currently do not set the correct download link for the source package. We 
> dropped the closer.lua resolution for the content network in one of the 
> releases. With the next release, 1.5.1, coming up we need to fix this.






[jira] [Updated] (YUNIKORN-2595) Fix download page links

2024-04-29 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2595:

Priority: Minor  (was: Major)

> Fix download page links
> ---
>
> Key: YUNIKORN-2595
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2595
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>
> The download links must follow a specific set of rules as specified 
> [here|https://infra.apache.org/release-download-pages.html].
> We currently do not set the correct download link for the source package. We 
> dropped the closer.lua resolution for the content network in one of the 
> releases. With the next release, 1.5.1, coming up we need to fix this.






[jira] [Created] (YUNIKORN-2595) Fix download page links

2024-04-29 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2595:
---

 Summary: Fix download page links
 Key: YUNIKORN-2595
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2595
 Project: Apache YuniKorn
  Issue Type: Task
  Components: website
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The download links must follow a specific set of rules as specified 
[here|https://infra.apache.org/release-download-pages.html].

We currently do not set the correct download link for the source package. We 
dropped the closer.lua resolution for the content network in one of the 
releases. With the next release, 1.5.1, coming up we need to fix this.






[jira] [Commented] (YUNIKORN-2593) Simplify partition name

2024-04-29 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841875#comment-17841875
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2593:
-

We need to be careful here. This now forces a unique partition to be specified 
by all shims that register. If that is not the case we break. One shim would 
overwrite the partition setup of a second shim. The first part of the "full" 
partition name is the hostname and port of the shim that registers the 
partition allowing remote shims to be identified.

If you are going to do this we might as well drop the whole multi and remote 
shim and multi-partition design, which would mean moving to one repository and 
removing the SI etc. along the way. I don't think that is a good idea. 

What I do not understand is why we have the partition anywhere in the scheduler 
objects. With objects I refer to anything like application, ask or node etc. 
Those cannot belong to anything but one partition and are only referenced from 
that one partition. They should not have the partition details as part of the 
object. It is redundant information taking up memory.

Simply removing the partition name from all these objects should suffice.

BTW: The webservice broke the whole remote and multi shim idea when it was 
set up and we never got around to fixing that. We do not want to break it 
further.

> Simplify partition name
> ---
>
> Key: YUNIKORN-2593
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2593
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
>
> Currently, partition names are treated differently in different places within 
> the core. Specifically, sometimes they are bare (i.e. "default") and other 
> places they are composite (i.e. "[rm:123]default"). This is confusing and 
> unnecessary. It also hampers efforts to merge the AllocationAsk and 
> Allocation objects, as the semantics are different between them. Switch to 
> using bare form ("default") everywhere instead.
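
To make the two forms concrete, here is a minimal sketch of converting between 
the bare and the composite name. It assumes the composite layout is exactly 
"[<rmID>]<name>" as in the example from the description; {{bareName}} and 
{{compositeName}} are hypothetical helpers and not the functions used in the 
core.

```go
package main

import (
	"fmt"
	"strings"
)

// bareName strips a leading "[...]" prefix, if present, and returns the plain
// partition name. Names without the prefix are returned unchanged.
func bareName(partition string) string {
	if strings.HasPrefix(partition, "[") {
		if idx := strings.Index(partition, "]"); idx >= 0 {
			return partition[idx+1:]
		}
	}
	return partition
}

// compositeName prefixes the bare partition name with the RM identifier.
func compositeName(rmID, partition string) string {
	return "[" + rmID + "]" + bareName(partition)
}

func main() {
	fmt.Println(bareName("[rm:123]default"))
	fmt.Println(compositeName("rm:123", "default"))

	// Two shims registering the same bare name end up under the same key,
	// which is the overwrite concern raised in the comment above.
	partitions := map[string]string{}
	partitions[bareName("[rm:123]default")] = "rm:123"
	partitions[bareName("[rm:456]default")] = "rm:456"
	fmt.Println(len(partitions), partitions["default"])
}
```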






[jira] [Created] (YUNIKORN-2591) Document placement rules always

2024-04-25 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2591:
---

 Summary: Document placement rules always
 Key: YUNIKORN-2591
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2591
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: documentation
Reporter: Wilfred Spiegelenburg


The current [doc 
says|https://yunikorn.apache.org/docs/user_guide/queue_config#placement-rules]:
{quote}If no rules are defined the placement manager is not started and each 
application _must_ have a queue set on submit.
{quote}
This is not correct: we moved to placement rules always in YUNIKORN-1793 in 
YuniKorn 1.4. The documentation needs to be updated to reflect that.






[jira] [Created] (YUNIKORN-2590) Handler tests should check for nil request on create

2024-04-25 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2590:
---

 Summary: Handler tests should check for nil request on create
 Key: YUNIKORN-2590
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2590
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common, test - unit
Reporter: Wilfred Spiegelenburg


In the handler_test.go file we have an anti-pattern that triggers a large number 
(40+) of warnings in an IDE:
{quote}'req' might have 'nil' or other unexpected value as its corresponding 
error variable might be not 'nil'
{quote}
The warnings are due to the fact that we have the following pattern:
{code:java}
req, err = http.NewRequest("GET", "path", strings.NewReader(""))
req = req.WithContext(context.WithValue(req.Context(), httprouter.ParamsKey, 
httprouter.Params{})){code}
There is no error assertion after the request creation. We should insert a 
simple {{assert.NilError(t, err, "HTTP request create failed")}} between 
creating and using the request.
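
A minimal sketch of the checked pattern, using only the standard library so it 
runs on its own. The helper {{newTestRequest}} and the {{paramsKey}} context 
key are hypothetical; in the real tests the check would be the 
{{assert.NilError}} call and the key would be {{httprouter.ParamsKey}}.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"strings"
)

// paramsKey is a hypothetical context key standing in for httprouter.ParamsKey.
type paramsKey struct{}

// newTestRequest creates the request and checks the error before the request
// is used, so req is never dereferenced when creation failed.
func newTestRequest(method, path string) (*http.Request, error) {
	req, err := http.NewRequest(method, path, strings.NewReader(""))
	if err != nil {
		return nil, fmt.Errorf("HTTP request create failed: %w", err)
	}
	ctx := context.WithValue(req.Context(), paramsKey{}, map[string]string{})
	return req.WithContext(ctx), nil
}

func main() {
	req, err := newTestRequest("GET", "path")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)

	// A method containing a space is rejected by http.NewRequest.
	if _, err = newTestRequest("bad method", "path"); err != nil {
		fmt.Println(err)
	}
}
```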






[jira] [Commented] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds

2024-04-24 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840296#comment-17840296
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2580:
-

Work towards one object for allocations and asks is in progress. YUNIKORN-2457 
is open and actively worked on, which means the whole ask object is going 
through major changes soon. At that point things that are no longer needed or 
were never used will disappear automatically.

Doing this one field at a time causes extra churn and makes it more difficult 
to track the how and why.

> Remove executionTimeoutMilliSeconds
> ---
>
> Key: YUNIKORN-2580
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2580
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: scheduler-interface
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211]
> It is not used actually, and hence we should either remove it or add facility 
> for it. Personally, I'd like to remove it to simplify the interface.






[jira] [Created] (YUNIKORN-2581) Expose running placement rules in REST

2024-04-23 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2581:
---

 Summary: Expose running placement rules in REST
 Key: YUNIKORN-2581
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
 Project: Apache YuniKorn
  Issue Type: New Feature
  Components: core - common
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Since introducing the use of placement rules always and the recovery rule, the 
queue config does not correctly show the running rules.

Also, if a config update has been rejected, for any reason, the rules shown 
would not be correct.

Exposing the configured rules from the placement manager works around all these 
issues.






[jira] [Resolved] (YUNIKORN-2575) Make logging for IsPodFitNode clear

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2575.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Unique errors are returned for all failure cases, which at DEBUG level will 
show exactly why the failure occurred.

> Make logging for IsPodFitNode clear
> ---
>
> Key: YUNIKORN-2575
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2575
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The logging in {{IsPodFitNode()}} logs the same message for a missing pod and 
> node. We should log clearly which thing is missing: the node or the pod.
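
One way to keep the two cases apart is a distinct error per missing object. The 
sketch below is an assumption about the general shape only: {{checkInputs}} and 
the sentinel errors are hypothetical and not the code that was committed for 
{{IsPodFitNode()}}.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors so callers and logs can tell the two cases apart.
var (
	errPodNotFound  = errors.New("pod not found in cache")
	errNodeNotFound = errors.New("node not found in cache")
)

// checkInputs is a hypothetical helper: it returns a different wrapped error
// depending on whether the pod or the node is missing.
func checkInputs(pods, nodes map[string]bool, podUID, nodeID string) error {
	if !pods[podUID] {
		return fmt.Errorf("predicates not run for pod %q: %w", podUID, errPodNotFound)
	}
	if !nodes[nodeID] {
		return fmt.Errorf("predicates not run on node %q: %w", nodeID, errNodeNotFound)
	}
	return nil
}

func main() {
	pods := map[string]bool{"pod-1": true}
	nodes := map[string]bool{"node-1": true}
	cases := [][2]string{{"pod-1", "node-1"}, {"pod-2", "node-1"}, {"pod-1", "node-2"}}
	for _, c := range cases {
		err := checkInputs(pods, nodes, c[0], c[1])
		switch {
		case errors.Is(err, errPodNotFound):
			fmt.Println("missing pod:", err)
		case errors.Is(err, errNodeNotFound):
			fmt.Println("missing node:", err)
		default:
			fmt.Println("pod and node found")
		}
	}
}
```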






[jira] [Comment Edited] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840250#comment-17840250
 ] 

Wilfred Spiegelenburg edited comment on YUNIKORN-2580 at 4/24/24 12:05 AM:
---

This is used for the placeholder timeout and cannot be removed.

See handleSubmitApplicationEvent 
[here|https://github.com/apache/yunikorn-k8shim/blob/741c0d801ac4530669b8850706efe3f0bc0d5718/pkg/cache/application.go#L437]


was (Author: wifreds):
This is used for the placeholder timeout and cannot be removed.

> Remove executionTimeoutMilliSeconds
> ---
>
> Key: YUNIKORN-2580
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2580
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: scheduler-interface
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211]
> It is not used actually, and hence we should either remove it or add facility 
> for it. Personally, I'd like to remove it to simplify the interface.






[jira] [Resolved] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2580.
-
Resolution: Won't Fix

This is used for the placeholder timeout and cannot be removed.

> Remove executionTimeoutMilliSeconds
> ---
>
> Key: YUNIKORN-2580
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2580
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: scheduler-interface
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211]
> It is not used actually, and hence we should either remove it or add facility 
> for it. Personally, I'd like to remove it to simplify the interface.






[jira] [Comment Edited] (YUNIKORN-2577) Remove named returns from IsPodFitNodeViaPreemption

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840149#comment-17840149
 ] 

Wilfred Spiegelenburg edited comment on YUNIKORN-2577 at 4/23/24 5:13 PM:
--

BTW: not sure why {{GetPodNoLock}} returns two values. The pod is nil if the 
boolean is false, and not nil if the boolean is true. The signature can be 
simplified to just returning the pod. Should probably be a new jira.

edit: logged YUNIKORN-2578 for the refactor


was (Author: wifreds):
BTW: not sure why {{GetPodNoLock}} returns two values. The pod is nil if the 
boolean is false, pod is not nil if boolean is true The signature can be 
simplified to just returning the pod. Should probably be a new jira.

> Remove named returns from IsPodFitNodeViaPreemption
> ---
>
> Key: YUNIKORN-2577
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2577
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Minor
>  Labels: newbie
>
> IsPodFitNodeViaPreemption has defined named returns but does not use them. 
> They should be removed as the way they are used can cause issues that are 
> hard to debug.
> As part of this change we need to clean up further:
> * The variable {{ok}} also gets shadowed multiple times, not just from the 
> named return declaration.
> * The if construct around {{GetPodNoLock()}} is not needed as it returns a 
> nil for the pod if it returns false. Just adding the result for the pod 
> always has the same effect.






[jira] [Created] (YUNIKORN-2578) Refactor SchedulerCache.GetPod() remove bool return

2024-04-23 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2578:
---

 Summary: Refactor SchedulerCache.GetPod() remove bool return
 Key: YUNIKORN-2578
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2578
 Project: Apache YuniKorn
  Issue Type: Task
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


SchedulerCache {{GetPod()}} and {{GetPodNoLock()}} return two values:
# *v1.Pod
# bool

The boolean value is redundant as it is false if the pod is not found and a nil 
is returned for the pod. The boolean is true if the pod has a value. Testing 
for a nil pod has the same result.

We do not cache a nil pod in the cache for a pod UID.
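
A rough sketch of what the single return value could look like. The 
{{podCache}} and {{Pod}} types are hypothetical stand-ins for SchedulerCache 
and v1.Pod, and the locking is simplified.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical stand-in for v1.Pod.
type Pod struct{ Name string }

// podCache is a hypothetical stand-in for the scheduler cache.
type podCache struct {
	lock sync.RWMutex
	pods map[string]*Pod // keyed by pod UID, nil values are never stored
}

// GetPod returns the pod for the UID, or nil if it is not in the cache.
func (c *podCache) GetPod(uid string) *Pod {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.GetPodNoLock(uid)
}

// GetPodNoLock must only be called with the lock held. A missing key yields
// the zero value of the map element type, which is a nil pointer.
func (c *podCache) GetPodNoLock(uid string) *Pod {
	return c.pods[uid]
}

func main() {
	cache := &podCache{pods: map[string]*Pod{"uid-1": {Name: "pod-1"}}}
	if pod := cache.GetPod("uid-1"); pod != nil {
		fmt.Println("found", pod.Name)
	}
	if pod := cache.GetPod("uid-2"); pod == nil {
		fmt.Println("uid-2 not in cache")
	}
}
```

Since a nil pod is never stored, the nil check carries the same information as 
the boolean did.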






[jira] [Created] (YUNIKORN-2577) Remove named returns from IsPodFitNodeViaPreemption

2024-04-23 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2577:
---

 Summary: Remove named returns from IsPodFitNodeViaPreemption
 Key: YUNIKORN-2577
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2577
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


IsPodFitNodeViaPreemption has defined named returns but does not use them. They 
should be removed as the way they are used can cause issues that are hard to 
debug.

As part of this change we need to clean up further:
* The variable {{ok}} also gets shadowed multiple times, not just from the 
named return declaration.
* The if construct around {{GetPodNoLock()}} is not needed as it returns a nil 
for the pod if it returns false. Just adding the result for the pod always has 
the same effect.
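
A small sketch of the kind of problem meant here, using hypothetical functions rather than the real {{IsPodFitNodeViaPreemption}} code. The named return {{ok}} is declared but a {{:=}} inside the if statement creates a second, shadowing {{ok}}, so the named return is never set:

```go
package main

import "fmt"

// lookup is a hypothetical helper that reports whether a key is known.
func lookup(key string) (string, bool) {
	if key == "known" {
		return "value", true
	}
	return "", false
}

// withNamedReturn declares a named return ok, but the := in the if
// statement declares a new ok that shadows it. The named ok stays false.
func withNamedReturn(key string) (result string, ok bool) {
	if v, ok := lookup(key); ok {
		result = v
	}
	return // returns the outer ok, which was never assigned
}

// withoutNamedReturn returns explicit values, so nothing can be shadowed.
func withoutNamedReturn(key string) (string, bool) {
	v, found := lookup(key)
	return v, found
}

func main() {
	r1, ok1 := withNamedReturn("known")
	r2, ok2 := withoutNamedReturn("known")
	fmt.Println(r1, ok1) // value false
	fmt.Println(r2, ok2) // value true
}
```

Both functions find the value, but only the version without named returns reports success, which is the hard to debug behaviour the description warns about.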






[jira] [Commented] (YUNIKORN-2576) Data Race: Flaky tests in dispatcher_test.go

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839986#comment-17839986
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2576:
-

The panic that triggered the race in the test you logged shows that we have a 
bigger problem than just a race condition in this test at the moment.

> Data Race: Flaky tests in dispatcher_test.go
> 
>
> Key: YUNIKORN-2576
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2576
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: test - unit
>Reporter: Yu-Lin Chen
>Priority: Major
> Attachments: shim-race.txt
>
>
> How to reproduce:
>  # In Shim, run 'go test ./pkg/... -race -count=10  > shim-race.txt' 
>  
> {code:java}
> WARNING: DATA RACE
> Write at 0x035315e0 by goroutine 88:
>   github.com/apache/yunikorn-k8shim/pkg/dispatcher.initDispatcher()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:73 
> +0x2c4
>   github.com/apache/yunikorn-k8shim/pkg/dispatcher.createDispatcher()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:305
>  +0x2f
>   runtime.Goexit()
>       /usr/local/go/src/runtime/panic.go:626 +0x5d
>   testing.(*T).FailNow()
>       :1 +0x31
>   gotest.tools/v3/assert.Equal()
>       
> /home/chenyulin0719/go/pkg/mod/gotest.tools/v3@v3.5.1/assert/assert.go:205 
> +0x1aa
>   github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:244
>  +0x2ba
>   testing.tRunner()
>       /usr/local/go/src/testing/testing.go:1689 +0x21e
>   testing.(*T).Run.gowrap1()
>       /usr/local/go/src/testing/testing.go:1742 +0x44
> Previous read at 0x035315e0 by goroutine 90:
>   
> github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.func1()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:188 
> +0x2f5
>   
> github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.gowrap1()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:197 
> +0x6e
> Goroutine 88 (running) created at:
>   testing.(*T).Run()
>       /usr/local/go/src/testing/testing.go:1742 +0x825
>   testing.runTests.func1()
>       /usr/local/go/src/testing/testing.go:2161 +0x85
>   testing.tRunner()
>       /usr/local/go/src/testing/testing.go:1689 +0x21e
>   testing.runTests()
>       /usr/local/go/src/testing/testing.go:2159 +0x8be
>   testing.(*M).Run()
>       /usr/local/go/src/testing/testing.go:2027 +0xf17
>   main.main()
>       _testmain.go:55 +0x2bd
> Goroutine 90 (running) created at:
>   
> github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:178 
> +0x391
>   github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).dispatch()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:164 
> +0xbb
>   github.com/apache/yunikorn-k8shim/pkg/dispatcher.Dispatch()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:142 
> +0x71
>   github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout()
>       
> /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:232
>  +0x244
>   testing.tRunner()
>       /usr/local/go/src/testing/testing.go:1689 +0x21e
>   testing.(*T).Run.gowrap1()
>       /usr/local/go/src/testing/testing.go:1742 +0x44
> =={code}
> Root Cause:
>  * The [global 
> variables|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L46-L51]
>  in dispatcher.go are not protected when running unit tests. Each unit test 
> will run initDispatcher() through 
> [createDispatcher()|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatch_test.go#L305].
>  * Race occurs if any other unit tests read/write the global variables before 
> or after initDispatcher(). ex: TestDispatchTimeout()   
> [https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L188]
>  
> Solution to be discussed:
>  # Refactor dispatcher.go and encapsulate the global variables in the 
> Dispatcher struct, changing Dispatcher.Start() and Dispatcher.Stop() to type 
> methods
>  # Implement Singleton in getDispatcher() and add a new function 
> newDispatcher()
>  # Create a new Dispatcher for each unit test
>  
> The race issue only happens in unit tests because the shared variable was 
> protected by 
> once.Do(initDispatcher) in dispatcher.go : 
> 

[jira] [Commented] (YUNIKORN-2576) Data Race: Flaky tests in dispatcher_test.go

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839985#comment-17839985
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2576:
-

When I run this only the first run passes and all other 9 runs fail with an 
assertion failure like this:
{code:java}
    dispatch_test.go:244: assertion failed: 10 (int32) != 1 (int32) {code}
This test was never designed to be run multiple times, as the global variable 
values are not reset on clean up. Other tests in the same file also break, as 
they expect a 0 value for the async count when they start. That again is only 
true for the first run, not for runs 2..10.

I do see a data race but the race is triggered by 
{{TestExceedAsyncDispatchLimit()}}

A further point is that we should not use 
{{atomic.AddInt32(, 1)}}; we should use the 
{{atomic.Int32}} type introduced in Go 1.19 and call {{asyncDispatchCount.Add(1)}}.

Not sure if this requires a full refactor of the dispatcher or that these tests 
need to be fixed to be able to handle multiple runs correctly.
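
A rough sketch of both points, under assumptions: only the name {{asyncDispatchCount}} comes from the comment above, while {{resetCounter()}} and {{dispatchMany()}} are hypothetical helpers and not part of the dispatcher. It shows the {{atomic.Int32}} type and a reset that lets the same check pass on every run:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// asyncDispatchCount uses the typed atomic introduced in Go 1.19.
var asyncDispatchCount atomic.Int32

// resetCounter is a hypothetical helper a test could call before it starts,
// so that repeated runs (for example with -count=10) begin from zero.
func resetCounter() {
	asyncDispatchCount.Store(0)
}

// dispatchMany is a hypothetical stand-in for async dispatching: it starts
// n goroutines that each increment the counter once and waits for them.
func dispatchMany(n int) int32 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			asyncDispatchCount.Add(1)
		}()
	}
	wg.Wait()
	return asyncDispatchCount.Load()
}

func main() {
	for run := 1; run <= 3; run++ {
		resetCounter()
		fmt.Println("run", run, "count", dispatchMany(10))
	}
}
```

Without the reset the second run would start from the value the first run left behind, which matches the kind of failure shown in the assertion above.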


[jira] [Created] (YUNIKORN-2575) Make logging for IsPodFitNode clear

2024-04-22 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2575:
---

 Summary: Make logging for IsPodFitNode clear
 Key: YUNIKORN-2575
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2575
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The logging in {{IsPodFitNode()}} logs the same message for a missing pod and 
node. We should log clearly which thing is missing: the node or the pod.
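
As an illustration only, the distinction could look like the sketch below. {{missingMessage()}} and its parameters are hypothetical; the real function would log through the shim logger instead of returning a string:

```go
package main

import "fmt"

// missingMessage is a hypothetical helper that builds a distinct message
// depending on which object could not be found. It returns an empty string
// when both the pod and the node are present.
func missingMessage(podFound, nodeFound bool, podName, nodeName string) string {
	switch {
	case !podFound && !nodeFound:
		return fmt.Sprintf("pod %q and node %q not found in cache", podName, nodeName)
	case !podFound:
		return fmt.Sprintf("pod %q not found in cache", podName)
	case !nodeFound:
		return fmt.Sprintf("node %q not found in cache", nodeName)
	default:
		return ""
	}
}

func main() {
	fmt.Println(missingMessage(false, true, "pod-1", "node-1"))
	fmt.Println(missingMessage(true, false, "pod-1", "node-1"))
}
```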






[jira] [Commented] (YUNIKORN-2573) Unit test occasionally failed due to dead lock

2024-04-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839453#comment-17839453
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2573:
-

{quote}Since there always an error or warning from scheduler health check when 
running multiple tests at the same time,
{quote}
Health checks collect details while other things are running. There is no 
"stop the world" locking, which means that things can change while the health 
checks run. This can sometimes lead to a comparison of data from before a 
change with data from after a change, showing a health issue.

Unless the tests hang and do not finish, there is no deadlock.

> Unit test occasionally failed due to dead lock
> --
>
> Key: YUNIKORN-2573
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2573
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Arthur Wang
>Assignee: Arthur Wang
>Priority: Minor
>
> [github 
> pipeline|https://github.com/apache/yunikorn-core/actions/runs/8770718393/job/24067600801]
> Unit test occasionally failed due to dead lock
> Still working on finding root cause.
> Since there always an error or warning from scheduler health check when 
> running multiple tests at the same time,
> maybe it's some test setting issue.






[jira] [Updated] (YUNIKORN-2571) Add hierarchy icon to queue node

2024-04-21 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2571:

 Fix Version/s: (was: 1.6.0)
Target Version: 1.6.0

Please use the target version when setting a release for which the fix is 
planned. The fix version is the release in which the changes are committed and 
included. It is set on closure of the Jira, after the changes are committed.

Even open jiras show up as part of the release notes for that release. This can 
lead to incorrect info in a release.

> Add hierarchy icon to queue node
> 
>
> Key: YUNIKORN-2571
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2571
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Dong-Lin Hsieh
>Assignee: Dong-Lin Hsieh
>Priority: Major
>  Labels: pull-request-available
>
> make queue node looks better !






[jira] [Updated] (YUNIKORN-2570) Add test cases to break the current preemption flow

2024-04-21 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2570:

 Fix Version/s: (was: 1.6.0)
Target Version: 1.6.0

Please use the target version when setting a release for which the fix is 
planned. The fix version is the release in which the changes are committed and 
included. It is set on closure of the Jira, after the changes are committed.

Even open jiras show up as part of the release notes for that release. This can 
lead to incorrect info in a release.

> Add test cases to break the current preemption flow
> ---
>
> Key: YUNIKORN-2570
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2570
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>  Labels: pull-request-available
>
> Add various test cases to break the current preemption flow. These test would 
> fail now. Follow up jira's 
> [https://issues.apache.org/jira/browse/YUNIKORN-2500] should fix the problems 
> in current preemption flow so that these test cases should pass.






[jira] [Updated] (YUNIKORN-2526) Discrepancy between shim cache and core app/task list after scheduler restart

2024-04-17 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2526:

Target Version: 1.5.1

> Discrepancy between shim cache and core app/task list after scheduler restart
> -
>
> Key: YUNIKORN-2526
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2526
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Shravan Achar
>Priority: Major
> Attachments: log-snippet.txt, state-dump-4-1-3.json, 
> state-dump-4-17.json.zip
>
>
> When scheduler restarts, occasionally it gets into a situation where the 
> application is still in Running state despite the application getting 
> terminated in the cluster. This is confirmed with the attached state dump.
>  
> The scheduler core logs indicate all nodes are being evaluated for 
> non-existing application (also attached). The CPU is being used up doing this 
> unneeded evaluation.






[jira] [Updated] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-17 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2562:

Target Version: 1.5.1

> Nil pointer in Application.ReplaceAllocation()
> --
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z  INFOshim.cache.task cache/task.go:542   
> releasing allocations   {"numOfAsksToRelease": 1, 
> "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z  INFOshim.fsmcache/task_state.go:380 
> Task state transition   {"app": "application-spark-abrdrsmo8no2", "task": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": 
> "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", 
> "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z  INFOcore.scheduler.application  
> objects/application.go:616  ask removed successfully from application 
>   {"appID": "application-spark-abrdrsmo8no2", "ask": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z  INFOcore.scheduler.partition
> scheduler/partition.go:1281 replacing placeholder allocation
> {"appID": "application-spark-abrdrsmo8no2", "allocationID": 
> "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600,
>  {0xc007710cf0, 0x24})
>   
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
>  +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?,
>  0xc009786700)
>   
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 
> +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?,
>  {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 
> +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?,
>  0xc0071a3f10?)
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 
> +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 
> +0x1c5
> created by 
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
> goroutine 1
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 
> +0x9c
> {noformat}






[jira] [Created] (YUNIKORN-2556) Remove getResourceUsageDAOInfo from test code

2024-04-12 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2556:
---

 Summary: Remove getResourceUsageDAOInfo from test code
 Key: YUNIKORN-2556
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2556
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common
Reporter: Wilfred Spiegelenburg


Remove the {{getResourceUsageDAOInfo()}} call from the test code. If we need to 
retrieve the usage for the whole queueTracker hierarchy we should add that in 
the test code separately instead of using the DAO and converting that back.

The DAO object should also not contain the pointer to the resource object. It 
should contain the DAOMap for the resource object similar to all other DAO 
definitions.






[jira] [Created] (YUNIKORN-2555) Cleanup placement rules in partition

2024-04-12 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2555:
---

 Summary: Cleanup placement rules in partition
 Key: YUNIKORN-2555
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2555
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler
Reporter: Wilfred Spiegelenburg


The placement rule config is tracked in the partition in the object 
{{partition.rules}}.

This object contains the config with which the placement manager is 
initialised. This was needed before the move to always use placement rules. 
Since that change it no longer has a function. The config is now also out of 
sync with the rules used in the placement manager.

There is no need to keep this object in the partition.






[jira] [Created] (YUNIKORN-2540) clean up constants in pkg/cache/context_test.go

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2540:
---

 Summary: clean up constants in pkg/cache/context_test.go
 Key: YUNIKORN-2540
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2540
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


Constants are duplicated in {{pkg/cache/context_test.go}}.

For example, {{fakeNodeName}} is defined multiple times in the file. We should 
move to a central point of defining the constants for the test at the top of 
the file.






[jira] [Commented] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0

2024-04-04 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834115#comment-17834115
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2534:
-

Documented in this section: 
[https://yunikorn.apache.org/docs/user_guide/queue_config#queues]

To provide full detail and the reasoning behind it. The resource check is 
different. I can specify a quota like this:
{code:java}
vcores: 1000
memory: 1T
nvidia.com/gpu: 0{code}
That is a valid quota and we apply that. It is a different quota than this one:
{code:java}
vcores: 1000
memory: 1T{code}
In the first quota you are not allowed to use the resource {{nvidia.com/gpu}}; 
in the second quota there is no limit on how many GPUs you can use.

What is not allowed in quotas is something that only specifies zeros:
{code:java}
vcores: 0{code}
or
{code:java}
vcores: 0
memory: 0
nvidia.com/gpu: 0{code}
This is the category that the max applications falls in.
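
The rules above can be sketched as follows. This is an illustration under assumptions, not the core's implementation: the {{Quota}} type, {{validQuota()}} and {{fits()}} are hypothetical names, values are plain integers, and an unset (empty) quota is assumed to mean "no quota":

```go
package main

import "fmt"

// Quota is a hypothetical, simplified quota: resource name to maximum value.
type Quota map[string]int64

// validQuota rejects a quota that is set but only specifies zeros.
// An empty quota is assumed to mean that no quota is configured.
func validQuota(q Quota) bool {
	if len(q) == 0 {
		return true
	}
	for _, limit := range q {
		if limit > 0 {
			return true
		}
	}
	return false
}

// fits checks usage against the quota. Only resource types named in the
// quota are limited: a zero limit blocks the type, a missing type is unlimited.
func fits(q Quota, usage map[string]int64) bool {
	for name, used := range usage {
		if limit, ok := q[name]; ok && used > limit {
			return false
		}
	}
	return true
}

func main() {
	withGPUZero := Quota{"vcores": 1000, "memory": 1_000_000_000_000, "nvidia.com/gpu": 0}
	withoutGPU := Quota{"vcores": 1000, "memory": 1_000_000_000_000}
	usage := map[string]int64{"vcores": 10, "nvidia.com/gpu": 1}

	fmt.Println(fits(withGPUZero, usage))      // false: GPU use is not allowed
	fmt.Println(fits(withoutGPU, usage))       // true: GPU is not limited
	fmt.Println(validQuota(Quota{"vcores": 0})) // false: only zeros
}
```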

> [Yunikorn] Quota enforcement checks are failing when we have max-application 
> set to 0
> -
>
> Key: YUNIKORN-2534
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2534
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Rajesh Kanhaiya Lal
>Priority: Major
> Attachments: yunikorn-configs-fresh.yaml
>
>
> The Max-application checks are not working when we are setting 
> max-application to 0 in the yunikorn-config file.
> The Config validation is also ignored in case of max-application is set to 0, 
> for example, the child max-application should be less or equal to the parent 
> queue is also not working when we have the max-application set to 0.
> Attached Yunikorn Config file
> User and Group tracking API also does not log max-application in the response.
>  
> {code:java}
> curl --location 'http://127.0.0.1:9080/ws/v1/partition/default/usage/users'
> [
>     {
>         "userName": "nobody",
>         "groups": {
>             "ts333w3": "*",
>             "ts433": "*",
>             "ts544": "*",
>             "ts633": "*"
>         },
>         "queues": {
>             "queuePath": "root",
>             "resourceUsage": {
>                 "Resources": {
>                     "memory": 3,
>                     "pods": 3,
>                     "vcore": 300
>                 }
>             },
>             "runningApplications": [
>                 "ts333w3",
>                 "ts433",
>                 "ts544"
>             ],
>             "children": [
>                 {
>                     "queuePath": "root.default",
>                     "resourceUsage": {
>                         "Resources": {
>                             "memory": 3,
>                             "pods": 3,
>                             "vcore": 300
>                         }
>                     },
>                     "runningApplications": [
>                         "ts333w3",
>                         "ts433",
>                         "ts544"
>                     ]
>                 }
>             ]
>         }
>     }
> ] {code}
> Could You please take a look ?
>  






[jira] [Resolved] (YUNIKORN-2520) PVC errors in AssumePod() are not handled properly

2024-04-04 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2520.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Changes merged to master

Volume issues should be handled correctly now.

> PVC errors in AssumePod() are not handled properly
> --
>
> Key: YUNIKORN-2520
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2520
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> When there is an error caused by a volume operation in 
> {{Context.AssumePod()}}, the allocation on core side will not be removed.
> Although we check the result from {{UpdateAllocation}}, the error handling is 
> just logging:
> {noformat}
> if err := callback.UpdateAllocation(response); err != nil {
>   rmp.handleUpdateResponseError(rmID, err)
>   }
> ...
> func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
> log.Log(log.RMProxy).Error("failed to handle response",
>zap.String("rmID", rmID),
>zap.Error(err))
> }{noformat}
> I suggest moving volume-related code to {{{}Task.postTaskAllocated()}}. In 
> this case, the task will transition to "Failed" state and we'll have 
> allocationID available, so we can release both the ask and the allocation:
> {noformat}
> func (task *Task) releaseAllocation() {
>   ...
>   var releaseRequest *si.AllocationRequest
>   s := TaskStates()
>   switch task.GetTaskState() {
>   case s.New, s.Pending, s.Scheduling, s.Rejected:
>   releaseRequest = common.CreateReleaseAskRequestForTask(
>   task.applicationID, task.taskID, 
> task.application.partition)  <-- release ask + allocation if possible
>   default:
>   if task.allocationID == "" {
>   ... log error ...
>   return
>   }
>   releaseRequest = 
> common.CreateReleaseAllocationRequestForTask(
>   task.applicationID, task.taskID, 
> task.allocationID, task.application.partition, task.terminationType)
>   }
> ...{noformat}
>  






[jira] [Created] (YUNIKORN-2538) Shim cache context pre-allocate slice

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2538:
---

 Summary: Shim cache context pre-allocate slice
 Key: YUNIKORN-2538
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2538
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


When building the reason string from all volume failure reasons we should 
allocate a slice once based on the size of the reasons object we get returned.

See [review 
comment|https://github.com/apache/yunikorn-k8shim/pull/810#discussion_r1550882867]
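
A minimal sketch of the idea, assuming the reasons arrive as a slice of a string-like type. {{conflictReason}} and {{joinReasons()}} are hypothetical names used for illustration and are not taken from the shim code:

```go
package main

import (
	"fmt"
	"strings"
)

// conflictReason is a hypothetical string-like type for a volume failure reason.
type conflictReason string

// joinReasons builds one reason string. The slice is allocated once with the
// capacity taken from the input, so append never has to grow it.
func joinReasons(reasons []conflictReason) string {
	parts := make([]string, 0, len(reasons))
	for _, reason := range reasons {
		parts = append(parts, string(reason))
	}
	return strings.Join(parts, ", ")
}

func main() {
	reasons := []conflictReason{"node(s) had volume node affinity conflict", "unbound PVC"}
	fmt.Println(joinReasons(reasons))
}
```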






[jira] [Created] (YUNIKORN-2537) cleanup UpdateAllocation in callback

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2537:
---

 Summary: cleanup UpdateAllocation in callback
 Key: YUNIKORN-2537
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2537
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


UpdateAllocation needs a cleanup: {{getTask()}} already checks for the 
application. No need to retrieve the application when we process response.New. 
Sending an event should be linked to the existence of the task not of the 
application.

On top of that we have the appID already in the task so we do not need to get 
it from the app.

The same logic needs to be applied to the whole function, we already do it for 
the release.* handling.






[jira] [Created] (YUNIKORN-2533) Implement String() for TrackedResource

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2533:
---

 Summary: Implement String() for TrackedResource
 Key: YUNIKORN-2533
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2533
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common
Reporter: Wilfred Spiegelenburg


To fix the way TrackedResources are logged it should implement the String() 
function.
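
For illustration, a {{String()}} implementation could look like the sketch below. The {{trackedResource}} struct here is a hypothetical, flattened stand-in and not the real TrackedResource layout. Keys are sorted so the logged text is stable between calls:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trackedResource is a hypothetical, simplified stand-in for TrackedResource.
type trackedResource struct {
	usage map[string]int64
}

// String implements fmt.Stringer with a deterministic, sorted output.
func (tr *trackedResource) String() string {
	if tr == nil {
		return "TrackedResource{}"
	}
	keys := make([]string, 0, len(tr.usage))
	for name := range tr.usage {
		keys = append(keys, name)
	}
	sort.Strings(keys)

	var sb strings.Builder
	sb.WriteString("TrackedResource{")
	for i, name := range keys {
		if i > 0 {
			sb.WriteString(", ")
		}
		fmt.Fprintf(&sb, "%s: %d", name, tr.usage[name])
	}
	sb.WriteString("}")
	return sb.String()
}

func main() {
	tr := &trackedResource{usage: map[string]int64{"vcore": 300, "memory": 3}}
	fmt.Println(tr) // TrackedResource{memory: 3, vcore: 300}
}
```

With a {{String()}} method in place, a typed stringer log field can be used instead of a catch-all field that has to inspect the object at runtime.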






[jira] [Commented] (YUNIKORN-2532) Resource usage report has an incompatible format change

2024-04-04 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833830#comment-17833830
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2532:
-

{quote}That change was done to make logging more efficient (using Any() is bad 
practice).
{quote}
It is not just bad practice: it inspects the object to try and map it to a type 
it knows. If it does not find a type it knows, it passes the object to the 
normal formatting library, which tries to do its best to create a string. That 
adds a lot of overhead.

The types in the logging code could change based on the release of the logging 
code. {{Any()}} is a last resort logger if you are not sure what type the 
object is because you pass interfaces around. That is not the case here. 
Logging now has a stable format for the message. The fact that you noticed a 
difference between {{Any()}} and the {{Stringer()}} already shows the 
formatting was a best guess...
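
To illustrate the difference with the standard library only: the sketch below is not the logging library's implementation. `lastResort`, `rawUsage` and `stableUsage` are hypothetical names. It shows a catch-all formatter that checks a few known types and otherwise falls back to generic formatting, next to a type that defines its own text form.

```go
package main

import "fmt"

// rawUsage has no String() method, so a generic formatter decides the layout.
type rawUsage struct {
	pods, vcore int64
}

// stableUsage carries the same data but controls its own text form.
type stableUsage struct {
	pods, vcore int64
}

// String implements fmt.Stringer with a fixed, explicit layout.
func (u stableUsage) String() string {
	return fmt.Sprintf("pods=%d,vcore=%d", u.pods, u.vcore)
}

// lastResort is a reduced, hypothetical model of a catch-all log field: it
// tries the types it knows and otherwise hands the value to fmt.
func lastResort(v interface{}) string {
	switch t := v.(type) {
	case string:
		return t
	case int64:
		return fmt.Sprintf("%d", t)
	default:
		return fmt.Sprintf("%+v", t)
	}
}

func main() {
	fmt.Println(lastResort(rawUsage{pods: 177, vcore: 354000}))
	fmt.Println(stableUsage{pods: 177, vcore: 354000}.String())
}
```

The first line of output depends on how the formatting library renders a struct, the second only on the String() method of the type.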

> Resource usage report has an incompatible format change
> ---
>
> Key: YUNIKORN-2532
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2532
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Yongjun Zhang
>Priority: Major
>
> There is some recent change that caused the application resource usage report 
> to have a new format:
> Prior to the change, the format was:
> {code:java}
> YK_APP_SUMMARY: {"appID": "adf53ee0-experiment-organicad-94520240-1-1", 
> "submissionTime": 1712169262131, "startTime": 1712169264134, "finishTime": 
> 1712173619983, "user": 
> "system:serviceaccount:spark-operator-02:spark-operator", "queue": 
> "root.queue-large", "state": "Completed", "rmID": "test-cluster", 
> "resourceUsage": 
> {"abc":{"memory":139178200478515200,"pods":1729129,"vcore":5183062000},"def":{"memory":113789789798400,"pods":1413,"vcore":4239000}},
>  "preemptedResource": {}}
>   {code}
> with the change, the new format is:
> {code:java}
>  2024-04-04T00:33:08.532Z INFO core.scheduler.application.usage 
> objects/application_summary.go:60   YK_APP_SUMMARY: {ApplicationID: 
> afa303d0-test-trino-sparksql--20240404-2-1, SubmissionTime: 1712190615461, 
> StartTime: 1712190617496, FinishTime: 1712190788532, User: 
> system:serviceaccount:spark-operator-01:spark-operator, Queue: 
> root.queue-large, State: Completed, RmID: test-cluster, ResourceUsage: 
> TrackedResource{UNKNOWN:pods=177,UNKNOWN:vcore=354000,UNKNOWN:memory=1431454089216},
>  PreemptedResource: TrackedResource{}, PlaceholderResource: 
> TrackedResource{}}{code}
> There are several incompatibilities:
> 1. the class name TrackedResource was not there before, now it is.
> 2. the instance type was outside the resource part before, now it's embedded
> 3. the instance type was reported correctly before the change, now it's 
> UNKNOWN
> #3 may be a different issue, but it's observed by us at the same time.
> I think we should change the format back to the original one, as this is an 
> incompatible change. What do you think [~wilfreds], [~pbacsko], [~ccondit]?
> Thanks.






[jira] [Resolved] (YUNIKORN-2527) Allow remove and re-add configured queue within cleanup time

2024-04-03 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2527.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Queues can now be removed and added back again within a cleanup cycle

> Allow remove and re-add configured queue within cleanup time 
> -
>
> Key: YUNIKORN-2527
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2527
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> When we remove a queue from the config it is marked for cleanup. If we re-add 
> the same queue in the config again before the cleanup gets executed the queue 
> still gets removed.
> reproduction:
>  * edit config map remove a queue, save
>  * immediately edit configmap add the same queue back, save
>  * wait for the cleanup to happen, queue should still exist after the fix






[jira] [Resolved] (YUNIKORN-2519) Remove bypass ACL check from placement rules

2024-04-03 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2519.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

refactor committed to master for 1.6.0

> Remove bypass ACL check from placement rules
> 
>
> Key: YUNIKORN-2519
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2519
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Instead of having every rule except the recovery rule return a flag to not 
> bypass the ACL check, special-case the recovery rule to bypass the checks.
> The recovery queue is created without ACLs or quota and is always a leaf queue. 
> The only rule that can return the recovery queue is the recovery rule, which 
> is the last one in the list.
> Use all these facts to simplify the placement processing.






[jira] [Updated] (YUNIKORN-2527) Allow remove and re-add configured queue within cleanup time

2024-04-02 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2527:

Description: 
When we remove a queue from the config it is marked for cleanup. If we re-add 
the same queue in the config again before the cleanup gets executed the queue 
still gets removed.

reproduction:
 * edit config map remove a queue, save
 * immediately edit configmap add the same queue back, save
 * wait for the cleanup to happen, queue should still exist after the fix

  was:
When we remove a queue from the config it is marked for cleanup. If we re-add 
the same queue in the config again before the cleanup gets executed the queue 
still gets removed.

reproduction:
 * edit config map remove a queue, save
 * immediately edit configmap add the same queue back, save
 * wait for the cleanup to happen


> Allow remove and re-add configured queue within cleanup time 
> -
>
> Key: YUNIKORN-2527
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2527
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> When we remove a queue from the config it is marked for cleanup. If we re-add 
> the same queue in the config again before the cleanup gets executed the queue 
> still gets removed.
> reproduction:
>  * edit config map remove a queue, save
>  * immediately edit configmap add the same queue back, save
>  * wait for the cleanup to happen, queue should still exist after the fix






[jira] [Created] (YUNIKORN-2527) Allow remove and re-add configured queue within cleanup time

2024-04-02 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2527:
---

 Summary: Allow remove and re-add configured queue within cleanup 
time 
 Key: YUNIKORN-2527
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2527
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - common
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When we remove a queue from the config it is marked for cleanup. If we re-add 
the same queue in the config again before the cleanup gets executed the queue 
still gets removed.

reproduction:
 * edit config map remove a queue, save
 * immediately edit configmap add the same queue back, save
 * wait for the cleanup to happen
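
The reproduction steps above can be modelled with a small sketch. Everything here is hypothetical (`queueRegistry`, `applyConfig`, `cleanup`) and far simpler than the core's queue handling; it assumes the fix amounts to clearing the cleanup mark when a queue shows up in the config again.

```go
package main

import "fmt"

// queueRegistry is a hypothetical model: known queues plus the set of
// queues that are marked for cleanup.
type queueRegistry struct {
	queues  map[string]bool
	pending map[string]bool
}

func newQueueRegistry(names ...string) *queueRegistry {
	r := &queueRegistry{queues: map[string]bool{}, pending: map[string]bool{}}
	for _, n := range names {
		r.queues[n] = true
	}
	return r
}

// applyConfig marks queues that are missing from the config for cleanup and
// clears the mark for queues that are present (again).
func (r *queueRegistry) applyConfig(configured ...string) {
	inConfig := make(map[string]bool, len(configured))
	for _, n := range configured {
		inConfig[n] = true
		r.queues[n] = true
		delete(r.pending, n) // re-added before the cleanup ran: keep it
	}
	for n := range r.queues {
		if !inConfig[n] {
			r.pending[n] = true
		}
	}
}

// cleanup removes every queue that is still marked.
func (r *queueRegistry) cleanup() {
	for n := range r.pending {
		delete(r.queues, n)
		delete(r.pending, n)
	}
}

func main() {
	r := newQueueRegistry("root.a", "root.b")
	r.applyConfig("root.a")           // remove root.b from the config
	r.applyConfig("root.a", "root.b") // add it back before the cleanup
	r.cleanup()
	fmt.Println("root.b still exists:", r.queues["root.b"])
}
```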






[jira] [Resolved] (YUNIKORN-2498) Implement force create flag in k8shim for recovery queue

2024-04-01 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2498.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

> Implement force create flag in k8shim for recovery queue
> 
>
> Key: YUNIKORN-2498
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2498
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> As part of the initialisation changes a new recovery queue was added to allow 
> already running allocations to be restored even if the queue config was 
> changed. The implementation on the k8shim side needs to be added to leverage 
> the forced create flag from YUNIKORN-1887.
> Without that the changes added for the recovery queue will not be used.






[jira] [Updated] (YUNIKORN-2522) Move e2e test doc from k8shim to website

2024-04-01 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2522:

Target Version: 1.6.0

> Move e2e test doc from k8shim to website
> 
>
> Key: YUNIKORN-2522
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2522
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: JiaChi Wang
>Assignee: JiaChi Wang
>Priority: Minor
>
> If we move the e2e doc to the website under the developer guide, it may be 
> easier for users to access.






[jira] [Commented] (YUNIKORN-2523) Bump go to 1.22

2024-04-01 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832772#comment-17832772
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2523:
-

Before we update the go version we need to at least have some confirmation that 
people have tried it. I have not run any builds or tests with Go 1.22 yet. The 
linter golangci-lint we run might also need updating to a later version to 
support 1.22 and make sure it works correctly. Changes in go have broken the 
linter a number of times over the last years.

With the new toolchain dependency checks we need to leave go.mod as is and just 
update the version file we have in the repo.

> Bump go to 1.22
> ---
>
> Key: YUNIKORN-2523
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2523
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Ryan Lo
>Assignee: Ryan Lo
>Priority: Major
>
> The latest go version, 1.22, was released this February.
> https://go.dev/doc/go1.22
> We should change to use the latest go version to build YK.






[jira] [Resolved] (YUNIKORN-2494) Revisit IsAtorAbove, WithIn, GetRemaining Guaranteed resources calculation

2024-03-28 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2494.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Functions added to the master code, not actively used yet.

> Revisit IsAtorAbove, WithIn, GetRemaining Guaranteed resources calculation
> --
>
> Key: YUNIKORN-2494
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2494
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - common
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> These 3 methods don't expose the actual guaranteed values and return a 
> boolean value based on the calculation. There are cases where these boolean 
> values are not correct, and there is also a need to know the actual guaranteed 
> values. For example: how much is remaining in Guaranteed? How much can be 
> preempted?






[jira] [Commented] (YUNIKORN-2519) Remove bypass ACL check from placement rules

2024-03-27 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831356#comment-17831356
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2519:
-

Logging of placements should be part of the app processing and not fall under 
the config; adding that to the refactor.

> Remove bypass ACL check from placement rules
> 
>
> Key: YUNIKORN-2519
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2519
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> Instead of having every rule except the recovery rule return a flag to not 
> bypass the ACL check, special-case the recovery rule to bypass the checks.
> The recovery queue is created without ACLs or quota and is always a leaf queue. 
> The only rule that can return the recovery queue is the recovery rule, which 
> is the last one in the list.
> Use all these facts to simplify the placement processing.






[jira] [Created] (YUNIKORN-2519) Remove bypass ACL check from placement rules

2024-03-27 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2519:
---

 Summary: Remove bypass ACL check from placement rules
 Key: YUNIKORN-2519
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2519
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Instead of having every rule except the recovery rule return a flag to not 
bypass the ACL check, special-case the recovery rule to bypass the checks.

The recovery queue is created without ACLs or quota and is always a leaf queue. 
The only rule that can return the recovery queue is the recovery rule, which is 
the last one in the list.

Use all these facts to simplify the placement processing.
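
A simplified sketch of the idea, not the core's placement code. The `rule` type, `placeApplication` and the `allowed` callback are hypothetical, the queue name is copied from how it is written elsewhere in this archive, and moving on to the next rule when access is denied is an assumption made to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// recoveryQueue is the recovery queue name as written in this archive.
const recoveryQueue = "root.@recover@"

// rule is a hypothetical placement rule: it returns a queue name, or an
// empty string when it does not apply to the application.
type rule struct {
	name  string
	place func(app string) string
}

// placeApplication walks the rules in order. No rule returns a bypass flag:
// the ACL check is skipped only when the result is the recovery queue.
func placeApplication(app string, rules []rule, allowed func(queue string) bool) (string, error) {
	for _, r := range rules {
		queue := r.place(app)
		if queue == "" {
			continue
		}
		if queue == recoveryQueue || allowed(queue) {
			return queue, nil
		}
	}
	return "", errors.New("application rejected: no placement rule matched")
}

func main() {
	rules := []rule{
		{name: "provided", place: func(string) string { return "root.restricted" }},
		{name: "recovery", place: func(string) string { return recoveryQueue }},
	}
	denyAll := func(string) bool { return false }
	fmt.Println(placeApplication("app-1", rules, denyAll))
}
```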






[jira] [Created] (YUNIKORN-2518) Allow recovery queue in REST requests

2024-03-27 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2518:
---

 Summary: Allow recovery queue in REST requests
 Key: YUNIKORN-2518
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2518
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common
Reporter: Wilfred Spiegelenburg


The current checks for the REST requests that require a queue path to be 
provided prevent looking at the {{root.@recover@}} queue.

The validator filters the queue names, which makes it impossible to check if 
the queue has any running applications or pods after initialisation using the 
REST requests.
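
One possible shape for the change, as a sketch only: the pattern below is an assumed, simplified queue path check and not the core's real validator, and `validQueuePath` is a hypothetical name. It shows a character filter that would reject the `@` in the recovery queue name, with an explicit exception for that one path.

```go
package main

import (
	"fmt"
	"regexp"
)

// recoveryQueuePath is the recovery queue as written in the issue text.
const recoveryQueuePath = "root.@recover@"

// queuePathPattern is an assumed, simplified filter for regular queue paths.
var queuePathPattern = regexp.MustCompile(`^[a-zA-Z0-9_.-]+$`)

// validQueuePath accepts regular queue paths and, as a special case, the
// recovery queue, which the character filter alone would reject.
func validQueuePath(path string) bool {
	if path == recoveryQueuePath {
		return true
	}
	return queuePathPattern.MatchString(path)
}

func main() {
	for _, p := range []string{"root.default", recoveryQueuePath, "root.b@d"} {
		fmt.Printf("%s valid: %t\n", p, validQueuePath(p))
	}
}
```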






[jira] [Commented] (YUNIKORN-2517) [Yunikorn] Incorrect Placeholder Count for Duplicate Task Groups in Gang scheduling

2024-03-26 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830813#comment-17830813
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2517:
-

This looks like a side effect of YUNIKORN-1931.

> [Yunikorn] Incorrect Placeholder Count for Duplicate Task Groups in Gang 
> scheduling
> ---
>
> Key: YUNIKORN-2517
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2517
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Rajesh Kanhaiya Lal
>Assignee: Manikandan R
>Priority: Major
>
> Hi Team, I am getting an incorrect placeholder count for the duplicate task 
> groups in gang scheduling.
> Example :
> {code:java}
> TaskGroups: []v1alpha1.TaskGroup{
>   {Name: "groupdup", MinMember: int32(3), 
> MinResource: map[string]resource.Quantity{
>   "cpu":resource.MustParse("10m"),
>   "memory": resource.MustParse("10M"),
>   }},
>   {Name: "groupdup", MinMember: int32(5), 
> MinResource: map[string]resource.Quantity{
>   "cpu":resource.MustParse("10m"),
>   "memory": resource.MustParse("10M"),
>   }},
>   {Name: "groupa", MinMember: int32(7), 
> MinResource: map[string]resource.Quantity{
>   "cpu":resource.MustParse("10m"),
>   "memory": resource.MustParse("10M"),
>   }},
>   }, {code}
> for the above config, we are getting a total of 17 pods (2 groups + 15 
> placeholders).
> It's adding the duplicate group placeholder as well.
> Could you please take a look?
> {code:java}
> gangjob-c805x-l4fx9                  1/1     Running   0          47s
> gangjob-c805x-tc8tr                  1/1     Running   0          47s
> tg-appid-oqina-groupa-1ap48pr4us     1/1     Running   0          45s
> tg-appid-oqina-groupa-25t5jubyzl     1/1     Running   0          45s
> tg-appid-oqina-groupa-6oxhqxnebc     1/1     Running   0          45s
> tg-appid-oqina-groupa-bqj9nk3mdq     1/1     Running   0          45s
> tg-appid-oqina-groupa-hugxbjb3xv     1/1     Running   0          45s
> tg-appid-oqina-groupa-o46k68fhw1     1/1     Running   0          45s
> tg-appid-oqina-groupa-vs5kxeop8z     1/1     Running   0          45s
> tg-appid-oqina-groupdup-786dl3gch2   1/1     Running   0          45s
> tg-appid-oqina-groupdup-877tnd4xdl   1/1     Running   0          45s
> tg-appid-oqina-groupdup-b7yef7w47x   1/1     Running   0          45s
> tg-appid-oqina-groupdup-cdqm1fcwbo   1/1     Running   0          45s
> tg-appid-oqina-groupdup-hlxwv9to9z   1/1     Running   0          45s
> tg-appid-oqina-groupdup-mvcd5pkijw   1/1     Running   0          45s
> tg-appid-oqina-groupdup-o4d9s8d02p   1/1     Running   0          45s
> tg-appid-oqina-groupdup-srrxrukstd   1/1     Running   0          45s {code}
>  
>  
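
The arithmetic behind the report can be checked with a short sketch. The `taskGroup` type and both functions are hypothetical; which duplicate definition should win is not stated in the report, so keeping the first one is purely an assumption here.

```go
package main

import "fmt"

// taskGroup is a reduced, hypothetical version of the task group definition.
type taskGroup struct {
	name      string
	minMember int32
}

// countAll sums minMember over every entry, duplicates included.
func countAll(groups []taskGroup) int32 {
	var total int32
	for _, g := range groups {
		total += g.minMember
	}
	return total
}

// countUnique counts each group name once, keeping the first definition.
func countUnique(groups []taskGroup) int32 {
	seen := make(map[string]bool, len(groups))
	var total int32
	for _, g := range groups {
		if seen[g.name] {
			continue
		}
		seen[g.name] = true
		total += g.minMember
	}
	return total
}

func main() {
	groups := []taskGroup{{"groupdup", 3}, {"groupdup", 5}, {"groupa", 7}}
	fmt.Println("all entries:", countAll(groups))
	fmt.Println("unique names, first wins:", countUnique(groups))
}
```

Summing all entries gives the 15 placeholders seen in the pod listing (8 for groupdup, 7 for groupa).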






[jira] [Updated] (YUNIKORN-2506) fix deprecation warning for fontsource-roboto

2024-03-19 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2506:

Summary: fix deprecation warning for fontsource-roboto  (was: fix )

> fix deprecation warning for fontsource-roboto
> -
>
> Key: YUNIKORN-2506
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2506
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: webapp
>Reporter: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: newbie
>
> When running make on the web UI project a deprecation warning is printed for 
> the fonts we include:
> {code:java}
>  WARN  deprecated fontsource-roboto@4.0.0: Package relocated. Please install 
> and migrate to @fontsource/roboto. {code}
> Move to {{@fontsource/roboto}} to fix the warning






[jira] [Created] (YUNIKORN-2506) fix

2024-03-19 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2506:
---

 Summary: fix 
 Key: YUNIKORN-2506
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2506
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: webapp
Reporter: Wilfred Spiegelenburg


When running make on the web UI project a deprecation warning is printed for 
the fonts we include:
{code:java}
 WARN  deprecated fontsource-roboto@4.0.0: Package relocated. Please install 
and migrate to @fontsource/roboto. {code}
Move to {{@fontsource/roboto}} to fix the warning






[jira] [Created] (YUNIKORN-2498) Implement force create flag in k8shim for recovery queue

2024-03-19 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2498:
---

 Summary: Implement force create flag in k8shim for recovery queue
 Key: YUNIKORN-2498
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2498
 Project: Apache YuniKorn
  Issue Type: Task
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


As part of the initialisation changes a new recovery queue was added to allow 
already running allocations to be restored even if the queue config was changed. 
The implementation on the k8shim side needs to be added to leverage the forced 
create flag from YUNIKORN-1887.

Without that the changes added for the recovery queue will not be used.






[jira] [Created] (YUNIKORN-2497) Update node.js to 18.19.1

2024-03-18 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2497:
---

 Summary: Update node.js to 18.19.1
 Key: YUNIKORN-2497
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2497
 Project: Apache YuniKorn
  Issue Type: Task
  Components: website
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Node 18.x is an LTS version. Version 18.17 has been superseded by two later 
releases, 18.18 and 18.19. Both have some CVE fixes which we should be 
including for stability.

Moving the build to 18.19 (currently 18.19.1).






[jira] [Resolved] (YUNIKORN-2496) Fix security issues in website javascript

2024-03-18 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2496.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Change committed, all dependabot alerts closed.

> Fix security issues in website javascript
> -
>
> Key: YUNIKORN-2496
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2496
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The change to pnpm triggered a large number of security alerts from 
> dependabot.
> 7 could be fixed directly by the 4 PRs opened by dependabot. 6 need manual 
> intervention.
> The change also included an upgrade of the Algolia search component to 3.x. 
> That change prevents running {{pnpm audit}}. 
> Docusaurus 3.x also contains a large number of backward incompatible changes 
> and an upgrade is planned separately. Using the Algolia 3.x dependency 
> already pushes some of these changes and should be reverted to Algolia 2.x, 
> the same as the rest of the Docusaurus environment.






[jira] [Commented] (YUNIKORN-2496) Fix security issues in website javascript

2024-03-17 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827831#comment-17827831
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2496:
-

When updating axios via pnpm it gets upgraded to 1.6.8. The build after that 
change does not work anymore. Forcing axios to move to 0.28 (from vulnerable 
0.25) fixes that issue.

> Fix security issues in website javascript
> -
>
> Key: YUNIKORN-2496
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2496
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The change to pnpm triggered a large number of security alerts from 
> dependabot.
> 7 could be fixed directly by the 4 PRs opened by dependabot. 6 need manual 
> intervention.
> The change also included an upgrade of the Algolia search component to 3.x. 
> That change prevents running {{pnpm audit}}. 
> Docusaurus 3.x also contains a large number of backward incompatible changes 
> and an upgrade is planned separately. Using the Algolia 3.x dependency 
> already pushes some of these changes and should be reverted to Algolia 2.x, 
> the same as the rest of the Docusaurus environment.






[jira] [Created] (YUNIKORN-2496) Fix security issues in website javascript

2024-03-17 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2496:
---

 Summary: Fix security issues in website javascript
 Key: YUNIKORN-2496
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2496
 Project: Apache YuniKorn
  Issue Type: Task
  Components: website
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The change to pnpm triggered a large number of security alerts from dependabot.

7 could be fixed directly by the 4 PRs opened by dependabot. 6 need manual 
intervention.

The change also included an upgrade of the Algolia search component to 3.x. 
That change prevents running {{pnpm audit}}. 
Docusaurus 3.x also contains a large number of backward incompatible changes 
and an upgrade is planned separately. Using the Algolia 3.x dependency already 
pushes some of these changes and should be reverted to Algolia 2.x, the same as 
the rest of the Docusaurus environment.






[jira] [Commented] (YUNIKORN-2490) Add new PMC and committer members

2024-03-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827340#comment-17827340
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2490:
-

I had made that change but forgot to push before the merge. I did it directly 
after that; all is correct now.

> Add new PMC and committer members
> -
>
> Key: YUNIKORN-2490
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2490
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> We have elected a new PMC member and some committers. Now that they have 
> accepted we should add them to the website.





