[jira] [Updated] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated YUNIKORN-2637:
--------------------------------------------
    Target Version: 1.6.0, 1.5.2  (was: 1.6.0)

> finalizePods should ignore pods like registerPods does
> ------------------------------------------------------
>
>          Key: YUNIKORN-2637
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: shim - kubernetes
>     Reporter: Wilfred Spiegelenburg
>     Assignee: Wilfred Spiegelenburg
>     Priority: Major
>       Labels: pull-request-available
>
> The initialisation code is a two step process for pods: first list all pods
> and add them to the system in registerPods(). This returns a list of pods
> processed.
> The second step happens after event handlers are turned on and nodes have
> been cleaned up etc. During the second step pods from the first step are
> checked and removed. However pods that were already in a terminated state in
> step 1 get removed again. Although the step should be idempotent this is
> unneeded. When iterating over the existing pods any pod in a terminal state
> should be skipped.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
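The skip proposed above can be sketched as follows. This is a minimal illustration, not YuniKorn code: `Pod` is a trimmed stand-in for the K8s pod object, and `finalizePods`/`isTerminal` are hypothetical helpers; the terminal phases `Succeeded` and `Failed` are the real K8s pod phases.

```go
package main

import "fmt"

// Pod is a trimmed stand-in for the K8s Pod object, used only for this sketch.
type Pod struct {
	Name  string
	Phase string // "Pending", "Running", "Succeeded", "Failed"
}

// isTerminal reports whether a pod has reached a terminal state and can be
// skipped during the second initialisation step.
func isTerminal(p Pod) bool {
	return p.Phase == "Succeeded" || p.Phase == "Failed"
}

// finalizePods mirrors the proposed behaviour: ignore pods that were already
// terminal in step 1 instead of removing them a second time.
func finalizePods(existing []Pod) []Pod {
	remaining := make([]Pod, 0, len(existing))
	for _, p := range existing {
		if isTerminal(p) {
			continue // already handled in registerPods(), nothing to do
		}
		remaining = append(remaining, p)
	}
	return remaining
}

func main() {
	pods := []Pod{{"a", "Running"}, {"b", "Succeeded"}, {"c", "Failed"}}
	fmt.Println(len(finalizePods(pods))) // prints 1: only "a" is left to process
}
```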
[jira] [Updated] (YUNIKORN-2665) Gang app originator pod changes after restart
[ https://issues.apache.org/jira/browse/YUNIKORN-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated YUNIKORN-2665:
--------------------------------------------
       Target Version: 1.6.0, 1.5.2
    Affects Version/s:     (was: 1.5.2)
             Priority: Critical  (was: Major)

The original originator pod will never be released when this happens. This leaks the application on the core side and multiple objects (the application, pod and tasks) on the k8shim side. Target for a backport to 1.5.2

> Gang app originator pod changes after restart
> ---------------------------------------------
>
>              Key: YUNIKORN-2665
>              URL: https://issues.apache.org/jira/browse/YUNIKORN-2665
>          Project: Apache YuniKorn
>       Issue Type: Bug
>       Components: shim - kubernetes
> Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.5.1
>         Reporter: Manikandan R
>         Assignee: Manikandan R
>         Priority: Critical
>
> A gang app chooses the first pod (the one that created the app) as the
> originator pod, which becomes the real driver pod later. While processing a
> gang app, specifically after the placeholder creation and in the process of
> replacement, a restart can lead to the incorrect behaviour described below:
> During restore, there is no guarantee on the ordering of pods coming from the
> K8s lister, especially when all the pods were created within the same second.
> K8s uses a second-based timestamp, which means all pods created within the
> same second have the same timestamp. In this situation, whichever pod comes
> first from the lister is designated by YK as the originator pod. So, any
> placeholder could become the originator pod and the actual originator pod is
> lost. This change could cause rippling effects leading to weird behaviour and
> needs to be fixed.
[jira] [Updated] (YUNIKORN-2652) Expand getApplication() endpoint handler to optionally return resource usage
[ https://issues.apache.org/jira/browse/YUNIKORN-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated YUNIKORN-2652:
--------------------------------------------
    Component/s: core - common
                     (was: scheduler-interface)

> Expand getApplication() endpoint handler to optionally return resource usage
> ----------------------------------------------------------------------------
>
>          Key: YUNIKORN-2652
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2652
>      Project: Apache YuniKorn
>   Issue Type: Improvement
>   Components: core - common
>     Reporter: Rich Scott
>     Priority: Major
>
> Some users would like to be able to see resource usage (preempted,
> placeholder resource, etc) for applications that have been completed. The
> `getApplication()` endpoint handler should be enhanced to take an optional
> parameter specifying that the user would like details about resources
> included in the response. A new `ApplicationXXXDAOInfo` object that is a
> slight superset of `ApplicationDAOInfo` should be introduced and used in the
> response.
[jira] [Commented] (YUNIKORN-2581) Expose running placement rules in REST
[ https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850945#comment-17850945 ]

Wilfred Spiegelenburg commented on YUNIKORN-2581:
-------------------------------------------------

Documentation PR opened

> Expose running placement rules in REST
> --------------------------------------
>
>          Key: YUNIKORN-2581
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
>      Project: Apache YuniKorn
>   Issue Type: New Feature
>   Components: core - common
>     Reporter: Wilfred Spiegelenburg
>     Assignee: Wilfred Spiegelenburg
>     Priority: Major
>       Labels: pull-request-available
>
> Since the introduction of the always-on placement rules and the recovery rule
> the queue config does not correctly show the running rules.
> Also, if a config update has been rejected, for any reason, the rules shown
> would not be correct.
> Exposing the configured rules from the placement manager works around all
> these issues.
[jira] [Created] (YUNIKORN-2655) Cleanup REST API documentation
Wilfred Spiegelenburg created YUNIKORN-2655:
-----------------------------------------------

             Summary: Cleanup REST API documentation
                 Key: YUNIKORN-2655
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2655
             Project: Apache YuniKorn
          Issue Type: Task
          Components: documentation
            Reporter: Wilfred Spiegelenburg

The REST API documentation is not up to date with the current behaviour as it does not show any 400 or 404 errors returned by a number of API calls. The error response only shows a 500 code with the same message for each call.

We should move to a simple list for each call showing the applicable errors like this:

{code:java}
### Error responses

**Code** : `400 Bad Request` (URL query is invalid, missing partition name)

**Code** : `404 Not Found` (Partition not found)

**Code** : `500 Internal Server Error`
{code}

Remove the error examples as they do not add any required detail.
[jira] [Created] (YUNIKORN-2654) Remove unused code in k8shim context
Wilfred Spiegelenburg created YUNIKORN-2654:
-----------------------------------------------

             Summary: Remove unused code in k8shim context
                 Key: YUNIKORN-2654
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2654
             Project: Apache YuniKorn
          Issue Type: Task
          Components: shim - kubernetes
            Reporter: Wilfred Spiegelenburg

The NotifyApplicationComplete and NotifyApplicationFail functions are not called by anything and are unused code. The K8shim does not trigger the application completion or failure. This is triggered by the core when the application no longer has any activity registered.
[jira] [Created] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance
Wilfred Spiegelenburg created YUNIKORN-2653:
-----------------------------------------------

             Summary: Gang scheduling K8s event formatting compliance
                 Key: YUNIKORN-2653
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: shim - kubernetes
            Reporter: Wilfred Spiegelenburg
            Assignee: Wilfred Spiegelenburg

The K8s events provide definitions and rules around the content of the fields within the event. Adjust the content of gang scheduling related events to comply with the rules. Focussed on the reason and action fields only.
* 'reason' is the reason this event is generated. 'reason' should be short and unique; it should be in UpperCamelCase format (starting with a capital letter).
* 'action' explains what action the ReportingController took regarding the object named in the event; it should be in UpperCamelCase format (starting with a capital letter). No spaces or long text.
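The UpperCamelCase rule above is mechanical enough to check in code. A possible sketch, assuming a simple regex is an acceptable approximation of the rule (the example field values are made up, not actual YuniKorn event reasons):

```go
package main

import (
	"fmt"
	"regexp"
)

// upperCamel matches UpperCamelCase values: a capital letter followed by
// letters and digits only, with no spaces or punctuation.
var upperCamel = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// validEventField checks a K8s event 'reason' or 'action' value against the
// formatting rules quoted in the issue description.
func validEventField(v string) bool {
	return upperCamel.MatchString(v)
}

func main() {
	fmt.Println(validEventField("PlaceholderTimedOut"))   // prints true
	fmt.Println(validEventField("placeholder timed out")) // prints false
}
```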
[jira] [Commented] (YUNIKORN-182) fix lint issues
[ https://issues.apache.org/jira/browse/YUNIKORN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850896#comment-17850896 ]

Wilfred Spiegelenburg commented on YUNIKORN-182:
------------------------------------------------

File a new Jira for this, it needs to be fixed in all the http servers we create in our code; those are spread over multiple repositories and all need to be checked:

{code:java}
pkg/cmd/admissioncontroller/main.go:143:15: G112: Potential Slowloris Attack because ReadHeaderTimeout is not configured in the http.Server (gosec)
{code}

This one should get an ignore from the lint side, we do not need crypto quality random here:

{code:java}
test/e2e/framework/helpers/common/utils.go:105:18: G404: Use of weak random number generator (math/rand instead of crypto/rand) (gosec)
	b[i] = letters[rand.Intn(len(letters))]
{code}

All the ineffective assigns and shadowing remarks can and should be fixed. Formatting issues can and should be fixed.

The function length ones are dubious and we probably should just add the {{//nolint:funlen}} remark on them, especially since they are almost all test functions.

> fix lint issues
> ---------------
>
>          Key: YUNIKORN-182
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-182
>      Project: Apache YuniKorn
>   Issue Type: Task
>   Components: build
>     Reporter: Wilfred Spiegelenburg
>     Assignee: Yun Sun
>     Priority: Minor
>       Labels: pull-request-available
>
> When we added the lint test most major issues were fixed. There are still a
> lot of issues, especially in tests, that need to be fixed.
> This is a container Jira to track that work on both the k8shim and the core
> repos.
> Work should be split into multiple parts (per linter?)
[jira] [Commented] (YUNIKORN-2581) Expose running placement rules in REST
[ https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850889#comment-17850889 ]

Wilfred Spiegelenburg commented on YUNIKORN-2581:
-------------------------------------------------

code change committed, working on documentation before closing

> Expose running placement rules in REST
> --------------------------------------
>
>          Key: YUNIKORN-2581
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
>      Project: Apache YuniKorn
>   Issue Type: New Feature
>   Components: core - common
>     Reporter: Wilfred Spiegelenburg
>     Assignee: Wilfred Spiegelenburg
>     Priority: Major
>       Labels: pull-request-available
>
> Since the introduction of the always-on placement rules and the recovery rule
> the queue config does not correctly show the running rules.
> Also, if a config update has been rejected, for any reason, the rules shown
> would not be correct.
> Exposing the configured rules from the placement manager works around all
> these issues.
[jira] [Commented] (YUNIKORN-2645) parent queue exceeds maximum resource
[ https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850562#comment-17850562 ]

Wilfred Spiegelenburg commented on YUNIKORN-2645:
-------------------------------------------------

The side effect of that broken node is that every single pod that we allocate will select that broken node. Based on the node sorting that node stays the first node in the list to try. Every single pod gets placed but then fails to start. The node usage does not change and thus the node does not get pushed back in the list of available nodes. Due to that the scheduler does not make any real progress. I would consider that a hung scheduler but there is nothing that I think we can do about that without some major changes.

A possible solution would be to, for instance, rate limit the number of pods we put on a node. Never schedule more than 10 pods per second on a node, including or ignoring failures, and when that limit is hit we skip the node. That would have made sure we try a couple of times and then try the next node. That could cause a slight delay when a cluster is almost full. It will also delay somewhat in an auto scaling cluster as the scheduler skips a node while the auto scaler does not...

> parent queue exceeds maximum resource
> -------------------------------------
>
>              Key: YUNIKORN-2645
>              URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
>          Project: Apache YuniKorn
>       Issue Type: Bug
>       Components: core - scheduler
> Affects Versions: 1.5.1
>         Reporter: Dmitry
>         Priority: Major
>      Attachments: yunikorn-logs.txt.gz
>
> We had a node broken in the cluster - kubernetes was creating pods which were
> immediately failing with "OutOfGPU" state. The node had 1000+ pods on it.
> The scheduler panicked with the log attached and was not scheduling any other
> pods.
> The config:
> {code:yaml}
> apiVersion: v1
> data:
>   admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: fixed
>             value: root.scavenging.osg
>             create: true
>             filter:
>               type: allow
>               users:
>                 - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: tag
>               value: namespace.parentqueue
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: fixed
>               value: general
>         nodesortpolicy:
>           type: fair
>           resourceweights:
>             vcore: 1.0
>             memory: 1.0
>             nvidia.com/gpu: 4.0
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fair
>             queues:
>               - name: system
>                 parent: true
>                 properties:
>                   preemption.policy: disabled
>               - name: general
>                 parent: true
>                 childtemplate:
>                   properties:
>                     application.sort.policy: fair
>                   resources:
>                     guaranteed:
>                       vcore: 100
>                       memory: 1Ti
>                       nvidia.com/gpu: 8
>                     max:
>                       vcore: 4000
>                       memory: 15Ti
>                       nvidia.com/gpu: 200
>               - name: scavenging
>                 parent: true
>                 childtemplate:
>                   resources:
>                     guaranteed:
>                       vcore: 1
>                       memory: 1G
>                       nvidia.com/gpu: 1
>                   properties:
>                     priority.offset: "-10"
>               - name: interactive
>                 parent: true
>                 childtemplate:
>                   resources:
>                     guaranteed:
>                       vcore: 1000
>                       memory: 10T
>                       nvidia.com/gpu: 48
>                       nvidia.com/a100: 4
>                   properties:
>                     priority.offset: "10"
>                     preemption.policy: disabled
>               - name: clemson
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 256
>                     memory: 2T
>                     nvidia.com/gpu: 24
>
[jira] [Created] (YUNIKORN-2648) Add deadlock detection config to the configmap
Wilfred Spiegelenburg created YUNIKORN-2648:
-----------------------------------------------

             Summary: Add deadlock detection config to the configmap
                 Key: YUNIKORN-2648
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2648
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: core - common
            Reporter: Wilfred Spiegelenburg

The current deadlock detection is configured using environment variables. That requires a change of the image and a restart of the scheduler to take effect and is not easy to maintain. We should be using the yunikorn-defaults config map for the settings.

We want a default set, turned off, for production use cases. However making the configs loadable from the config map makes turning detection on easier: update the configmap and restart the scheduler to turn the detection on or off.
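Reading such a setting from configmap data with an off-by-default fallback could look like the following sketch. The key name `service.enableDeadlockDetection` is hypothetical, chosen only for illustration:

```go
package main

import (
	"fmt"
	"strconv"
)

// boolFromConfigMap reads a boolean setting from configmap data, falling back
// to a default (off) when the key is absent or unparsable, so a missing or
// broken entry never turns detection on in production.
func boolFromConfigMap(data map[string]string, key string, def bool) bool {
	v, ok := data[key]
	if !ok {
		return def
	}
	b, err := strconv.ParseBool(v)
	if err != nil {
		return def
	}
	return b
}

func main() {
	data := map[string]string{"service.enableDeadlockDetection": "true"}
	fmt.Println(boolFromConfigMap(data, "service.enableDeadlockDetection", false)) // prints true
}
```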
[jira] [Created] (YUNIKORN-2647) Flaky test TestUpdateNodeCapacity
Wilfred Spiegelenburg created YUNIKORN-2647:
-----------------------------------------------

             Summary: Flaky test TestUpdateNodeCapacity
                 Key: YUNIKORN-2647
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2647
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: test - unit
            Reporter: Wilfred Spiegelenburg

Same as we saw in YUNIKORN-2573 the single node update test might fail:

{code:java}
--- FAIL: TestUpdateNodeCapacity (0.03s)
    operation_test.go:446: Expected partition resource map[memory:1 vcore:2], doesn't match with actual partition resource map[memory:1 vcore:2]
{code}

We calculate the delta resources when updating the node capacity and with that delta we update the resources in the partition. The test would fail with the following ordering, the same as for multiple nodes:

node.SetCapacity() -> waitForAvailableNodeResource() -> partitionInfo.GetTotalPartitionResource() -> partition.updatePartitionResource()
[jira] [Commented] (YUNIKORN-2646) Deadlock detected during preemption
[ https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850241#comment-17850241 ]

Wilfred Spiegelenburg commented on YUNIKORN-2646:
-------------------------------------------------

I have found a way to turn this specific detection off. I will create a PR for it a little later. Would be good to backport it to 1.5.2. I still need to think about the default we would use for this.

> Deadlock detected during preemption
> -----------------------------------
>
>          Key: YUNIKORN-2646
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2646
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: core - scheduler
>     Reporter: Dmitry
>     Assignee: Peter Bacsko
>     Priority: Major
>  Attachments: yunikorn-logs-lock.txt.gz
>
> Hitting deadlocks in 1.5.1
> The log is attached
[jira] [Commented] (YUNIKORN-2646) Deadlock detected during preemption
[ https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850240#comment-17850240 ]

Wilfred Spiegelenburg commented on YUNIKORN-2646:
-------------------------------------------------

For the analysis of the stack trace: [~pbacsko] is correct, we have seen this before and it is a false positive.

This lock order check points towards something like the following case:
* app A -> Allocate, trigger preemption -> check if app B can be a victim
* app B -> Allocate, trigger preemption -> check if app A can be a victim

Two points:
# the scheduling cycle is single threaded.
# the application triggering preemption is never a victim.

So how does that relate to the stack trace: the {{PartitionContext.tryAllocate}} calls shown in the logs are never running at the same time. Scheduling also does not run multiple go routines. The last point is that when leaving {{Application.tryAllocate}} for the next cycle, all locks that were held have been released. The next cycle could look at the same application again or might use a completely different one.

When building the victim list via {{Queue.FindEligiblePreemptionVictims}}, and the recursive version of that call, the queue of the application that triggered the preemption is filtered out. The lock held in {{Application.tryAllocate}} is on an application that cannot later be selected as a victim. If that were to occur scheduling would immediately stop at that point: we would never see a second instance of this stack trace in the deadlock logging. The lock taken on the application for scheduling is a write lock; getting a read lock on the same application would block.

We need to investigate how we can exclude this from the potential deadlock detection. The only option I can find at the moment is setting {{Opts.DisableLockOrderDetection}} for the detection code if you want to run this with preemption turned on.

> Deadlock detected during preemption
> -----------------------------------
>
>          Key: YUNIKORN-2646
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2646
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: core - scheduler
>     Reporter: Dmitry
>     Assignee: Peter Bacsko
>     Priority: Major
>  Attachments: yunikorn-logs-lock.txt.gz
>
> Hitting deadlocks in 1.5.1
> The log is attached
[jira] [Commented] (YUNIKORN-2646) Deadlock detected during preemption
[ https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850227#comment-17850227 ]

Wilfred Spiegelenburg commented on YUNIKORN-2646:
-------------------------------------------------

A flag in the queue config at the partition level: https://yunikorn.apache.org/docs/user_guide/queue_config#partitions

> Deadlock detected during preemption
> -----------------------------------
>
>          Key: YUNIKORN-2646
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2646
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: core - scheduler
>     Reporter: Dmitry
>     Assignee: Peter Bacsko
>     Priority: Major
>  Attachments: yunikorn-logs-lock.txt.gz
>
> Hitting deadlocks in 1.5.1
> The log is attached
[jira] [Commented] (YUNIKORN-2645) parent queue exceeds maximum resource
[ https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850226#comment-17850226 ]

Wilfred Spiegelenburg commented on YUNIKORN-2645:
-------------------------------------------------

Thank you [~dimm], the logs helped. The scheduler did not panic, as that would have shown a restart of the scheduler. It did log a message that should get your attention. If this happens your cluster and the scheduler are in a really bad state. We can only detect this and revert the changes, not fix it from the scheduler side. We keep on scheduling.

A panic would be caused by the logger and is expected when the logger runs in development mode. This is all linked to the DPANIC level. We use [DPANIC|https://pkg.go.dev/go.uber.org/zap#pkg-constants] in a couple of places. What that level does is log the error and then cause a panic if running in development mode. If not running in development mode you just see the message. The logger should never be running in development mode unless running as part of unit tests etc. If you see these messages with a DPANIC level in production you have a serious issue.

Some background on the {{OutOfCpu}} message from the node: there has been a change in the K8s 1.22 kubelet to fix some resource issues. That introduced an increased possibility of a race condition in the kubelet when scheduling short lived pods or pods that did not pass the node admission checks. A mitigation for that race condition was added in 1.22.4 but there are still complaints about it [regularly happening|https://github.com/kubernetes/kubernetes/issues/115325] even in the latest K8s versions with the default K8s scheduler. High pod churn, node scaling and deployment scaling all seem to be related triggers. The sig_node team has said that it is as good as it will get without causing the original issue to come back. They assessed that the original issue was far worse than this one.

> parent queue exceeds maximum resource
> -------------------------------------
>
>              Key: YUNIKORN-2645
>              URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
>          Project: Apache YuniKorn
>       Issue Type: Bug
>       Components: core - scheduler
> Affects Versions: 1.5.1
>         Reporter: Dmitry
>         Priority: Major
>      Attachments: yunikorn-logs.txt.gz
>
> We had a node broken in the cluster - kubernetes was creating pods which were
> immediately failing with "OutOfGPU" state. The node had 1000+ pods on it.
> The scheduler panicked with the log attached and was not scheduling any other
> pods.
> The config:
> {code:yaml}
> apiVersion: v1
> data:
>   admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: fixed
>             value: root.scavenging.osg
>             create: true
>             filter:
>               type: allow
>               users:
>                 - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: tag
>               value: namespace.parentqueue
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: fixed
>               value: general
>         nodesortpolicy:
>           type: fair
>           resourceweights:
>             vcore: 1.0
>             memory: 1.0
>             nvidia.com/gpu: 4.0
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fair
>             queues:
>               - name: system
>                 parent: true
>                 properties:
>                   preemption.policy: disabled
>               - name: general
>                 parent: true
>                 childtemplate:
>                   properties:
>                     application.sort.policy: fair
>                   resources:
>                     guaranteed:
>                       vcore: 100
>                       memory: 1Ti
>                       nvidia.com/gpu: 8
>                     max:
>                       vcore: 4000
>                       memory: 15Ti
>                       nvidia.com/gpu: 200
>               - name: scavenging
>                 parent: true
>                 childtemplate:
>                   resources:
>                     guaranteed:
>                       vcore: 1
>                       memory: 1G
>                       nvidia.com/gpu: 1
>                   properties:
>                     priority.offset: "-10"
>
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849878#comment-17849878 ]

Wilfred Spiegelenburg commented on YUNIKORN-2629:
-------------------------------------------------

I have walked through the draft PR. As a side effect it fixes recording an event on a node that has been rejected by the core from being added. That is a good change to have.

What I do not understand yet is why we need to have a lock on the context. Registering a node does not make any changes in the context. The only two things that make changes to the context are the applications and config map changes.

It feels like the context lock is used to synchronise changes in the schedulerCache: i.e. make sure that subsequent calls from the context into the cache see the same scheduler cache. If that really is the reason we should make sure that it is handled in the cache. Example: in the context the ForgetPod method calls GetPod on the cache and then ForgetPod on the cache. That should be one single call to ForgetPod in the cache removing the lookup, and the need for the context lock.

The fact that we now unlock the context while waiting for a response to come back makes me wonder if we need the context lock at all during that call stack.

> Adding a node can result in a deadlock
> --------------------------------------
>
>              Key: YUNIKORN-2629
>              URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>          Project: Apache YuniKorn
>       Issue Type: Bug
>       Components: shim - kubernetes
> Affects Versions: 1.5.0
>         Reporter: Peter Bacsko
>         Assignee: Peter Bacsko
>         Priority: Blocker
>           Labels: pull-request-available
>      Attachments: updateNode_deadlock_trace.txt
>
> Adding a new node after Yunikorn state initialization can result in a
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting
> for the {{NodeAccepted}} event:
> {noformat}
> 	dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
> 		nodeEvent, ok := event.(CachedSchedulerNodeEvent)
> 		if !ok {
> 			return
> 		}
> 		[...] removed for clarity
> 		wg.Done()
> 	})
> 	defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
>
> 	if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
> 		Nodes: nodesToRegister,
> 		RmID:  schedulerconf.GetSchedulerConf().ClusterID,
> 	}); err != nil {
> 		log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
> 		return nil, err
> 	}
>
> 	// wait for all responses to accumulate
> 	wg.Wait()  <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> 	go func() {
> 		for {
> 			select {
> 			case event := <-getDispatcher().eventChan:
> 				switch v := event.(type) {
> 				case events.TaskEvent:
> 					getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
> 				case events.ApplicationEvent:
> 					getEventHandler(EventTypeApp)(v)
> 				case events.SchedulerNodeEvent:
> 					getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets
> stuck, so {{registerNodes()}} will never progress.
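The pattern behind the fix discussed above — release the lock before blocking on the WaitGroup so handlers that need the lock can run — can be sketched in isolation. All names (`registry`, `addNodes`, `ack`) are hypothetical, not the shim's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a stand-in for the shim Context: a lock-protected map plus an
// operation that must wait for asynchronous acknowledgements.
type registry struct {
	sync.Mutex
	nodes map[string]bool
}

// addNodes mutates state under the lock, then releases it BEFORE blocking on
// the acknowledgements; handlers that need the same lock can therefore make
// progress instead of deadlocking against wg.Wait().
func (r *registry) addNodes(names []string, ack func(name string, done func())) {
	r.Lock()
	for _, n := range names {
		r.nodes[n] = false
	}
	r.Unlock() // do not hold the lock across wg.Wait()

	var wg sync.WaitGroup
	for _, n := range names {
		wg.Add(1)
		ack(n, wg.Done)
	}
	wg.Wait()

	r.Lock()
	for _, n := range names {
		r.nodes[n] = true // accepted
	}
	r.Unlock()
}

func main() {
	r := &registry{nodes: map[string]bool{}}
	// The ack callback takes the same lock, as the dispatcher handlers do;
	// because addNodes released it, this completes instead of deadlocking.
	r.addNodes([]string{"node-1"}, func(name string, done func()) {
		go func() {
			r.Lock()
			defer r.Unlock()
			done()
		}()
	})
	fmt.Println(r.nodes["node-1"]) // prints true
}
```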
[jira] [Commented] (YUNIKORN-2640) Consider removing config from Clients
[ https://issues.apache.org/jira/browse/YUNIKORN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849857#comment-17849857 ]

Wilfred Spiegelenburg commented on YUNIKORN-2640:
-------------------------------------------------

After the changes from YUNIKORN-2630 there is only one place left and we should really clean that up. Setting target for 1.6.0

> Consider removing config from Clients
> -------------------------------------
>
>          Key: YUNIKORN-2640
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2640
>      Project: Apache YuniKorn
>   Issue Type: Improvement
>     Reporter: Chia-Ping Tsai
>     Assignee: Chenchen Lai
>     Priority: Minor
>
> The config (`conf.SchedulerConf`) [0] references a global singleton object
> [1][2]. Also, in the code base `clients#GetConf()` is used 3 times [3] and
> `conf.GetSchedulerConf()` is used 61 times [4].
> It seems to me `clients#conf` should be removed to avoid confusion.
> [0] https://github.com/apache/yunikorn-k8shim/blob/master/pkg/client/clients.go#L42C8-L42C26
> [1] https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/plugin/scheduler_plugin.go#L291
> [2] https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/cmd/shim/main.go#L53
> [3] https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+GetConf%28%29=code
> [4] https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+conf.GetSchedulerConf%28%29=code
[jira] [Updated] (YUNIKORN-2640) Consider removing config from Clients
[ https://issues.apache.org/jira/browse/YUNIKORN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated YUNIKORN-2640:
--------------------------------------------
    Target Version: 1.6.0

> Consider removing config from Clients
> -------------------------------------
>
>          Key: YUNIKORN-2640
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-2640
>      Project: Apache YuniKorn
>   Issue Type: Improvement
>     Reporter: Chia-Ping Tsai
>     Assignee: Chenchen Lai
>     Priority: Minor
>
> The config (`conf.SchedulerConf`) [0] references a global singleton object
> [1][2]. Also, in the code base `clients#GetConf()` is used 3 times [3] and
> `conf.GetSchedulerConf()` is used 61 times [4].
> It seems to me `clients#conf` should be removed to avoid confusion.
> [0] https://github.com/apache/yunikorn-k8shim/blob/master/pkg/client/clients.go#L42C8-L42C26
> [1] https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/plugin/scheduler_plugin.go#L291
> [2] https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/cmd/shim/main.go#L53
> [3] https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+GetConf%28%29=code
> [4] https://github.com/search?q=repo%3Aapache%2Fyunikorn-k8shim+conf.GetSchedulerConf%28%29=code
[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848863#comment-17848863 ] Wilfred Spiegelenburg commented on YUNIKORN-2637: - The case we are solving here is the correct removal of a pod that was registered and then stopped. In this case, if the pod was assigned to a node it gets removed; this also includes removal from the core. In the case that it was not assigned to a node, the request gets removed. I think both core and k8shim are affected by this after looking at the details in YUNIKORN-2526. So I am not sure if that is the root cause of the difference... > finalizePods should ignore pods like registerPods does > -- > > Key: YUNIKORN-2637 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2637 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > > The initialisation code is a two step process for pods: first list all pods > and add them to the system in registerPods(). This returns a list of pods > processed. > The second step happens after event handlers are turned on and nodes have > been cleaned up etc. During the second step pods from the first step are > checked and removed. However pods that were already in a terminated state in > step 1 get removed again. Although the step should be idempotent this is > unneeded. When iterating over the existing pods any pod in a terminal state > should be skipped. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2631) Support canonical labels for queue/applicationId in Admission Controller
[ https://issues.apache.org/jira/browse/YUNIKORN-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2631: Labels: pull-request-available release-notes (was: pull-request-available) > Support canonical labels for queue/applicationId in Admission Controller > > > Key: YUNIKORN-2631 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2631 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Yu-Lin Chen >Assignee: Yu-Lin Chen >Priority: Major > Labels: pull-request-available, release-notes > > Admission controller adds applicationID and label to Pod if they are not > already set in the Pod. > According to the new policy defined in YUNIKORN-1351. > Admission Controller will change to patch canonical label/annotation in the > future releases. > * yunikorn.apache.org/app-id (Canonical Label) > * yunikorn.apache.org/queue (Canonical Label) > To avoid an upgrade problem where the admission controller gets started > first, AM needs to generate both canonical/non-canonical labels in 1.6.0. > (This ensures that the 1.5.0 scheduler could understand labels generated in > the 1.6.0 admission controller) In 1.7.0, we can switch to generating only > the canonical label in AM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848809#comment-17848809 ] Wilfred Spiegelenburg commented on YUNIKORN-2629: - Saw this in a test run locally, adding the deadlock trace that was printed for reference. > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > Attachments: updateNode_deadlock_trace.txt > > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) > if !ok { > return > } > [...] 
removed for clarity > wg.Done() > }) > defer dispatcher.UnregisterEventHandler(handlerID, > dispatcher.EventTypeNode) > if err := > ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({ > Nodes: nodesToRegister, > RmID: schedulerconf.GetSchedulerConf().ClusterID, > }); err != nil { > log.Log(log.ShimContext).Error("Failed to register nodes", > zap.Error(err)) > return nil, err > } > // wait for all responses to accumulate > wg.Wait() <--- shim gets stuck here > {noformat} > If tasks are being processed, then the dispatcher will try to retrieve the > evend handler, which is returned from Context: > {noformat} > go func() { > for { > select { > case event := <-getDispatcher().eventChan: > switch v := event.(type) { > case events.TaskEvent: > getEventHandler(EventTypeTask)(v) <--- > eventually calls Context.getTask() > case events.ApplicationEvent: > getEventHandler(EventTypeApp)(v) > case events.SchedulerNodeEvent: > getEventHandler(EventTypeNode)(v) > {noformat} > Since {{addNode()}} is holding a write lock, the event processing loop gets > stuck, so {{registerNodes()}} will never progress. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2629: Attachment: updateNode_deadlock_trace.txt > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > Attachments: updateNode_deadlock_trace.txt > > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) > if !ok { > return > } > [...] removed for clarity > wg.Done() > }) > defer dispatcher.UnregisterEventHandler(handlerID, > dispatcher.EventTypeNode) > if err := > ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({ > Nodes: nodesToRegister, > RmID: schedulerconf.GetSchedulerConf().ClusterID, > }); err != nil { > log.Log(log.ShimContext).Error("Failed to register nodes", > zap.Error(err)) > return nil, err > } > // wait for all responses to accumulate > wg.Wait() <--- shim gets stuck here > {noformat} > If tasks are being processed, then the dispatcher will try to retrieve the > evend handler, which is returned from Context: > {noformat} > go func() { > for { > select { > case event := <-getDispatcher().eventChan: > switch v := event.(type) { > case events.TaskEvent: > getEventHandler(EventTypeTask)(v) <--- > eventually calls Context.getTask() > case events.ApplicationEvent: > getEventHandler(EventTypeApp)(v) > case events.SchedulerNodeEvent: > getEventHandler(EventTypeNode)(v) > {noformat} > Since {{addNode()}} is holding a write lock, the event processing loop gets > 
stuck, so {{registerNodes()}} will never progress. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848807#comment-17848807 ] Wilfred Spiegelenburg commented on YUNIKORN-2521: - Collect the data and open a new Jira please. This Jira has been included in a release and will not be re-opened or worked on. The logs for the scheduler should show the details around the possible deadlock. > Scheduler deadlock > -- > > Key: YUNIKORN-2521 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2521 > Project: Apache YuniKorn > Issue Type: Bug >Affects Versions: 1.5.0 > Environment: Yunikorn: 1.5 > AWS EKS: v1.28.6-eks-508b6b3 >Reporter: Noah Yoshida >Assignee: Peter Bacsko >Priority: Critical > Fix For: 1.6.0, 1.5.1 > > Attachments: 0001-YUNIKORN-2539-core.patch, > 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, > 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, > 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, > 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, > goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, > goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, > goroutine-while-blocking.out, logs-potential-deadlock-2.txt, > logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, > profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, > running-logs.txt > > > Discussion on Yunikorn slack: > [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179] > Occasionally, Yunikorn will deadlock and prevent any new pods from starting. > All pods stay in Pending. There are no error logs inside of the Yunikorn > scheduler indicating any issue. > Additionally, the pods all have the correct annotations / labels from the > admission service, so they are at least getting put into k8s correctly. > The issue was seen intermittently on Yunikorn version 1.5 in EKS, using > version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes > are added and removed pretty frequently as we do ML workloads. > Attached is the goroutine dump. We were not able to get a statedump as the > endpoint kept timing out. > You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also > have to delete any "Pending" pods that got stuck while the scheduler was > deadlocked as well, for them to get picked up by the new scheduler pod. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848435#comment-17848435 ] Wilfred Spiegelenburg commented on YUNIKORN-2637: - The list of pods we iterate over in {{finalizePods()}} is the *original* list of pods that were processed and returned by the {{registerPods()}} call. In {{registerPods()}} we skip pods in a terminal state. Those pods do not get added to the context. {{AddPod()}} is never called for them. In {{finalizePods()}} we list the pods again and build a map. We then iterate over all pods returned from {{registerPods()}} and see if they are in the new list we just pulled from K8s. If we have a pod from the registered list that does not show up in the newly pulled list we remove the pod from the context. That last step, removing from the context, only makes sense if the pod was added to the context to start with. Pods that were in a terminal state during the {{registerPods()}} processing are not added and thus do not need to be removed as they cannot exist in the context. It does not matter what state they are in, in the newly pulled list. A pod in a terminal state cannot return to a running state ever. {quote}I don't think this is safe. The pod may have moved into a terminal state between registerPods() and finalizePods(). In that case, we may lose the transition and end up with a phantom pod still in the system. {quote} This is not the case that needs to be optimised as this does not happen at all with the current code. That is a bug in the code by itself that I did not even notice before. The newly pulled list of pods is not status checked. A pod that was running during {{registerPods()}} and now shows as terminated in {{finalizePods()}} shows up in both the map as well as the iteration and will thus not be removed. Just the existence check of the register list against the finalise list is not enough. 
For that case to work we either need: # filtering of the pods that we put in the finalised map (i.e. skip terminated pods) or # a comparison of the state of the pods during the iteration Option 1 is simplest as it just gives us a map of still running pods and anything that does not exist should be removed. I'll put up a PR that fixes both issues. > finalizePods should ignore pods like registerPods does > -- > > Key: YUNIKORN-2637 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2637 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Priority: Major > > The initialisation code is a two step process for pods: first list all pods > and add them to the system in registerPods(). This returns a list of pods > processed. > The second step happens after event handlers are turned on and nodes have > been cleaned up etc. During the second step pods from the first step are > checked and removed. However pods that were already in a terminated state in > step 1 get removed again. Although the step should be idempotent this is > unneeded. When iterating over the existing pods any pod in a terminal state > should be skipped. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
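Option 1 can be sketched as follows. This is an illustrative Go sketch, not the actual k8shim code: the simplified Pod type and the podsToRemove helper stand in for corev1.Pod and the real context plumbing.

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod; only the fields the
// sketch needs are modelled.
type Pod struct {
	UID   string
	Phase string // "Pending", "Running", "Succeeded", "Failed"
}

// isTerminal mirrors the check registerPods applies: Succeeded and
// Failed pods can never return to a running state.
func isTerminal(p Pod) bool {
	return p.Phase == "Succeeded" || p.Phase == "Failed"
}

// podsToRemove implements option 1: terminal pods are filtered out while
// building the map of freshly listed pods, so the map contains only
// still-running pods and a plain existence check is enough afterwards.
func podsToRemove(registered, listed []Pod) []string {
	alive := make(map[string]bool, len(listed))
	for _, p := range listed {
		if isTerminal(p) {
			continue // same filter registerPods applies
		}
		alive[p.UID] = true
	}
	var remove []string
	for _, p := range registered {
		if !alive[p.UID] {
			remove = append(remove, p.UID)
		}
	}
	return remove
}

func main() {
	registered := []Pod{{UID: "a", Phase: "Running"}, {UID: "b", Phase: "Running"}}
	// pod "b" terminated between registerPods() and finalizePods():
	listed := []Pod{{UID: "a", Phase: "Running"}, {UID: "b", Phase: "Failed"}}
	fmt.Println(podsToRemove(registered, listed)) // [b]
}
```

With the filter in place, a pod that terminated between the two steps falls out of the map and is removed, which covers the missed-transition case described in the comment as well.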
[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848056#comment-17848056 ] Wilfred Spiegelenburg commented on YUNIKORN-2637: - Comments in {{finalizePods()}} should be fixed at the same time as it points to nodes currently which is incorrect. > finalizePods should ignore pods like registerPods does > -- > > Key: YUNIKORN-2637 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2637 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Priority: Major > > The initialisation code is a two step process for pods: first list all pods > and add them to the system in registerPods(). This returns a list of pods > processed. > The second step happens after event handlers are turned on and nodes have > been cleaned up etc. During the second step pods from the first step are > checked and removed. However pods that were already in a terminated state in > step 1 get removed again. Although the step should be idempotent this is > unneeded. When iterating over the existing pods any pod in a terminal state > should be skipped. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2638) Simplify finalizeNodes and finalizePods
Wilfred Spiegelenburg created YUNIKORN-2638: --- Summary: Simplify finalizeNodes and finalizePods Key: YUNIKORN-2638 URL: https://issues.apache.org/jira/browse/YUNIKORN-2638 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg In finalizeNodes and finalizePods a map is created to store the newly retrieved pods and nodes. The map is only used as a reference and the pod and node objects themselves are not used. Instead of storing the objects, the maps could store a boolean value. This also simplifies the later existence check for the node or pod to just a single map lookup. We should also set the initial size of the map to the length of the node or pod list retrieved, to prevent any re-allocation while filling the map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
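The proposed change is a small Go idiom; a minimal sketch with an illustrative pod type (the real code uses the Kubernetes object types):

```go
package main

import "fmt"

type pod struct{ UID string }

// buildSet builds a presence set from the freshly listed objects.
// Pre-sizing the map with len(pods) avoids re-allocation while filling
// it, and storing a bool instead of the object keeps the later
// existence test to a single map lookup: set[uid] evaluates to false
// for any missing key, so no "comma ok" double check is needed.
func buildSet(pods []pod) map[string]bool {
	set := make(map[string]bool, len(pods))
	for _, p := range pods {
		set[p.UID] = true
	}
	return set
}

func main() {
	set := buildSet([]pod{{"a"}, {"b"}})
	fmt.Println(set["a"], set["missing"]) // true false
}
```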
[jira] [Updated] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2637: Target Version: 1.6.0 > finalizePods should ignore pods like registerPods does > -- > > Key: YUNIKORN-2637 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2637 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Priority: Major > > The initialisation code is a two step process for pods: first list all pods > and add them to the system in registerPods(). This returns a list of pods > processed. > The second step happens after event handlers are turned on and nodes have > been cleaned up etc. During the second step pods from the first step are > checked and removed. However pods that were already in a terminated state in > step 1 get removed again. Although the step should be idempotent this is > unneeded. When iterating over the existing pods any pod in a terminal state > should be skipped. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
Wilfred Spiegelenburg created YUNIKORN-2637: --- Summary: finalizePods should ignore pods like registerPods does Key: YUNIKORN-2637 URL: https://issues.apache.org/jira/browse/YUNIKORN-2637 Project: Apache YuniKorn Issue Type: Bug Components: shim - kubernetes Reporter: Wilfred Spiegelenburg The initialisation code is a two step process for pods: first list all pods and add them to the system in registerPods(). This returns a list of pods processed. The second step happens after event handlers are turned on and nodes have been cleaned up etc. During the second step pods from the first step are checked and removed. However pods that were already in a terminated state in step 1 get removed again. Although the step should be idempotent this is unneeded. When iterating over the existing pods any pod in a terminal state should be skipped. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2630) Release context lock in shim when processing config in the core
[ https://issues.apache.org/jira/browse/YUNIKORN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2630: Target Version: 1.6.0, 1.5.2 > Release context lock in shim when processing config in the core > --- > > Key: YUNIKORN-2630 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2630 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Labels: pull-request-available > > When an change comes in for a the configmaps we process the change under a > context lock as we need to merge the two configmaps. > We keep this lock even if all the work is done in the shim and processing has > been transferred to the core. This is unneeded as the core has its own > locking an serialisation of the changes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2630) Release context lock in shim when processing config in the core
Wilfred Spiegelenburg created YUNIKORN-2630: --- Summary: Release context lock in shim when processing config in the core Key: YUNIKORN-2630 URL: https://issues.apache.org/jira/browse/YUNIKORN-2630 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg When a change comes in for the configmaps, we process the change under a context lock as we need to merge the two configmaps. We keep this lock even when all the work is done in the shim and processing has been transferred to the core. This is unneeded as the core has its own locking and serialisation of the changes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
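The intended locking change can be sketched as follows; shim, merged and onConfigUpdate are illustrative names, not the actual k8shim identifiers. The merge of the two configmaps happens under the lock, but the call into the core does not, since the core serialises config changes itself.

```go
package main

import (
	"fmt"
	"sync"
)

// shim is an illustrative stand-in for the k8shim context.
type shim struct {
	lock sync.Mutex
	maps [2]map[string]string // the two configmaps kept by the shim
}

// merged snapshots and merges the two configmaps under the lock; the
// second map overrides the first, like overrides layered on defaults.
func (s *shim) merged() map[string]string {
	s.lock.Lock()
	defer s.lock.Unlock()
	out := map[string]string{}
	for _, m := range s.maps {
		for k, v := range m {
			out[k] = v
		}
	}
	return out
}

// onConfigUpdate merges under the lock, but calls into the core WITHOUT
// holding it: keeping the shim lock across the call would only block
// other context work for no benefit.
func (s *shim) onConfigUpdate(updateCore func(map[string]string) error) error {
	cfg := s.merged() // lock held only inside merged()
	return updateCore(cfg)
}

func main() {
	s := &shim{maps: [2]map[string]string{
		{"policy": "default", "interval": "1s"},
		{"policy": "fair"},
	}}
	_ = s.onConfigUpdate(func(cfg map[string]string) error {
		fmt.Println(cfg["policy"], cfg["interval"]) // fair 1s
		return nil
	})
}
```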
[jira] [Resolved] (YUNIKORN-2628) fix release announcement links
[ https://issues.apache.org/jira/browse/YUNIKORN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2628. - Fix Version/s: 1.6.0 Resolution: Fixed links are fixed after removing the {{..}} from the path > fix release announcement links > -- > > Key: YUNIKORN-2628 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2628 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Labels: pull-request-available > Fix For: 1.6.0 > > > In YUNIKORN-2595 a regression snuck in breaking the links to the release > announcements. > Need to reverse that path change for the release announcements. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847123#comment-17847123 ] Wilfred Spiegelenburg commented on YUNIKORN-2629: - I think we need to look at the context lock in the k8shim in general. The context lock is held while we do non-context work. There is no need to hold the lock if all we do is wait for a response that may or may not trigger post-processing. > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) > if !ok { > return > } > [...] 
removed for clarity > wg.Done() > }) > defer dispatcher.UnregisterEventHandler(handlerID, > dispatcher.EventTypeNode) > if err := > ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({ > Nodes: nodesToRegister, > RmID: schedulerconf.GetSchedulerConf().ClusterID, > }); err != nil { > log.Log(log.ShimContext).Error("Failed to register nodes", > zap.Error(err)) > return nil, err > } > // wait for all responses to accumulate > wg.Wait() <--- shim gets stuck here > {noformat} > If tasks are being processed, then the dispatcher will try to retrieve the > evend handler, which is returned from Context: > {noformat} > go func() { > for { > select { > case event := <-getDispatcher().eventChan: > switch v := event.(type) { > case events.TaskEvent: > getEventHandler(EventTypeTask)(v) <--- > eventually calls Context.getTask() > case events.ApplicationEvent: > getEventHandler(EventTypeApp)(v) > case events.SchedulerNodeEvent: > getEventHandler(EventTypeNode)(v) > {noformat} > Since {{addNode()}} is holding a write lock, the event processing loop gets > stuck, so {{registerNodes()}} will never progress. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2627) Add K8s 1.30 to the e2e matrix
[ https://issues.apache.org/jira/browse/YUNIKORN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2627. - Fix Version/s: 1.6.0 Resolution: Fixed Upgraded kind to version 0.23 and added 1.30 as a new version to test with > Add K8s 1.30 to the e2e matrix > -- > > Key: YUNIKORN-2627 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2627 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Wilfred Spiegelenburg >Assignee: Tseng Hsi-Huang >Priority: Major > Labels: newbie, pull-request-available > Fix For: 1.6.0 > > > k8s 1.30 support in kind is now available as part of the [0.23 > release|https://github.com/kubernetes-sigs/kind/releases/tag/v0.23.0] > Need to add 1.30 to the matrix for the next release -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2626) Add flag to helm chart to disable web container
[ https://issues.apache.org/jira/browse/YUNIKORN-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846827#comment-17846827 ] Wilfred Spiegelenburg commented on YUNIKORN-2626: - I have no strong feelings either way. The default should keep the web container enabled, but that is it. Create a PR to make it possible: charts are [here|https://github.com/wilfred-s/yunikorn-release/tree/master/helm-charts/yunikorn] > Add flag to helm chart to disable web container > --- > > Key: YUNIKORN-2626 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2626 > Project: Apache YuniKorn > Issue Type: New Feature > Components: deployment >Reporter: Michael >Priority: Major > > For our use case we only really need the admission controller and scheduler. > The helm chart does currently not provide a way to disable deploying the web > container and it would be great if that is possible. > Is there any reason not to disable the web container? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2628) fix release announcement links
[ https://issues.apache.org/jira/browse/YUNIKORN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2628: Description: In YUNIKORN-2595 a regression snuck in breaking the links to the release announcements. Need to reverse that path change for the release announcements. was: In YUNIKORN-2596 a regression snuck in breaking the links to the release announcements. Need to reverse that path change for the release announcements. > fix release announcement links > -- > > Key: YUNIKORN-2628 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2628 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > > In YUNIKORN-2595 a regression snuck in breaking the links to the release > announcements. > Need to reverse that path change for the release announcements. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2628) fix release announcement links
Wilfred Spiegelenburg created YUNIKORN-2628: --- Summary: fix release announcement links Key: YUNIKORN-2628 URL: https://issues.apache.org/jira/browse/YUNIKORN-2628 Project: Apache YuniKorn Issue Type: Task Components: website Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg In YUNIKORN-2596 a regression snuck in breaking the links to the release announcements. Need to reverse that path change for the release announcements. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2627) Add K8s 1.30 to the e2e matrix
[ https://issues.apache.org/jira/browse/YUNIKORN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846470#comment-17846470 ] Wilfred Spiegelenburg commented on YUNIKORN-2627: - We should also update from kind 0.20 to kind 0.23 as part of this change. https://github.com/apache/yunikorn-k8shim/blob/master/Makefile#L157-L159 > Add K8s 1.30 to the e2e matrix > -- > > Key: YUNIKORN-2627 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2627 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Wilfred Spiegelenburg >Priority: Major > Labels: newbie > > k8s 1.30 support in kind is now available as part of the [0.23 > release|https://github.com/kubernetes-sigs/kind/releases/tag/v0.23.0] > Need to add 1.30 to the matrix for the next release -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2627) Add K8s 1.30 to the e2e matrix
Wilfred Spiegelenburg created YUNIKORN-2627: --- Summary: Add K8s 1.30 to the e2e matrix Key: YUNIKORN-2627 URL: https://issues.apache.org/jira/browse/YUNIKORN-2627 Project: Apache YuniKorn Issue Type: Improvement Reporter: Wilfred Spiegelenburg k8s 1.30 support in kind is now available as part of the [0.23 release|https://github.com/kubernetes-sigs/kind/releases/tag/v0.23.0] Need to add 1.30 to the matrix for the next release -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2609) Improve visual style of the Web UI
[ https://issues.apache.org/jira/browse/YUNIKORN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846204#comment-17846204 ] Wilfred Spiegelenburg commented on YUNIKORN-2609: - Also, the "Logs" link on the application page: I don't think we have that. Or does that point to the allocation logs? In that case we might want to come up with a nice pictogram for that link instead of the text. > Improve visual style of the Web UI > -- > > Key: YUNIKORN-2609 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2609 > Project: Apache YuniKorn > Issue Type: Improvement > Components: webapp >Reporter: Denis Coric >Priority: Major > Labels: newbie > > Implement required CSS changes to tweak the overall look and feel of the web > UI. > The full design can be previewed on this link: [ > [DESIGN|https://xd.adobe.com/view/1d84899f-72a8-472f-b03f-de40451b0956-48d7/] > ] > This should include: > * Fix padding/margin values > * Add rounding on elements to match the design (menu selection, dropdowns, > etc) > * Fix font weight on visual elements to match the design > _Note: Queues page can be skipped as it is being redesigned in YUNIKORN-2341_ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2609) Improve visual style of the Web UI
[ https://issues.apache.org/jira/browse/YUNIKORN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846199#comment-17846199 ] Wilfred Spiegelenburg commented on YUNIKORN-2609: - The design looks OK to me. I do have a question around the resources: it was recently expanded to show more than just memory and CPU. How does that change affect the design that is shown in the link? Do areas expand and collapse correctly when the list of resources becomes larger (especially for nodes, but any object is affected)? Most nodes will show 7+ resource types as allocatable and used etc. Some detail is in https://github.com/apache/yunikorn-web/pull/146 > Improve visual style of the Web UI > -- > > Key: YUNIKORN-2609 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2609 > Project: Apache YuniKorn > Issue Type: Improvement > Components: webapp >Reporter: Denis Coric >Priority: Major > Labels: newbie > > Implement required CSS changes to tweak the overall look and feel of the web > UI. > The full design can be previewed on this link: [ > [DESIGN|https://xd.adobe.com/view/1d84899f-72a8-472f-b03f-de40451b0956-48d7/] > ] > This should include: > * Fix padding/margin values > * Add rounding on elements to match the design (menu selection, dropdowns, > etc) > * Fix font weight on visual elements to match the design > _Note: Queues page can be skipped as it is being redesigned in YUNIKORN-2341_ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2531) Create unit tests for AsyncRMCallback
[ https://issues.apache.org/jira/browse/YUNIKORN-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2531. - Fix Version/s: 1.6.0 Resolution: Fixed new tests added to the system to improve coverage > Create unit tests for AsyncRMCallback > - > > Key: YUNIKORN-2531 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2531 > Project: Apache YuniKorn > Issue Type: Test > Components: shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > There are no unit tests for the {{AsyncRMCallback}} type in the shim > (scheduler_callback.go). It's tested indirectly but we have no idea about the > coverage or how it behaves in rare scenarios. > At least longer methods such as {{UpdateApplication()}}, > {{UpdateAllocation()}} and {{UpdateNode()}} should be covered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2615) Remove named returns from predicate_manager.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2615. - Fix Version/s: 1.6.0 Resolution: Fixed refactor committed to master for 1.6.0 > Remove named returns from predicate_manager.go > -- > > Key: YUNIKORN-2615 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2615 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > Predicate manager has defined named returns on some functions but does not > use them. They should be removed as the way they are used can cause issues > that are hard to debug. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2618) Streamline AsyncRMCallback UpdateAllocation
Wilfred Spiegelenburg created YUNIKORN-2618: --- Summary: Streamline AsyncRMCallback UpdateAllocation Key: YUNIKORN-2618 URL: https://issues.apache.org/jira/browse/YUNIKORN-2618 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg If a task is not found, a nil is returned from {{context.getTask}}. In {{response.New}} processing we should just log that fact and proceed to the next alloc. This simplifies the flow as we never need to check for a nil task: we should never have a pod in the cache that does not exist as a task on an application. We retrieve the application using the application ID from the response but never use the object; we only use the application ID to pass into an event. The context event handler then does the exact same lookup again to process the event on the app. We need to become much smarter in this area: we do double or triple lookups and generate async events that just change the state of the app or task or kick off another event. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2616) Remove unused bool return from PreemptionPredicates()
Wilfred Spiegelenburg created YUNIKORN-2616: --- Summary: Remove unused bool return from PreemptionPredicates() Key: YUNIKORN-2616 URL: https://issues.apache.org/jira/browse/YUNIKORN-2616 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg The predicate manager method {{PreemptionPredicates()}} returns two values: an int and a boolean. The boolean is false if the integer is -1 and true for 0 or larger. There is no need for the boolean as the -1 already indicates the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2615) Remove named returns from predicate_manager.go
Wilfred Spiegelenburg created YUNIKORN-2615: --- Summary: Remove named returns from predicate_manager.go Key: YUNIKORN-2615 URL: https://issues.apache.org/jira/browse/YUNIKORN-2615 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Predicate manager has defined named returns on some functions but does not use them. They should be removed as the way they are used can cause issues that are hard to debug. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2601) Update kindest/node: v1.29.1 to v1.29.2, v1.28.6 to v1.28.7, v1.27.10 to v1.27.11, v1.26.13 -> v1.26.14
[ https://issues.apache.org/jira/browse/YUNIKORN-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2601. - Fix Version/s: 1.6.0 Resolution: Fixed Changes committed. No Kind image for 1.30 is available yet; we should log a new Jira to add it later. > Update kindest/node: v1.29.1 to v1.29.2, v1.28.6 to v1.28.7, v1.27.10 to > v1.27.11, v1.26.13 -> v1.26.14 > > > Key: YUNIKORN-2601 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2601 > Project: Apache YuniKorn > Issue Type: Improvement > Components: test - e2e >Reporter: Chia-Ping Tsai >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > as title -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2591) Document placement rules always
[ https://issues.apache.org/jira/browse/YUNIKORN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2591. - Fix Version/s: 1.5.1 1.5.0 1.4.0 Resolution: Fixed Change made to the docs going back to 1.4.0, 1.5.0. Will be part of the 1.5.1 release also. > Document placement rules always > --- > > Key: YUNIKORN-2591 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2591 > Project: Apache YuniKorn > Issue Type: Improvement > Components: documentation >Reporter: Wilfred Spiegelenburg >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Critical > Labels: pull-request-available > Fix For: 1.5.1, 1.5.0, 1.4.0 > > > The current [doc > says|https://yunikorn.apache.org/docs/user_guide/queue_config#placement-rules]: > {quote}If no rules are defined the placement manager is not started and each > application _must_ have a queue set on submit. > {quote} > This is not correct, we moved to placement rules always in YUNIKORN-1793 in > YuniKorn 1.4. The documentation needs to be updated to reflect that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2596) Enhance layout for release announcements
[ https://issues.apache.org/jira/browse/YUNIKORN-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2596. - Fix Version/s: 1.5.1 Resolution: Fixed Fixed and published: changes applied to the 1.5.0 layout before the 1.5.1 release. Marking as fixed in 1.5.1. > Enhance layout for release announcements > > > Key: YUNIKORN-2596 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2596 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Fix For: 1.5.1 > > Attachments: release_announce.png, releasee_announce_updated.png > > > The current release announcements page lacks a decent layout. The page is > generated during the build based on the directory content. > Some simple updates would make the page more readable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2595) Fix download page links
[ https://issues.apache.org/jira/browse/YUNIKORN-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2595. - Fix Version/s: 1.5.1 Resolution: Fixed Download page fixed for 1.5.0, deployed before the 1.5.1 release. Marking as fixed in 1.5.1. > Fix download page links > --- > > Key: YUNIKORN-2595 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2595 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Fix For: 1.5.1 > > > The download links must follow a specific set of rules as specified > [here|https://infra.apache.org/release-download-pages.html]. > We currently do not set the correct download link for the source package. We > dropped the closer.lua resolution for the content network in one of the > releases. With the next release, 1.5.1, coming up we need to fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2595) Fix download page links
[ https://issues.apache.org/jira/browse/YUNIKORN-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2595: Priority: Minor (was: Major) > Fix download page links > --- > > Key: YUNIKORN-2595 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2595 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > > The download links must follow a specific set of rules as specified > [here|https://infra.apache.org/release-download-pages.html]. > We currently do not set the correct download link for the source package. We > dropped the closer.lua resolution for the content network in one of the > releases. With the next release, 1.5.1, coming up we need to fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2595) Fix download page links
Wilfred Spiegelenburg created YUNIKORN-2595: --- Summary: Fix download page links Key: YUNIKORN-2595 URL: https://issues.apache.org/jira/browse/YUNIKORN-2595 Project: Apache YuniKorn Issue Type: Task Components: website Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg The download links must follow a specific set of rules as specified [here|https://infra.apache.org/release-download-pages.html]. We currently do not set the correct download link for the source package. We dropped the closer.lua resolution for the content network in one of the releases. With the next release, 1.5.1, coming up we need to fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2593) Simplify partition name
[ https://issues.apache.org/jira/browse/YUNIKORN-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841875#comment-17841875 ] Wilfred Spiegelenburg commented on YUNIKORN-2593: - We need to be careful here. This now forces a unique partition to be specified by all shims that register. If that is not the case we break. One shim would overwrite the partition setup of a second shim. The first part of the "full" partition name is the hostname and port of the shim that registers the partition allowing remote shims to be identified. If you are going to do this we might as well drop the whole multi & remote shim and multi partition design. Which would mean moving to one repository removing the SI etc along the way. I don't think that is a good idea. What I do not understand is why do we have partition anywhere in the scheduler objects? With objects I refer to anything like application, ask or node etc. Those cannot belong to anything but one partition and are only referenced from that one partition. They should not have the partition details as part of the object. It is redundant information taking up memory. A simple remove of the partition name from all these objects should suffice. BTW: The webservice broke the whole remote and multi shim idea when it was setup and we never got around to fixing that. We do not want to break it further. > Simplify partition name > --- > > Key: YUNIKORN-2593 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2593 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > > Currently, partition names are treated differently in different places within > the core. Specifically, sometimes they are bare (i.e. "default") and other > places they are composite (i.e. "[rm:123]default"). This is confusing and > unnecessary. 
It also hampers efforts to merge the AllocationAsk and > Allocation objects, as the semantics are different between them. Switch to > using bare form ("default") everywhere instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2591) Document placement rules always
Wilfred Spiegelenburg created YUNIKORN-2591: --- Summary: Document placement rules always Key: YUNIKORN-2591 URL: https://issues.apache.org/jira/browse/YUNIKORN-2591 Project: Apache YuniKorn Issue Type: Improvement Components: documentation Reporter: Wilfred Spiegelenburg The current [doc says|https://yunikorn.apache.org/docs/user_guide/queue_config#placement-rules]: {quote}If no rules are defined the placement manager is not started and each application _must_ have a queue set on submit. {quote} This is not correct, we moved to placement rules always in YUNIKORN-1793 in YuniKorn 1.4. The documentation needs to be updated to reflect that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2590) Handler tests should check for nil request on create
Wilfred Spiegelenburg created YUNIKORN-2590: --- Summary: Handler tests should check for nil request on create Key: YUNIKORN-2590 URL: https://issues.apache.org/jira/browse/YUNIKORN-2590 Project: Apache YuniKorn Issue Type: Improvement Components: core - common, test - unit Reporter: Wilfred Spiegelenburg In the handler_test.go file we have an anti-pattern showing a large number (40+) of warnings in an IDE: {quote}'req' might have 'nil' or other unexpected value as its corresponding error variable might be not 'nil' {quote} The warnings are due to the fact that we have the following pattern: {code:java} req, err = http.NewRequest("GET", "path", strings.NewReader("")) req = req.WithContext(context.WithValue(req.Context(), httprouter.ParamsKey, httprouter.Params{})){code} There is no error assertion after the request creation. We should add a simple {{assert.NilError(t, err, "HTTP request create failed")}} between creating and using the request. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds
[ https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840296#comment-17840296 ] Wilfred Spiegelenburg commented on YUNIKORN-2580: - Work towards one object for allocations and asks is in progress. YUNIKORN-2457 is open and actively worked on, which means the whole ask object is going through major changes soon. At that point things that are no longer needed or were never used will disappear automatically. Doing this one field at a time causes extra churn and makes it more difficult to track the how and why. > Remove executionTimeoutMilliSeconds > --- > > Key: YUNIKORN-2580 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2580 > Project: Apache YuniKorn > Issue Type: Improvement > Components: scheduler-interface >Reporter: Chia-Ping Tsai >Priority: Minor > > [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211] > It is not used actually, and hence we should either remove it or add facility > for it. Personally, I'd like to remove it to simplify the interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2581) Expose running placement rules in REST
Wilfred Spiegelenburg created YUNIKORN-2581: --- Summary: Expose running placement rules in REST Key: YUNIKORN-2581 URL: https://issues.apache.org/jira/browse/YUNIKORN-2581 Project: Apache YuniKorn Issue Type: New Feature Components: core - common Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Since introducing the use of placement rules always and the recovery rule, the queue config does not correctly show the running rules. Also, if a config update has been rejected, for any reason, the rules would not be correct. Exposing the configured rules from the placement manager works around all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2575) Make logging for IsPodFitNode clear
[ https://issues.apache.org/jira/browse/YUNIKORN-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2575. - Fix Version/s: 1.6.0 Resolution: Fixed unique errors are returned for all failure cases which at DEBUG level will show exactly why the failure occurred. > Make logging for IsPodFitNode clear > --- > > Key: YUNIKORN-2575 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2575 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > The logging in {{IsPodFitNode()}} logs the same message for a missing pod and > node. We should log clearly which thing is missing: the node or the pod. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds
[ https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840250#comment-17840250 ] Wilfred Spiegelenburg edited comment on YUNIKORN-2580 at 4/24/24 12:05 AM: --- This is used for the placeholder timeout and cannot be removed. See handleSubmitApplicationEvent [here|https://github.com/apache/yunikorn-k8shim/blob/741c0d801ac4530669b8850706efe3f0bc0d5718/pkg/cache/application.go#L437] was (Author: wifreds): This is used for the placeholder timeout and cannot be removed. > Remove executionTimeoutMilliSeconds > --- > > Key: YUNIKORN-2580 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2580 > Project: Apache YuniKorn > Issue Type: Improvement > Components: scheduler-interface >Reporter: Chia-Ping Tsai >Priority: Minor > > [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211] > It is not used actually, and hence we should either remove it or add facility > for it. Personally, I'd like to remove it to simplify the interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds
[ https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2580. - Resolution: Won't Fix This is used for the placeholder timeout and cannot be removed. > Remove executionTimeoutMilliSeconds > --- > > Key: YUNIKORN-2580 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2580 > Project: Apache YuniKorn > Issue Type: Improvement > Components: scheduler-interface >Reporter: Chia-Ping Tsai >Priority: Minor > > [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211] > It is not used actually, and hence we should either remove it or add facility > for it. Personally, I'd like to remove it to simplify the interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-2577) Remove named returns from IsPodFitNodeViaPreemption
[ https://issues.apache.org/jira/browse/YUNIKORN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840149#comment-17840149 ] Wilfred Spiegelenburg edited comment on YUNIKORN-2577 at 4/23/24 5:13 PM: -- BTW: not sure why {{GetPodNoLock}} returns two values. The pod is nil if the boolean is false, the pod is not nil if the boolean is true. The signature can be simplified to just returning the pod. Should probably be a new jira. edit: logged YUNIKORN-2578 for the refactor was (Author: wifreds): BTW: not sure why {{GetPodNoLock}} returns two values. The pod is nil if the boolean is false, pod is not nil if boolean is true The signature can be simplified to just returning the pod. Should probably be a new jira. > Remove named returns from IsPodFitNodeViaPreemption > --- > > Key: YUNIKORN-2577 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2577 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Minor > Labels: newbie > > IsPodFitNodeViaPreemption has defined named returns but does not use them. > They should be removed as the way they are used can cause issues that are > hard to debug. > As part of this change we need to further cleanup: > * The variable {{ok}} also gets shadowed multiple times, not just from the > named return declaration. > * The if construct around {{GetPodNoLock()}} is not needed as it returns a > nil for the pod if it returns false. Just adding the result for the pod > always has the same effect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2578) Refactor SchedulerCache.GetPod() remove bool return
Wilfred Spiegelenburg created YUNIKORN-2578: --- Summary: Refactor SchedulerCache.GetPod() remove bool return Key: YUNIKORN-2578 URL: https://issues.apache.org/jira/browse/YUNIKORN-2578 Project: Apache YuniKorn Issue Type: Task Components: shim - kubernetes Reporter: Wilfred Spiegelenburg SchedulerCache {{GetPod()}} and {{GetPodNoLock()}} return two values: a {{*v1.Pod}} and a bool. The boolean value is redundant as it is false if the pod is not found and a nil is returned for the pod. The boolean is true if the pod has a value. Testing for a nil pod has the same result. We do not cache a nil pod in the cache for a pod UID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2577) Remove named returns from IsPodFitNodeViaPreemption
Wilfred Spiegelenburg created YUNIKORN-2577: --- Summary: Remove named returns from IsPodFitNodeViaPreemption Key: YUNIKORN-2577 URL: https://issues.apache.org/jira/browse/YUNIKORN-2577 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg IsPodFitNodeViaPreemption has defined named returns but does not use them. They should be removed as the way they are used can cause issues that are hard to debug. As part of this change we need to further cleanup: * The variable {{ok}} also gets shadowed multiple times, not just from the named return declaration. * The if construct around {{GetPodNoLock()}} is not needed as it returns a nil for the pod if it returns false. Just adding the result for the pod always has the same effect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2576) Data Race: Flaky tests in dispatcher_test.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839986#comment-17839986 ] Wilfred Spiegelenburg commented on YUNIKORN-2576: - The panic that triggered the race in the test you logged shows that we have a bigger problem than just a race condition in this test at the moment. > Data Race: Flaky tests in dispatcher_test.go > > > Key: YUNIKORN-2576 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2576 > Project: Apache YuniKorn > Issue Type: Bug > Components: test - unit >Reporter: Yu-Lin Chen >Priority: Major > Attachments: shim-race.txt > > > How to reproduce: > # In Shim, run 'go test ./pkg/... -race -count=10 > shim-race.txt' > > {code:java} > WARNING: DATA RACE > Write at 0x035315e0 by goroutine 88: > github.com/apache/yunikorn-k8shim/pkg/dispatcher.initDispatcher() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:73 > +0x2c4 > github.com/apache/yunikorn-k8shim/pkg/dispatcher.createDispatcher() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:305 > +0x2f > runtime.Goexit() > /usr/local/go/src/runtime/panic.go:626 +0x5d > testing.(*T).FailNow() > :1 +0x31 > gotest.tools/v3/assert.Equal() > > /home/chenyulin0719/go/pkg/mod/gotest.tools/v3@v3.5.1/assert/assert.go:205 > +0x1aa > github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:244 > +0x2ba > testing.tRunner() > /usr/local/go/src/testing/testing.go:1689 +0x21e > testing.(*T).Run.gowrap1() > /usr/local/go/src/testing/testing.go:1742 +0x44Previous read at > 0x035315e0 by goroutine 90: > > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.func1() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:188 > +0x2f5 > > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.gowrap1() > > 
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:197 > +0x6eGoroutine 88 (running) created at: > testing.(*T).Run() > /usr/local/go/src/testing/testing.go:1742 +0x825 > testing.runTests.func1() > /usr/local/go/src/testing/testing.go:2161 +0x85 > testing.tRunner() > /usr/local/go/src/testing/testing.go:1689 +0x21e > testing.runTests() > /usr/local/go/src/testing/testing.go:2159 +0x8be > testing.(*M).Run() > /usr/local/go/src/testing/testing.go:2027 +0xf17 > main.main() > _testmain.go:55 +0x2bdGoroutine 90 (running) created at: > > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:178 > +0x391 > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).dispatch() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:164 > +0xbb > github.com/apache/yunikorn-k8shim/pkg/dispatcher.Dispatch() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:142 > +0x71 > github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:232 > +0x244 > testing.tRunner() > /usr/local/go/src/testing/testing.go:1689 +0x21e > testing.(*T).Run.gowrap1() > /usr/local/go/src/testing/testing.go:1742 +0x44 > =={code} > Root Cause: > * The [global > variables|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L46-L51] > in dispatcher.go are not protected when running unit tests. Each unit test > will run initDispatcher() through > [createDispatcher()|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatch_test.go#L305]. > * Race occurs if any other unit tests read/write the global variables before > or after initDispatcher(). 
ex: TestDispatchTimeout() > [https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L188] > > Solution to be discussed: > # Refactor dispatcher.go and encapsulate the global variables in a Dispatcher > struct, changing Dispatcher.Start(), Dispatcher.Stop() to type methods > # Implement Singleton in getDispatcher() and add a new function > newDispatcher() > # Create a new Dispatcher for each unit test > > The race issue only happens in unit tests because the shared variable was > protected by > once.Do(initDispatcher) in dispatcher.go: >
[jira] [Commented] (YUNIKORN-2576) Data Race: Flaky tests in dispatcher_test.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839985#comment-17839985 ] Wilfred Spiegelenburg commented on YUNIKORN-2576: - When I run this, only the first run passes and all other 9 runs fail with an assertion failure like this: {code:java} dispatch_test.go:244: assertion failed: 10 (int32) != 1 (int32) {code} This test was never designed to be run multiple times, as those global var values are not reset during cleanup. Other tests in the same file also break as they expect a 0 value for the async count when they start. That again is only true for the first run, not for runs 2..10. I do see a data race but the race is triggered by {{TestExceedAsyncDispatchLimit()}}. A further point is that we should not use {{atomic.AddInt32(&asyncDispatchCount, 1)}} but the {{atomic.Int32}} type introduced in Go 1.19, calling {{asyncDispatchCount.Add(1)}}. Not sure if this requires a full refactor of the dispatcher or whether these tests need to be fixed to be able to handle multiple runs correctly. > Data Race: Flaky tests in dispatcher_test.go > > > Key: YUNIKORN-2576 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2576 > Project: Apache YuniKorn > Issue Type: Bug > Components: test - unit >Reporter: Yu-Lin Chen >Priority: Major > Attachments: shim-race.txt > > > How to reproduce: > # In Shim, run 'go test ./pkg/... 
-race -count=10 > shim-race.txt' > > {code:java} > WARNING: DATA RACE > Write at 0x035315e0 by goroutine 88: > github.com/apache/yunikorn-k8shim/pkg/dispatcher.initDispatcher() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:73 > +0x2c4 > github.com/apache/yunikorn-k8shim/pkg/dispatcher.createDispatcher() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:305 > +0x2f > runtime.Goexit() > /usr/local/go/src/runtime/panic.go:626 +0x5d > testing.(*T).FailNow() > :1 +0x31 > gotest.tools/v3/assert.Equal() > > /home/chenyulin0719/go/pkg/mod/gotest.tools/v3@v3.5.1/assert/assert.go:205 > +0x1aa > github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:244 > +0x2ba > testing.tRunner() > /usr/local/go/src/testing/testing.go:1689 +0x21e > testing.(*T).Run.gowrap1() > /usr/local/go/src/testing/testing.go:1742 +0x44Previous read at > 0x035315e0 by goroutine 90: > > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.func1() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:188 > +0x2f5 > > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.gowrap1() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:197 > +0x6eGoroutine 88 (running) created at: > testing.(*T).Run() > /usr/local/go/src/testing/testing.go:1742 +0x825 > testing.runTests.func1() > /usr/local/go/src/testing/testing.go:2161 +0x85 > testing.tRunner() > /usr/local/go/src/testing/testing.go:1689 +0x21e > testing.runTests() > /usr/local/go/src/testing/testing.go:2159 +0x8be > testing.(*M).Run() > /usr/local/go/src/testing/testing.go:2027 +0xf17 > main.main() > _testmain.go:55 +0x2bdGoroutine 90 (running) created at: > > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch() > > 
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:178 > +0x391 > github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).dispatch() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:164 > +0xbb > github.com/apache/yunikorn-k8shim/pkg/dispatcher.Dispatch() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:142 > +0x71 > github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout() > > /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:232 > +0x244 > testing.tRunner() > /usr/local/go/src/testing/testing.go:1689 +0x21e > testing.(*T).Run.gowrap1() > /usr/local/go/src/testing/testing.go:1742 +0x44 > =={code} > Root Cause: > * The [global > variables|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L46-L51] > in dispatcher.go are not protected when running unit tests. Each unit test > will run initDispatcher() through > [createDispatcher()|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatch_test.go#L305]. > * A race occurs if any other unit tests
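The switch suggested in the comment — from `atomic.AddInt32` on a plain `int32` to the typed `atomic.Int32` added in Go 1.19 — can be sketched as follows. The counter name mirrors the one in the comment; the goroutine loop is an illustrative stand-in for the dispatcher's async dispatch path:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// asyncDispatchCount uses the typed atomic.Int32 (Go 1.19+). The type
// cannot be accessed non-atomically by accident, unlike a plain int32
// updated via atomic.AddInt32(&count, 1).
var asyncDispatchCount atomic.Int32

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			asyncDispatchCount.Add(1) // replaces atomic.AddInt32(&asyncDispatchCount, 1)
		}()
	}
	wg.Wait()
	fmt.Println(asyncDispatchCount.Load())

	// Tests that run with -count=N must reset shared state between runs;
	// with the typed atomic that is a single Store call.
	asyncDispatchCount.Store(0)
	fmt.Println(asyncDispatchCount.Load())
}
```

The final `Store(0)` is exactly the reset step the comment says the tests are missing, which is why runs 2..10 see a non-zero starting count.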
[jira] [Created] (YUNIKORN-2575) Make logging for IsPodFitNode clear
Wilfred Spiegelenburg created YUNIKORN-2575: --- Summary: Make logging for IsPodFitNode clear Key: YUNIKORN-2575 URL: https://issues.apache.org/jira/browse/YUNIKORN-2575 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg The logging in {{IsPodFitNode()}} logs the same message for a missing pod and node. We should log clearly which thing is missing: the node or the pod. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2573) Unit test occasionally failed due to dead lock
[ https://issues.apache.org/jira/browse/YUNIKORN-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839453#comment-17839453 ] Wilfred Spiegelenburg commented on YUNIKORN-2573: - {quote}Since there is always an error or warning from the scheduler health check when running multiple tests at the same time, {quote} Health checks collect details while we run other things. There is no "stop the world" locking happening, which means that things can change while the health checks run. This can sometimes lead to comparing data from before a change with data from after a change, showing a health issue. Unless the tests hang and do not finish there is no deadlock case. > Unit test occasionally failed due to dead lock > -- > > Key: YUNIKORN-2573 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2573 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Arthur Wang >Assignee: Arthur Wang >Priority: Minor > > [github > pipeline|https://github.com/apache/yunikorn-core/actions/runs/8770718393/job/24067600801] > Unit test occasionally failed due to dead lock > Still working on finding the root cause. > Since there is always an error or warning from the scheduler health check when > running multiple tests at the same time, > maybe it's some test setting issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2571) Add hierarchy icon to queue node
[ https://issues.apache.org/jira/browse/YUNIKORN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2571: Fix Version/s: (was: 1.6.0) Target Version: 1.6.0 Please use the target version when setting a release for which the fix is planned. The fix version is the release in which the changes are committed and included; it is set on closure of the Jira after the changes are committed. Even open jiras show up as part of the release notes for that release, which can lead to incorrect info in a release. > Add hierarchy icon to queue node > > > Key: YUNIKORN-2571 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2571 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Dong-Lin Hsieh >Assignee: Dong-Lin Hsieh >Priority: Major > Labels: pull-request-available > > make the queue node look better! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2570) Add test cases to break the current preemption flow
[ https://issues.apache.org/jira/browse/YUNIKORN-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2570: Fix Version/s: (was: 1.6.0) Target Version: 1.6.0 Please use the target version when setting a release for which the fix is planned. The fix version is the release in which the changes are committed and included; it is set on closure of the Jira after the changes are committed. Even open jiras show up as part of the release notes for that release, which can lead to incorrect info in a release. > Add test cases to break the current preemption flow > --- > > Key: YUNIKORN-2570 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2570 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > Labels: pull-request-available > > Add various test cases to break the current preemption flow. These tests would > fail now. The follow-up jira > [https://issues.apache.org/jira/browse/YUNIKORN-2500] should fix the problems > in the current preemption flow so that these test cases pass. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2526) Discrepancy between shim cache and core app/task list after scheduler restart
[ https://issues.apache.org/jira/browse/YUNIKORN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2526: Target Version: 1.5.1 > Discrepancy between shim cache and core app/task list after scheduler restart > - > > Key: YUNIKORN-2526 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2526 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Shravan Achar >Priority: Major > Attachments: log-snippet.txt, state-dump-4-1-3.json, > state-dump-4-17.json.zip > > > When the scheduler restarts, occasionally it gets into a situation where the > application is still in Running state despite the application having been > terminated in the cluster. This is confirmed with the attached state dump. > > The scheduler core logs indicate all nodes are being evaluated for a > non-existent application (also attached). CPU is being used up doing this > unneeded evaluation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()
[ https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2562: Target Version: 1.5.1 > Nil pointer in Application.ReplaceAllocation() > -- > > Key: YUNIKORN-2562 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2562 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Peter Bacsko >Priority: Major > > The following panic was generated during placeholder replacement: > {noformat} > 2024-04-16T13:46:58.583Z INFOshim.cache.task cache/task.go:542 > releasing allocations {"numOfAsksToRelease": 1, > "numOfAllocationsToRelease": 1} > 2024-04-16T13:46:58.583Z INFOshim.fsmcache/task_state.go:380 > Task state transition {"app": "application-spark-abrdrsmo8no2", "task": > "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": > "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", > "source": "Bound", "destination": "Completed", "event": "CompleteTask"} > 2024-04-16T13:46:58.584Z INFOcore.scheduler.application > objects/application.go:616 ask removed successfully from application > {"appID": "application-spark-abrdrsmo8no2", "ask": > "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"} > 2024-04-16T13:46:58.584Z INFOcore.scheduler.partition > scheduler/partition.go:1281 replacing placeholder allocation > {"appID": "application-spark-abrdrsmo8no2", "allocationID": > "cd73be15-af61-4248-89e1-d3296e72214e"} > panic: runtime error: invalid memory address or nil pointer dereference > [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255] > goroutine 117 [running]: > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, > {0xc007710cf0, 0x24}) > > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 > +0x615 > github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, > 0xc009786700) > > 
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 > +0x28b > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, > {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9}) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 > +0x9e > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, > 0xc0071a3f10?) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 > +0xa5 > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 > +0x1c5 > created by > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in > goroutine 1 > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 > +0x9c > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2556) Remove getResourceUsageDAOInfo from test code
Wilfred Spiegelenburg created YUNIKORN-2556: --- Summary: Remove getResourceUsageDAOInfo from test code Key: YUNIKORN-2556 URL: https://issues.apache.org/jira/browse/YUNIKORN-2556 Project: Apache YuniKorn Issue Type: Improvement Components: core - common Reporter: Wilfred Spiegelenburg Remove the {{getResourceUsageDAOInfo()}} call from the test code. If we need to retrieve the usage for the whole queueTracker hierarchy we should add that in the test code separately instead of using the DAO and converting that back. The DAO object should also not contain the pointer to the resource object. It should contain the DAOMap for the resource object, similar to all other DAO definitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2555) Cleanup placement rules in partition
Wilfred Spiegelenburg created YUNIKORN-2555: --- Summary: Cleanup placement rules in partition Key: YUNIKORN-2555 URL: https://issues.apache.org/jira/browse/YUNIKORN-2555 Project: Apache YuniKorn Issue Type: Improvement Components: core - scheduler Reporter: Wilfred Spiegelenburg The placement rule config is tracked in the partition in the object {{partition.rules}}. This object contains the config with which the placement manager is initialised. This was used/needed before the move to always use placement rules. Since the change to always use placement rules it no longer has a function. The config is now also out of sync with the rules used in the placement manager. There is no need to keep this object in the partition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2540) clean up constants in pkg/cache/context_test.go
Wilfred Spiegelenburg created YUNIKORN-2540: --- Summary: clean up constants in pkg/cache/context_test.go Key: YUNIKORN-2540 URL: https://issues.apache.org/jira/browse/YUNIKORN-2540 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg Constants are duplicated in {{pkg/cache/context_test.go}}; for example, {{fakeNodeName}} is defined multiple times in the file. We should move to a central point of defining the constants for the test at the top of the file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0
[ https://issues.apache.org/jira/browse/YUNIKORN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834115#comment-17834115 ] Wilfred Spiegelenburg commented on YUNIKORN-2534: - Documented in this section, with full detail and the reasoning behind it: [https://yunikorn.apache.org/docs/user_guide/queue_config#queues] The resource check is different. I can specify a quota like this: {code:java} vcores: 1000 memory: 1T nvidia.com/gpu: 0{code} That is a valid quota and we apply that. It is a different quota than this one: {code:java} vcores: 1000 memory: 1T{code} In the first quota you are not allowed to use the resource {{nvidia.com/gpu}}; in the second quota there is no limit on how many GPUs you can use. What is not allowed in quotas is something that only specifies zeros: {code:java} vcores: 0{code} or {code:java} vcores: 0 memory: 0 nvidia.com/gpu: 0{code} This is the category that the max applications setting falls into. > [Yunikorn] Quota enforcement checks are failing when we have max-application > set to 0 > - > > Key: YUNIKORN-2534 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2534 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Rajesh Kanhaiya Lal >Priority: Major > Attachments: yunikorn-configs-fresh.yaml > > > The max-application checks are not working when we set > max-application to 0 in the yunikorn-config file. > The config validation is also ignored when max-application is set to 0; > for example, the check that the child max-application must be less than or equal to the parent > queue's is also not applied when max-application is set to 0. > Attached Yunikorn Config file > The user and group tracking API also does not log max-application in the response. 
> > {code:java} > curl --location 'http://127.0.0.1:9080/ws/v1/partition/default/usage/users' > [ > { > "userName": "nobody", > "groups": { > "ts333w3": "*", > "ts433": "*", > "ts544": "*", > "ts633": "*" > }, > "queues": { > "queuePath": "root", > "resourceUsage": { > "Resources": { > "memory": 3, > "pods": 3, > "vcore": 300 > } > }, > "runningApplications": [ > "ts333w3", > "ts433", > "ts544" > ], > "children": [ > { > "queuePath": "root.default", > "resourceUsage": { > "Resources": { > "memory": 3, > "pods": 3, > "vcore": 300 > } > }, > "runningApplications": [ > "ts333w3", > "ts433", > "ts544" > ] > } > ] > } > } > ] {code} > Could You please take a look ? > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
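The rule explained in the comment above — a quota may zero out individual resources, but may not consist of zeros only — can be sketched as a small validation function. The function name and shape are hypothetical, not the actual core implementation:

```go
package main

import "fmt"

// validQuota returns false only for a quota that specifies nothing but
// zeros. Zero values for individual resources are allowed (they forbid use
// of that resource) as long as at least one resource has a non-zero limit.
// An empty map means "no quota set", which is valid.
func validQuota(quota map[string]int64) bool {
	if len(quota) == 0 {
		return true
	}
	for _, v := range quota {
		if v != 0 {
			return true // at least one real limit: valid
		}
	}
	return false // all-zero quota: rejected
}

func main() {
	// Valid: GPUs are forbidden, but vcores and memory carry real limits.
	fmt.Println(validQuota(map[string]int64{"vcores": 1000, "memory": 1 << 40, "nvidia.com/gpu": 0}))
	// Invalid: only zeros specified.
	fmt.Println(validQuota(map[string]int64{"vcores": 0}))
}
```

Per the comment, max-application falls into the all-zero category: a value of 0 is rejected rather than treated as a limit.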
[jira] [Resolved] (YUNIKORN-2520) PVC errors in AssumePod() are not handled properly
[ https://issues.apache.org/jira/browse/YUNIKORN-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2520. - Fix Version/s: 1.6.0 Resolution: Fixed Changes merged to master. Volume issues should be handled correctly now. > PVC errors in AssumePod() are not handled properly > -- > > Key: YUNIKORN-2520 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2520 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > When there is an error caused by a volume operation in > {{Context.AssumePod()}}, the allocation on the core side will not be removed. > Although we check the result from {{UpdateAllocation}}, the error handling is > just logging: > {noformat} > if err := callback.UpdateAllocation(response); err != nil { > rmp.handleUpdateResponseError(rmID, err) > } > ... > func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) { > log.Log(log.RMProxy).Error("failed to handle response", >zap.String("rmID", rmID), >zap.Error(err)) > }{noformat} > I suggest moving volume-related code to {{Task.postTaskAllocated()}}. In > this case, the task will transition to "Failed" state and we'll have > allocationID available, so we can release both the ask and the allocation: > {noformat} > func (task *Task) releaseAllocation() { > ... > var releaseRequest *si.AllocationRequest > s := TaskStates() > switch task.GetTaskState() { > case s.New, s.Pending, s.Scheduling, s.Rejected: > releaseRequest = common.CreateReleaseAskRequestForTask( > task.applicationID, task.taskID, > task.application.partition) <-- release ask + allocation if possible > default: > if task.allocationID == "" { > ... log error ... 
> return > } > releaseRequest = > common.CreateReleaseAllocationRequestForTask( > task.applicationID, task.taskID, > task.allocationID, task.application.partition, task.terminationType) > } > ...{noformat} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2538) Shim cache context pre-allocate slice
Wilfred Spiegelenburg created YUNIKORN-2538: --- Summary: Shim cache context pre-allocate slice Key: YUNIKORN-2538 URL: https://issues.apache.org/jira/browse/YUNIKORN-2538 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg When building the reason string from all volume failure reasons we should allocate the slice once, based on the size of the reasons object we get back. See [review comment|https://github.com/apache/yunikorn-k8shim/pull/810#discussion_r1550882867] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
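The review suggestion can be sketched in a few lines. The function name and the exact filtering are hypothetical — the point is sizing the slice once with `make` from `len(reasons)` instead of letting repeated `append` calls reallocate:

```go
package main

import (
	"fmt"
	"strings"
)

// buildReason joins all volume failure reasons into one string. The slice
// is allocated once with capacity len(reasons), so append never has to
// grow the backing array.
func buildReason(reasons []string) string {
	parts := make([]string, 0, len(reasons)) // single allocation up front
	for _, r := range reasons {
		if r == "" {
			continue // skip empty entries; capacity is an upper bound
		}
		parts = append(parts, r)
	}
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(buildReason([]string{"volume node affinity conflict", "", "pvc not bound"}))
}
```

Using the known length as capacity is the standard Go idiom for building a slice whose final size is bounded by an existing collection.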
[jira] [Created] (YUNIKORN-2537) cleanup UpdateAllocation in callback
Wilfred Spiegelenburg created YUNIKORN-2537: --- Summary: cleanup UpdateAllocation in callback Key: YUNIKORN-2537 URL: https://issues.apache.org/jira/browse/YUNIKORN-2537 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg UpdateAllocation needs a cleanup: {{getTask()}} already checks for the application. No need to retrieve the application when we process response.New. Sending an event should be linked to the existence of the task not of the application. On top of that we have the appID already in the task so we do not need to get it from the app. The same logic needs to be applied to the whole function, we already do it for the release.* handling. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2533) Implement String() for TrackedResource
Wilfred Spiegelenburg created YUNIKORN-2533: --- Summary: Implement String() for TrackedResource Key: YUNIKORN-2533 URL: https://issues.apache.org/jira/browse/YUNIKORN-2533 Project: Apache YuniKorn Issue Type: Improvement Components: core - common Reporter: Wilfred Spiegelenburg To fix the way TrackedResources are logged it should implement the String() function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
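A minimal sketch of what implementing {{String()}} could look like. The struct here is a simplified stand-in for the real TrackedResource, and the output format only mirrors the `TrackedResource{type:resource=value,...}` shape seen in the logs; it is illustrative, not the committed implementation:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// TrackedResource is a simplified stand-in: usage tracked per instance
// type, then per resource name.
type TrackedResource struct {
	TrackedResourceMap map[string]map[string]int64
}

// String implements fmt.Stringer so the value can be logged with a stable
// format (e.g. via zap.Stringer) instead of reflection through Any().
func (tr TrackedResource) String() string {
	types := make([]string, 0, len(tr.TrackedResourceMap))
	for t := range tr.TrackedResourceMap {
		types = append(types, t)
	}
	sort.Strings(types) // deterministic output for logs and tests
	var sb strings.Builder
	sb.WriteString("TrackedResource{")
	first := true
	for _, t := range types {
		names := make([]string, 0, len(tr.TrackedResourceMap[t]))
		for n := range tr.TrackedResourceMap[t] {
			names = append(names, n)
		}
		sort.Strings(names)
		for _, n := range names {
			if !first {
				sb.WriteString(",")
			}
			fmt.Fprintf(&sb, "%s:%s=%d", t, n, tr.TrackedResourceMap[t][n])
			first = false
		}
	}
	sb.WriteString("}")
	return sb.String()
}

func main() {
	tr := TrackedResource{TrackedResourceMap: map[string]map[string]int64{
		"spot": {"vcore": 354000, "pods": 177},
	}}
	fmt.Println(tr) // fmt picks up the Stringer automatically
}
```

An empty value prints as `TrackedResource{}`, matching the PreemptedResource/PlaceholderResource entries in the YUNIKORN-2532 log sample.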
[jira] [Commented] (YUNIKORN-2532) Resource usage report has an incompatible format change
[ https://issues.apache.org/jira/browse/YUNIKORN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833830#comment-17833830 ] Wilfred Spiegelenburg commented on YUNIKORN-2532: - {quote}That change was done to make logging more efficient (using Any() is bad practice). {quote} Not just bad practice, it does an inspection to try and map it to a type it knows. If it does not find a type it knows, it passes it to the normal formatting library which tries to do its best to create a string. It adds a lot of overhead. The types in the logging code could change based on the release of the logging code. {{Any()}} is a last resort logger if you are not sure what type the object is because you pass interfaces around. That is not the case here. Logging now has a stable format for the message. The fact that you noticed a difference between {{Any()}} and the {{Stringer()}} already shows the formatting was a best guess... > Resource usage report has an incompatible format change > --- > > Key: YUNIKORN-2532 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2532 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Yongjun Zhang >Priority: Major > > There is some recent change that caused the application resource usage report > to have a new format: > Prior to the change, the format was: > {code:java} > YK_APP_SUMMARY: {"appID": "adf53ee0-experiment-organicad-94520240-1-1", > "submissionTime": 1712169262131, "startTime": 1712169264134, "finishTime": > 1712173619983, "user": > "system:serviceaccount:spark-operator-02:spark-operator", "queue": > "root.queue-large", "state": "Completed", "rmID": "test-cluster", > "resourceUsage": > {"abc":{"memory":139178200478515200,"pods":1729129,"vcore":5183062000},"def":{"memory":113789789798400,"pods":1413,"vcore":4239000}}, > "preemptedResource": {}} > {code} > With the change, the new format is: > {code:java} > 2024-04-04T00:33:08.532Z INFOcore.scheduler.application.usage > 
objects/application_summary.go:60 YK_APP_SUMMARY: {ApplicationID: > afa303d0-test-trino-sparksql--20240404-2-1, SubmissionTime: 1712190615461, > StartTime: 1712190617496, FinishTime: 1712190788532, User: > system:serviceaccount:spark-operator-01:spark-operator, Queue: > root.queue-large, State: Completed, RmID: test-cluster, ResourceUsage: > TrackedResource{UNKNOWN:pods=177,UNKNOWN:vcore=354000,UNKNOWN:memory=1431454089216}, > PreemptedResource: TrackedResource{}, PlaceholderResource: > TrackedResource{}}{code} > There are several incompatibilities: > 1. the class name TrackedResource was not there before, now it is. > 2. the instance type was outside the resource part before, now it's embedded > 3. the instance type was reported correctly before the change, now it's > UNKNOWN > #3 may be a different issue, but it's observed by us at the same time. > I think we should change the format back to the original one, as this is an > incompatible change. What do you think [~wilfreds], [~pbacsko], [~ccondit]? > Thanks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2527) Allow remove and re-add configured queue within cleanup time
[ https://issues.apache.org/jira/browse/YUNIKORN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2527. - Fix Version/s: 1.6.0 Resolution: Fixed Queues can now be removed and added back again within a cleanup cycle > Allow remove and re-add configured queue within cleanup time > - > > Key: YUNIKORN-2527 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2527 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > When we remove a queue from the config it is marked for cleanup. If we re-add > the same queue in the config again before the cleanup gets executed the queue > still gets removed. > reproduction: > * edit config map remove a queue, save > * immediately edit configmap add the same queue back, save > * wait for the cleanup to happen, queue should still exist after the fix -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2519) Remove bypass ACL check from placement rules
[ https://issues.apache.org/jira/browse/YUNIKORN-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2519. - Fix Version/s: 1.6.0 Resolution: Fixed refactor committed to master for 1.6.0 > Remove bypass ACL check from placement rules > > > Key: YUNIKORN-2519 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2519 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > Instead of having every rule return a flag that indicates whether to bypass > the ACL check, special case the recovery rule to bypass the checks. > The recovery queue is created without ACLs or quota and is always a leaf queue. > The only rule that can return the recovery queue is the recovery rule, which > is the last one in the list. > Use all these facts to simplify the placement processing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2527) Allow remove and re-add configured queue within cleanup time
[ https://issues.apache.org/jira/browse/YUNIKORN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2527: Description: When we remove a queue from the config it is marked for cleanup. If we re-add the same queue in the config again before the cleanup gets executed the queue still gets removed. reproduction: * edit config map remove a queue, save * immediately edit configmap add the same queue back, save * wait for the cleanup to happen, queue should still exist after the fix was: When we remove a queue from the config it is marked for cleanup. If we re-add the same queue in the config again before the cleanup gets executed the queue still gets removed. reproduction: * edit config map remove a queue, save * immediately edit configmap add the same queue back, save * wait for the cleanup to happen > Allow remove and re-add configured queue within cleanup time > - > > Key: YUNIKORN-2527 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2527 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > When we remove a queue from the config it is marked for cleanup. If we re-add > the same queue in the config again before the cleanup gets executed the queue > still gets removed. > reproduction: > * edit config map remove a queue, save > * immediately edit configmap add the same queue back, save > * wait for the cleanup to happen, queue should still exist after the fix -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2527) Allow remove and re-add configured queue within cleanup time
Wilfred Spiegelenburg created YUNIKORN-2527: --- Summary: Allow remove and re-add configured queue within cleanup time Key: YUNIKORN-2527 URL: https://issues.apache.org/jira/browse/YUNIKORN-2527 Project: Apache YuniKorn Issue Type: Bug Components: core - common Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg When we remove a queue from the config it is marked for cleanup. If we re-add the same queue in the config again before the cleanup gets executed the queue still gets removed. reproduction: * edit config map remove a queue, save * immediately edit configmap add the same queue back, save * wait for the cleanup to happen -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2498) Implement force create flag in k8shim for recovery queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2498. - Fix Version/s: 1.6.0 Resolution: Fixed > Implement force create flag in k8shim for recovery queue > > > Key: YUNIKORN-2498 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2498 > Project: Apache YuniKorn > Issue Type: Task > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > As part of the initialisation changes a new recovery queue was added to allow > already running allocations to be restored even if the queue config was > changed. The implementation on the k8shim side needs to be added to leverage > the forced create flag from YUNIKORN-1887. > Without that, the changes added for the recovery queue will not be used. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2522) Move e2e test doc from k8shim to website
[ https://issues.apache.org/jira/browse/YUNIKORN-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2522: Target Version: 1.6.0 > Move e2e test doc from k8shim to website > > > Key: YUNIKORN-2522 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2522 > Project: Apache YuniKorn > Issue Type: Improvement > Components: documentation >Reporter: JiaChi Wang >Assignee: JiaChi Wang >Priority: Minor > > If we move the e2e doc to the website under the developer guide, it may be > easier for users to access. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2523) Bump go to 1.22
[ https://issues.apache.org/jira/browse/YUNIKORN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832772#comment-17832772 ] Wilfred Spiegelenburg commented on YUNIKORN-2523: - Before we update the Go version we need at least some confirmation that people have tried it. I have not run any builds or tests with Go 1.22 yet. The linter golangci-lint we run might also need updating to a later version to support 1.22 and to make sure it works correctly. Changes in Go have broken the linter a number of times over the last few years. With the new toolchain dependency checks we need to leave go.mod as is and just update the version file we have in the repo. > Bump go to 1.22 > --- > > Key: YUNIKORN-2523 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2523 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Ryan Lo >Assignee: Ryan Lo >Priority: Major > > The latest Go version, 1.22, was released this February. > https://go.dev/doc/go1.22 > We should change to use the latest Go version to build YK. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2494) Revisit IsAtorAbove, WithIn, GetRemaining Guaranteed resources calculation
[ https://issues.apache.org/jira/browse/YUNIKORN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2494. - Fix Version/s: 1.6.0 Resolution: Fixed Functions added to the master code, not actively used yet. > Revisit IsAtorAbove, WithIn, GetRemaining Guaranteed resources calculation > -- > > Key: YUNIKORN-2494 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2494 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - common >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > These 3 methods don't expose the actual guaranteed values and return a > boolean value based on the calculation. There are cases where these boolean > values are not correct, and there is also a need to know the actual guaranteed > values. For example: how much is remaining in Guaranteed? How much can be > preempted? etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2519) Remove bypass ACL check from placement rules
[ https://issues.apache.org/jira/browse/YUNIKORN-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831356#comment-17831356 ] Wilfred Spiegelenburg commented on YUNIKORN-2519: - Logging of placements should be part of the app processing and not fall under the config; adding that to the refactor. > Remove bypass ACL check from placement rules > > > Key: YUNIKORN-2519 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2519 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > Instead of having all rules return a flag that indicates whether to bypass > the ACL check, special-case the recovery rule to bypass the checks. > The recovery queue is created without ACLs, has no quota, and is always a leaf queue. > The only rule that can return the recovery queue is the recovery rule, which > is the last one in the list. > Use all these facts to simplify the placement processing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2519) Remove bypass ACL check from placement rules
Wilfred Spiegelenburg created YUNIKORN-2519: --- Summary: Remove bypass ACL check from placement rules Key: YUNIKORN-2519 URL: https://issues.apache.org/jira/browse/YUNIKORN-2519 Project: Apache YuniKorn Issue Type: Improvement Components: core - scheduler Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Instead of having all rules return a flag that indicates whether to bypass the ACL check, special-case the recovery rule to bypass the checks. The recovery queue is created without ACLs, has no quota, and is always a leaf queue. The only rule that can return the recovery queue is the recovery rule, which is the last one in the list. Use all these facts to simplify the placement processing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
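The simplification proposed above can be sketched as follows — a hypothetical Go model of the rule chain (the real YuniKorn placement interfaces differ; the rule and queue names here are illustrative): only the recovery rule, always last in the list, skips the ACL check, so no rule needs to return a bypass flag.

```go
package main

import "fmt"

// rule is a minimal sketch of a placement rule; the real YuniKorn
// interface differs. A rule returns the queue it places the app in,
// or "" when it does not match.
type rule interface {
	name() string
	place(app string) string
}

// fixedRule always places apps in a fixed queue; used for illustration.
type fixedRule struct{ ruleName, queue string }

func (f fixedRule) name() string            { return f.ruleName }
func (f fixedRule) place(app string) string { return f.queue }

const recoveryRuleName = "recovery"

// placeApp walks the rules in order. Instead of every rule returning a
// bypass flag, only the recovery rule (the last in the list) skips the
// ACL check: the recovery queue has no ACLs or quota and is always a
// leaf, so no check is needed.
func placeApp(rules []rule, app, user string, aclAllows func(user, queue string) bool) string {
	for _, r := range rules {
		q := r.place(app)
		if q == "" {
			continue // rule did not match, try the next one
		}
		if r.name() == recoveryRuleName {
			return q // special case: recovery queue bypasses the ACL check
		}
		if aclAllows(user, q) {
			return q
		}
	}
	return "" // no rule produced an accessible queue
}

func main() {
	rules := []rule{
		fixedRule{"provided", "root.a"},
		fixedRule{recoveryRuleName, "root.@recover@"},
	}
	denyAll := func(user, queue string) bool { return false }
	// ACL denies root.a, so placement falls through to the recovery rule.
	fmt.Println(placeApp(rules, "app-1", "bob", denyAll))
}
```

Keeping the special case in the placement loop, rather than a per-rule flag, means ordinary rules stay oblivious to ACL bypassing entirely.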
[jira] [Created] (YUNIKORN-2518) Allow recovery queue in REST requests
Wilfred Spiegelenburg created YUNIKORN-2518: --- Summary: Allow recovery queue in REST requests Key: YUNIKORN-2518 URL: https://issues.apache.org/jira/browse/YUNIKORN-2518 Project: Apache YuniKorn Issue Type: Improvement Components: core - common Reporter: Wilfred Spiegelenburg The current checks for the REST requests that require a queue path to be provided prevent looking at the {{root.@recover@}} queue. The validator filters the queue names, which makes it impossible to check via the REST requests whether the queue has any running applications or pods after initialisation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2517) [Yunikorn] Incorrect Placeholder Count for Duplicate Task Groups in Gang scheduling
[ https://issues.apache.org/jira/browse/YUNIKORN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830813#comment-17830813 ] Wilfred Spiegelenburg commented on YUNIKORN-2517: - This looks like a side effect of YUNIKORN-1931. > [Yunikorn] Incorrect Placeholder Count for Duplicate Task Groups in Gang > scheduling > --- > > Key: YUNIKORN-2517 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2517 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Rajesh Kanhaiya Lal >Assignee: Manikandan R >Priority: Major > > Hi Team, I am getting an incorrect placeholder count for duplicate task > groups in gang scheduling. > Example: > {code:java} > TaskGroups: []v1alpha1.TaskGroup{ > {Name: "groupdup", MinMember: int32(3), > MinResource: map[string]resource.Quantity{ > "cpu":resource.MustParse("10m"), > "memory": resource.MustParse("10M"), > }}, > {Name: "groupdup", MinMember: int32(5), > MinResource: map[string]resource.Quantity{ > "cpu":resource.MustParse("10m"), > "memory": resource.MustParse("10M"), > }}, > {Name: "groupa", MinMember: int32(7), > MinResource: map[string]resource.Quantity{ > "cpu":resource.MustParse("10m"), > "memory": resource.MustParse("10M"), > }}, > }, {code} > For the above config, we are getting a total of 17 pods (2 group pods + 15 > placeholders). > It's adding placeholders for the duplicate group as well. > Could you please take a look? 
> {code:java} > gangjob-c805x-l4fx9 1/1 Running 0 47s > gangjob-c805x-tc8tr 1/1 Running 0 47s > tg-appid-oqina-groupa-1ap48pr4us 1/1 Running 0 45s > tg-appid-oqina-groupa-25t5jubyzl 1/1 Running 0 45s > tg-appid-oqina-groupa-6oxhqxnebc 1/1 Running 0 45s > tg-appid-oqina-groupa-bqj9nk3mdq 1/1 Running 0 45s > tg-appid-oqina-groupa-hugxbjb3xv 1/1 Running 0 45s > tg-appid-oqina-groupa-o46k68fhw1 1/1 Running 0 45s > tg-appid-oqina-groupa-vs5kxeop8z 1/1 Running 0 45s > tg-appid-oqina-groupdup-786dl3gch2 1/1 Running 0 45s > tg-appid-oqina-groupdup-877tnd4xdl 1/1 Running 0 45s > tg-appid-oqina-groupdup-b7yef7w47x 1/1 Running 0 45s > tg-appid-oqina-groupdup-cdqm1fcwbo 1/1 Running 0 45s > tg-appid-oqina-groupdup-hlxwv9to9z 1/1 Running 0 45s > tg-appid-oqina-groupdup-mvcd5pkijw 1/1 Running 0 45s > tg-appid-oqina-groupdup-o4d9s8d02p 1/1 Running 0 45s > tg-appid-oqina-groupdup-srrxrukstd 1/1 Running 0 45s {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
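One way the count reported above could be fixed is to deduplicate task groups by name before creating placeholders — a hypothetical Go sketch (the real shim types and the chosen fix may differ; rejecting a spec with duplicate group names outright is another option). With the duplicate skipped, the example spec yields 3 + 7 = 10 placeholders instead of 15.

```go
package main

import "fmt"

// TaskGroup mirrors only the fields relevant here from the
// gang-scheduling annotation; the layout is simplified.
type TaskGroup struct {
	Name      string
	MinMember int32
}

// placeholderCount counts the placeholders to create. Task groups are
// keyed by name, so a duplicate name must not add a second set of
// placeholders; here the first occurrence of a name wins.
func placeholderCount(groups []TaskGroup) int32 {
	seen := make(map[string]bool, len(groups))
	var total int32
	for _, tg := range groups {
		if seen[tg.Name] {
			continue // duplicate group name: ignore, do not double-count
		}
		seen[tg.Name] = true
		total += tg.MinMember
	}
	return total
}

func main() {
	groups := []TaskGroup{
		{Name: "groupdup", MinMember: 3},
		{Name: "groupdup", MinMember: 5}, // duplicate, skipped
		{Name: "groupa", MinMember: 7},
	}
	fmt.Println(placeholderCount(groups)) // prints 10
}
```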
[jira] [Updated] (YUNIKORN-2506) fix deprecation warning for fontsource-roboto
[ https://issues.apache.org/jira/browse/YUNIKORN-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2506: Summary: fix deprecation warning for fontsource-roboto (was: fix ) > fix deprecation warning for fontsource-roboto > - > > Key: YUNIKORN-2506 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2506 > Project: Apache YuniKorn > Issue Type: Improvement > Components: webapp >Reporter: Wilfred Spiegelenburg >Priority: Minor > Labels: newbie > > When running make on the web UI project a deprecation warning is printed for > the fonts we include: > {code:java} > WARN deprecated fontsource-roboto@4.0.0: Package relocated. Please install > and migrate to @fontsource/roboto. {code} > Move to {{@fontsource/roboto}} to fix the warning -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2506) fix
Wilfred Spiegelenburg created YUNIKORN-2506: --- Summary: fix Key: YUNIKORN-2506 URL: https://issues.apache.org/jira/browse/YUNIKORN-2506 Project: Apache YuniKorn Issue Type: Improvement Components: webapp Reporter: Wilfred Spiegelenburg When running make on the web UI project a deprecation warning is printed for the fonts we include: {code:java} WARN deprecated fontsource-roboto@4.0.0: Package relocated. Please install and migrate to @fontsource/roboto. {code} Move to {{@fontsource/roboto}} to fix the warning -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2498) Implement force create flag in k8shim for recovery queue
Wilfred Spiegelenburg created YUNIKORN-2498: --- Summary: Implement force create flag in k8shim for recovery queue Key: YUNIKORN-2498 URL: https://issues.apache.org/jira/browse/YUNIKORN-2498 Project: Apache YuniKorn Issue Type: Task Components: shim - kubernetes Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg As part of the initialisation changes a new recovery queue was added to allow already running allocations to be restored even if the queue config was changed. The implementation on the k8shim side needs to be added to leverage the forced create flag from YUNIKORN-1887. Without that, the changes added for the recovery queue will not be used. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2497) Update node.js to 18.19.1
Wilfred Spiegelenburg created YUNIKORN-2497: --- Summary: Update node.js to 18.19.1 Key: YUNIKORN-2497 URL: https://issues.apache.org/jira/browse/YUNIKORN-2497 Project: Apache YuniKorn Issue Type: Task Components: website Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Node 18.x is an LTS version. Version 18.17 has been superseded by two other releases, 18.18 and 18.19. Both have some CVE fixes which we should be including for stability. Moving the build to 18.19 (currently 18.19.1). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2496) Fix security issues in website javascript
[ https://issues.apache.org/jira/browse/YUNIKORN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2496. - Fix Version/s: 1.6.0 Resolution: Fixed Change committed; all dependabot alerts closed. > Fix security issues in website javascript > - > > Key: YUNIKORN-2496 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2496 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > The change to pnpm triggered a large number of security alerts from > dependabot. > 7 could be fixed directly by the 4 PRs opened by dependabot. 6 need manual > intervention. > The change also included an upgrade of the Algolia search component to 3.x. > That change prevents running {{pnpm audit}}. > Docusaurus 3.x also contains a large number of backward incompatible changes > and an upgrade is planned separately. Using the Algolia 3.x dependency > already pushes some of these changes and should be reverted to Algolia 2.x, > the same as the rest of the Docusaurus environment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2496) Fix security issues in website javascript
[ https://issues.apache.org/jira/browse/YUNIKORN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827831#comment-17827831 ] Wilfred Spiegelenburg commented on YUNIKORN-2496: - When updating axios via pnpm, it gets upgraded to 1.6.8. The build after that change does not work anymore. Forcing axios to move to 0.28 (from the vulnerable 0.25) fixes that issue. > Fix security issues in website javascript > - > > Key: YUNIKORN-2496 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2496 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > > The change to pnpm triggered a large number of security alerts from > dependabot. > 7 could be fixed directly by the 4 PRs opened by dependabot. 6 need manual > intervention. > The change also included an upgrade of the Algolia search component to 3.x. > That change prevents running {{pnpm audit}}. > Docusaurus 3.x also contains a large number of backward incompatible changes > and an upgrade is planned separately. Using the Algolia 3.x dependency > already pushes some of these changes and should be reverted to Algolia 2.x, > the same as the rest of the Docusaurus environment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2496) Fix security issues in website javascript
Wilfred Spiegelenburg created YUNIKORN-2496: --- Summary: Fix security issues in website javascript Key: YUNIKORN-2496 URL: https://issues.apache.org/jira/browse/YUNIKORN-2496 Project: Apache YuniKorn Issue Type: Task Components: website Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg The change to pnpm triggered a large number of security alerts from dependabot. 7 could be fixed directly by the 4 PRs opened by dependabot. 6 need manual intervention. The change also included an upgrade of the Algolia search component to 3.x. That change prevents running {{pnpm audit}}. Docusaurus 3.x also contains a large number of backward incompatible changes and an upgrade is planned separately. Using the Algolia 3.x dependency already pushes some of these changes and should be reverted to Algolia 2.x, the same as the rest of the Docusaurus environment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2490) Add new PMC and committer members
[ https://issues.apache.org/jira/browse/YUNIKORN-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827340#comment-17827340 ] Wilfred Spiegelenburg commented on YUNIKORN-2490: - I had made that change but forgot to push before the merge. I did it directly after; all is correct now. > Add new PMC and committer members > - > > Key: YUNIKORN-2490 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2490 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Trivial > Labels: pull-request-available > Fix For: 1.6.0 > > > We have elected a new PMC member and some committers. Now that they have > accepted we should add them to the website. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org