[jira] [Commented] (YUNIKORN-2280) Possible memory leak in scheduler

2023-12-20 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799194#comment-17799194
 ] 

Weiwei Yang commented on YUNIKORN-2280:
---

Hi [~ccondit], I think another angle on this problem is to review which API 
calls are behind it.
For example, when we send events to K8s we rate limit in some places, because 
otherwise the volume can be overwhelming. I am unsure whether this is related, 
but maybe somewhere we send too many events?

> Possible memory leak in scheduler
> -
>
> Key: YUNIKORN-2280
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2280
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.3.0, 1.4.0
> Environment: EKS 1.24, we observed same behavior with YK 1.3.0 & 1.4.0
>Reporter: Timothy Potter
>Priority: Major
> Attachments: goroutine-dump.out, heap-dump-1001.out, 
> heap-dump-1255.out, yunikor-scheduler-process-memory.png, 
> yunikorn-process-memory-last9hours.png, yunikorn-scheduler-goroutines.png
>
>
> Memory for our scheduler pod slowly increases until it gets killed by kubelet 
> for surpassing its memory limit. 
> I've included two heap dump files collected about 3 hours apart, see process 
> memory chart for the same period. Not really sure what to make of these heap 
> dumps so hoping someone else who knows the code better might have some 
> insights?
> from heap-dump-1001.out:
> {code}
>       flat  flat%   sum%        cum   cum%
>     1.46GB 24.68% 24.68%     1.46GB 24.68%  reflect.unsafe_NewArray
>     1.30GB 21.94% 46.63%     1.32GB 22.35%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
>     1.06GB 17.96% 64.58%     1.06GB 17.96%  k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).UnmarshalJSON
>     0.88GB 14.87% 79.45%     0.88GB 14.87%  reflect.mapassign_faststr0
> {code}
> from heap-dump-1255.out:
> {code}
>       flat  flat%   sum%        cum   cum%
>  1756.18MB 23.53% 23.53%  1756.18MB 23.53%  reflect.unsafe_NewArray
>  1612.36MB 21.60% 45.13%  1645.86MB 22.05%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
>  1359.86MB 18.22% 63.35%  1359.86MB 18.22%  k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).UnmarshalJSON
>  1136.40MB 15.22% 78.57%  1136.40MB 15.22%  reflect.mapassign_faststr0
> {code}
> We also see odd spikes in the # of goroutines but that doesn't seem 
> correlated with the increase in memory (mainly just mentioning this in case 
> it's unexpected)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-2271) Incorrect handling of GPU only resources

2023-12-15 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797321#comment-17797321
 ] 

Weiwei Yang commented on YUNIKORN-2271:
---

Thanks a lot [~zhuqi] for looking into this!

> Incorrect handling of GPU only resources
> 
>
> Key: YUNIKORN-2271
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2271
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/yunikorn-k8shim/blob/a118ba6c4d84804e2a407f9d91196ece4690cf09/pkg/common/resource.go#L61-L63
>  this code seems to have a bug. When I define resource like this:
> {code}
> request:
>   nvidia.com/gpu: 1
> limit:
>   nvidia.com/gpu: 1
>  {code}
> this is considered as QoS best effort and returned with just
> {code}
> Resources:
>   pod:1
> {code}
> but I think this is a valid configuration: a pod that specifies only a GPU 
> resource, without memory or CPU. The cause seems to be the upstream K8s code 
> in qos.GetPodQOS().
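
The behaviour described in the issue can be modelled with a small Go sketch. This is a simplified, hypothetical model and not the upstream function: it assumes that only cpu and memory influence the QoS class, which is how the upstream code appears to behave, and the names `quantities`, `qosRelevant` and `qosClass` are made up for illustration.

```go
package main

import "fmt"

// quantities maps a resource name to an amount. It is a simplified
// stand-in for the Kubernetes ResourceList type.
type quantities map[string]int64

// qosRelevant lists the resources that this sketch lets influence the
// QoS class. Assumption: only cpu and memory count.
var qosRelevant = []string{"cpu", "memory"}

// qosClass is a simplified, hypothetical model of QoS classification
// for a single-container pod.
func qosClass(requests, limits quantities) string {
	relevantRequests, relevantLimits := 0, 0
	guaranteed := true
	for _, name := range qosRelevant {
		req, hasReq := requests[name]
		lim, hasLim := limits[name]
		if hasReq && req > 0 {
			relevantRequests++
		}
		if hasLim && lim > 0 {
			relevantLimits++
		} else {
			guaranteed = false // a missing cpu or memory limit rules out Guaranteed
		}
		if hasReq && hasLim && req != lim {
			guaranteed = false
		}
	}
	switch {
	case relevantRequests == 0 && relevantLimits == 0:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	gpuOnly := quantities{"nvidia.com/gpu": 1}
	// The GPU entry is ignored, so the pod looks as if it requests nothing.
	fmt.Println(qosClass(gpuOnly, gpuOnly)) // BestEffort
}
```

Under this model a GPU-only pod falls into the best-effort branch, which would explain why only `pod:1` is reported if the shim short-circuits on that class.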






[jira] [Updated] (YUNIKORN-2271) Incorrect handling of GPU only resources

2023-12-14 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-2271:
--
Description: 
https://github.com/apache/yunikorn-k8shim/blob/a118ba6c4d84804e2a407f9d91196ece4690cf09/pkg/common/resource.go#L61-L63
 this code seems to have a bug. When I define resource like this:

{code}
request:
  nvidia.com/gpu: 1
limit:
  nvidia.com/gpu: 1
 {code}

this is considered as QoS best effort and returned with just

{code}
Resources:
  pod:1
{code}

but I think this is a valid configuration: a pod that specifies only a GPU 
resource, without memory or CPU. The cause seems to be the upstream K8s code 
in qos.GetPodQOS().

  was:
https://github.com/apache/yunikorn-k8shim/blob/a118ba6c4d84804e2a407f9d91196ece4690cf09/pkg/common/resource.go#L61-L63
 this code seems to have a bug. When I define resource like this:

{code}
request:
  nvidia.com/gpu: 1
limit:
  nvidia.com/gpu: 1
 {code}

this is considered as QoS best effort and returned with just

{code}
Resources:
  pod:1
{code}

but I think this is a valid configuration: a pod that specifies only a GPU 
resource, without memory or CPU. 


> Incorrect handling of GPU only resources
> 
>
> Key: YUNIKORN-2271
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2271
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Priority: Major
>
> https://github.com/apache/yunikorn-k8shim/blob/a118ba6c4d84804e2a407f9d91196ece4690cf09/pkg/common/resource.go#L61-L63
>  this code seems to have a bug. When I define resource like this:
> {code}
> request:
>   nvidia.com/gpu: 1
> limit:
>   nvidia.com/gpu: 1
>  {code}
> this is considered as QoS best effort and returned with just
> {code}
> Resources:
>   pod:1
> {code}
> but I think this is a valid configuration: a pod that specifies only a GPU 
> resource, without memory or CPU. The cause seems to be the upstream K8s code 
> in qos.GetPodQOS().






[jira] [Commented] (YUNIKORN-2271) Incorrect handling of GPU only resources

2023-12-14 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796961#comment-17796961
 ] 

Weiwei Yang commented on YUNIKORN-2271:
---

Simple code to reproduce:

{code}
// Assumed imports: v1 "k8s.io/api/core/v1",
// apis "k8s.io/apimachinery/pkg/apis/meta/v1",
// "k8s.io/apimachinery/pkg/api/resource", plus "fmt" and "testing".
func TestGPUOnlyResources(t *testing.T) {
	containers := make([]v1.Container, 0)

	// container 01 requests only a GPU, with no CPU or memory
	c1Resources := make(map[v1.ResourceName]resource.Quantity)
	c1Resources[v1.ResourceName("nvidia.com/gpu")] = resource.MustParse("1")
	containers = append(containers, v1.Container{
		Name: "container-01",
		Resources: v1.ResourceRequirements{
			Requests: c1Resources,
		},
	})

	pod := &v1.Pod{
		TypeMeta: apis.TypeMeta{
			Kind:       "Pod",
			APIVersion: "v1",
		},
		ObjectMeta: apis.ObjectMeta{
			Name: "pod-resource-test-1",
			UID:  "UID-1",
		},
		Spec: v1.PodSpec{
			Containers: containers,
		},
	}

	// Print the resources the shim computes for the pod
	res := GetPodResource(pod)
	fmt.Println(res.String())
}
{code}

> Incorrect handling of GPU only resources
> 
>
> Key: YUNIKORN-2271
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2271
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Priority: Major
>
> https://github.com/apache/yunikorn-k8shim/blob/a118ba6c4d84804e2a407f9d91196ece4690cf09/pkg/common/resource.go#L61-L63
>  this code seems to have a bug. When I define resource like this:
> {code}
> request:
>   nvidia.com/gpu: 1
> limit:
>   nvidia.com/gpu: 1
>  {code}
> this is considered as QoS best effort and returned with just
> {code}
> Resources:
>   pod:1
> {code}
> but I think this is a valid configuration: a pod that specifies only a GPU 
> resource, without memory or CPU. 






[jira] [Created] (YUNIKORN-2271) Incorrect handling of GPU only resources

2023-12-14 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-2271:
-

 Summary: Incorrect handling of GPU only resources
 Key: YUNIKORN-2271
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2271
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Weiwei Yang


https://github.com/apache/yunikorn-k8shim/blob/a118ba6c4d84804e2a407f9d91196ece4690cf09/pkg/common/resource.go#L61-L63
 this code seems to have a bug. When I define resource like this:

{code}
request:
  nvidia.com/gpu: 1
limit:
  nvidia.com/gpu: 1
 {code}

this is considered as QoS best effort and returned with just

{code}
Resources:
  pod:1
{code}

but I think this is a valid configuration: a pod that specifies only a GPU 
resource, without memory or CPU. 






[jira] [Comment Edited] (YUNIKORN-2270) GPU Preemption is not triggered as expected when all available GPUs are used

2023-12-13 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796508#comment-17796508
 ] 

Weiwei Yang edited comment on YUNIKORN-2270 at 12/14/23 1:42 AM:
-

Based on my investigation, I think the issue is caused by these lines of code: 
https://github.com/apache/yunikorn-core/blob/620687afe10638d3e191edffbc81959985a4/pkg/scheduler/objects/preemption.go#L576-L587.
 In my case, all GPUs are used and there are 300 pods pending on other queues, 
so the headroom for GPU is always 0. As a result, it never reaches the reserve 
code and instead goes to: "Preempting allocations for ask, but not reserving 
yet as queue is still above capacity". So the asks in queue a are marked as 
having triggered preemption, but are unable to get the preempted resources.

When I comment out this check
{code}
p.headRoom.FitInMaxUndef(p.ask.GetAllocatedResource())
{code}

and always return a reserved allocation, it then works pretty well. [~ccondit] 
can you please take a look and share your thoughts?
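
As a rough illustration of why a GPU headroom of 0 blocks the reservation, here is a small Go sketch. It is not the yunikorn-core code: `resource` and `fitsTreatingUndefinedAsMax` are hypothetical names, and the sketch assumes the check treats resource types missing from the headroom as unlimited, which should be confirmed against the real `FitInMaxUndef` implementation.

```go
package main

import "fmt"

// resource maps a resource name to a quantity. It is a simplified
// stand-in for the scheduler's resource type.
type resource map[string]int64

// fitsTreatingUndefinedAsMax reports whether ask fits in headroom.
// Assumption: a type that is not defined in headroom is unlimited.
func fitsTreatingUndefinedAsMax(headroom, ask resource) bool {
	for name, want := range ask {
		have, defined := headroom[name]
		if !defined {
			continue // undefined in headroom: treated as the maximum value
		}
		if want > have {
			return false
		}
	}
	return true
}

func main() {
	ask := resource{"nvidia.com/gpu": 1}

	// All GPUs are in use, so the headroom is defined and equal to 0.
	fmt.Println(fitsTreatingUndefinedAsMax(resource{"nvidia.com/gpu": 0}, ask)) // false

	// With no GPU entry in the headroom, the same ask would fit.
	fmt.Println(fitsTreatingUndefinedAsMax(resource{}, ask)) // true
}
```

With the first result the code never takes the reserve branch, which matches the log message quoted in the comment.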



was (Author: wwei):
Based on my investigation, I think the issue is caused by these lines of code: 
https://github.com/apache/yunikorn-core/blob/620687afe10638d3e191edffbc81959985a4/pkg/scheduler/objects/preemption.go#L576-L587.
 In my case, all GPUs are used and there are 300 pods pending on other queues, 
so the headroom for GPU is always 0. As a result, it never reaches the reserve 
code and instead goes to: "Preempting allocations for ask, but not reserving 
yet as queue is still above capacity". So the asks in queue a are marked as 
having triggered preemption, but are unable to get the preempted resources.


> GPU Preemption is not triggered as expected when all available GPUs are used
> 
>
> Key: YUNIKORN-2270
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2270
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Priority: Major
>
> I am testing an important preemption scenario for GPUs. The scenario is 
> designed as follows.
> The queue structure is pretty simple:
> {code}
> root.a (min=100, max=300)
> root.b (min=0, max=300)
> {code}
> The cluster has a total of 300 GPUs available, with no autoscaling. Steps to 
> reproduce:
> 1. Create 600 pods in the root.b queue, each needing 1 GPU. This consumes all 
> 300 GPUs available in the cluster and leaves 300 pods pending.
> 2. Create 100 pods in the root.a queue, each needing 1 GPU. The expectation is 
> that queue a will preempt 100 GPUs from queue b to reach its guarantee. 
> Observation: only a small number of pods in queue a started on resources 
> preempted from queue b, and the result is not stable. Queue a could not reach 
> its guaranteed resources. 






[jira] [Commented] (YUNIKORN-2270) GPU Preemption is not triggered as expected when all available GPUs are used

2023-12-13 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796508#comment-17796508
 ] 

Weiwei Yang commented on YUNIKORN-2270:
---

Based on my investigation, I think the issue is caused by these lines of code: 
https://github.com/apache/yunikorn-core/blob/620687afe10638d3e191edffbc81959985a4/pkg/scheduler/objects/preemption.go#L576-L587.
 In my case, all GPUs are used and there are 300 pods pending on other queues, 
so the headroom for GPU is always 0. As a result, it never reaches the reserve 
code and instead goes to: "Preempting allocations for ask, but not reserving 
yet as queue is still above capacity". So the asks in queue a are marked as 
having triggered preemption, but are unable to get the preempted resources.


> GPU Preemption is not triggered as expected when all available GPUs are used
> 
>
> Key: YUNIKORN-2270
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2270
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Priority: Major
>
> I am testing an important preemption scenario for GPUs. The scenario is 
> designed as follows.
> The queue structure is pretty simple:
> {code}
> root.a (min=100, max=300)
> root.b (min=0, max=300)
> {code}
> The cluster has a total of 300 GPUs available, with no autoscaling. Steps to 
> reproduce:
> 1. Create 600 pods in the root.b queue, each needing 1 GPU. This consumes all 
> 300 GPUs available in the cluster and leaves 300 pods pending.
> 2. Create 100 pods in the root.a queue, each needing 1 GPU. The expectation is 
> that queue a will preempt 100 GPUs from queue b to reach its guarantee. 
> Observation: only a small number of pods in queue a started on resources 
> preempted from queue b, and the result is not stable. Queue a could not reach 
> its guaranteed resources. 






[jira] [Created] (YUNIKORN-2270) GPU Preemption is not triggered as expected when all available GPUs are used

2023-12-13 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-2270:
-

 Summary: GPU Preemption is not triggered as expected when all 
available GPUs are used
 Key: YUNIKORN-2270
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2270
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Weiwei Yang


I am testing an important preemption scenario for GPUs. The scenario is 
designed as follows.

The queue structure is pretty simple:

{code}
root.a (min=100, max=300)
root.b (min=0, max=300)
{code}

The cluster has a total of 300 GPUs available, with no autoscaling. Steps to 
reproduce:

1. Create 600 pods in the root.b queue, each needing 1 GPU. This consumes all 
300 GPUs available in the cluster and leaves 300 pods pending.
2. Create 100 pods in the root.a queue, each needing 1 GPU. The expectation is 
that queue a will preempt 100 GPUs from queue b to reach its guarantee. 

Observation: only a small number of pods in queue a started on resources 
preempted from queue b, and the result is not stable. Queue a could not reach 
its guaranteed resources. 
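
The expected outcome of the scenario is simple arithmetic, sketched below in Go. The helper `preemptionTarget` is a hypothetical name used only to state the expectation; it does not describe how the scheduler computes preemption victims.

```go
package main

import "fmt"

// preemptionTarget returns how many units a queue would be expected to
// gain through preemption: the gap to its guarantee, capped by what it
// has pending, minus whatever is already free in the cluster.
func preemptionTarget(guaranteed, used, pending, free int) int {
	shortfall := guaranteed - used
	if pending < shortfall {
		shortfall = pending
	}
	shortfall -= free
	if shortfall < 0 {
		return 0
	}
	return shortfall
}

func main() {
	// Scenario numbers: root.a has min=100, uses 0 GPUs and has 100 pods
	// pending; root.b holds all 300 GPUs, so nothing is free.
	fmt.Println(preemptionTarget(100, 0, 100, 0)) // 100
}
```

So after preemption the expectation is 100 GPUs in root.a and 200 in root.b, whereas the observed result stays well below 100 for root.a.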






[jira] [Resolved] (YUNIKORN-2255) Minor updates on who-we-are page

2023-12-13 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-2255.
---
Fix Version/s: 1.5.0
   Resolution: Fixed

> Minor updates on who-we-are page
> 
>
> Key: YUNIKORN-2255
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2255
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.0
>
>
> Minor updates for Junping, changing the organization to Datastrato 






[jira] [Updated] (YUNIKORN-2255) Minor updates on who-we-are page

2023-12-09 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-2255:
--
Target Version:   (was: 1.5.0)

> Minor updates on who-we-are page
> 
>
> Key: YUNIKORN-2255
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2255
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: documentation
>Reporter: Weiwei Yang
>Priority: Major
>
> Minor updates for Junping, changing the organization to Datastrato 






[jira] [Assigned] (YUNIKORN-2255) Minor updates on who-we-are page

2023-12-09 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YUNIKORN-2255:
-

Assignee: Weiwei Yang

> Minor updates on who-we-are page
> 
>
> Key: YUNIKORN-2255
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2255
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>
> Minor updates for Junping, changing the organization to Datastrato 






[jira] [Created] (YUNIKORN-2255) Minor updates on who-we-are page

2023-12-09 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-2255:
-

 Summary: Minor updates on who-we-are page
 Key: YUNIKORN-2255
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2255
 Project: Apache YuniKorn
  Issue Type: Task
  Components: documentation
Reporter: Weiwei Yang


Minor updates for Junping, changing the organization to Datastrato 






[jira] [Commented] (YUNIKORN-1907) Integrate KubeRay with YuniKorn

2023-10-03 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771633#comment-17771633
 ] 

Weiwei Yang commented on YUNIKORN-1907:
---

Hi [~rainieli], thanks a lot for creating this work item.
This is a great collaboration between the YK and Ray communities. Glad you 
already have a ticket open in Ray. 
Will follow up on this one. Really great work, thank you!

> Integrate KubeRay with YuniKorn 
> 
>
> Key: YUNIKORN-1907
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1907
> Project: Apache YuniKorn
>  Issue Type: New Feature
>Reporter: Rainie Li
>Assignee: Rainie Li
>Priority: Major
> Fix For: 1.4.0
>
>
> Will work on integrating KubeRay with YuniKorn to schedule Ray jobs on EKS:
>  * Add changes on the KubeRay side
>  * Validate Ray jobs with YuniKorn queues, gang scheduling features, etc.
>  * Add other features on the YuniKorn side if needed






[jira] [Commented] (YUNIKORN-1724) Improve the performance of shim side scheduling cycle

2023-05-05 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720080#comment-17720080
 ] 

Weiwei Yang commented on YUNIKORN-1724:
---

Hi [~pbacsko], do you have a throughput comparison from before and after 
applying your patch?
GetNewTasks() doesn't seem very expensive, as it just holds the particular 
app's RLock. Most of the time, one app's tasks get added really fast, so 
holding the read lock shouldn't be a big problem. I am guessing the expensive 
part is the 
[sorting|https://github.com/apache/yunikorn-k8shim/blob/cc14a81fdeb0371db861279d05fbd45a98cffe54/pkg/cache/application.go#L314-L318];
 maybe a better fix is to replace the task map with a sorted map in 
application.go?

> Improve the performance of shim side scheduling cycle
> -
>
> Key: YUNIKORN-1724
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1724
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Attachments: getNewTasks.png
>
>
> Performance testing of YuniKorn uncovered that a lot of time is spent in 
> {{Application.Schedule()}} in the shim. The problem is related to the fact 
> that we collect task objects based on their state, which is maintained by 
> {{fsm.FSM}}. Even though we run {{Application.Schedule()}} only once per 
> second, it's still an issue due to the large number of {{RWMutex.RLock()}} 
> calls. With a lot of pods, this consumes a significant amount of CPU time.
> Two code paths are affected:
> The first is inside the switch-case part of {{Schedule()}}. We want to 
> know the number of tasks in the "New" state and we end up scanning all task 
> objects for their status. 
> The second is retrieving the "New" tasks from the {{taskMap}} structure. This 
> is done by {{GetNewTasks()}} / {{getTasks()}}, copying tasks based on their 
> respective state to a new slice.
> To speed things up, we have to track the "New" tasks in a new map which is 
> dynamically maintained when a new task is added and when it leaves the New 
> state (or the task gets removed). Knowing how many tasks we have also becomes 
> trivial and won't require slice iteration/filtering.
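
The bookkeeping proposed in the description can be sketched in Go as follows. This is a minimal illustration, not the shim's code: the `application` and `task` types here are simplified stand-ins, and the method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

const stateNew = "New"

// task is a simplified stand-in for the shim's task object.
type task struct {
	id    string
	state string
}

// application keeps a second map holding only the tasks in the New
// state, so counting or listing them never scans all tasks.
type application struct {
	lock     sync.RWMutex
	tasks    map[string]*task
	newTasks map[string]*task
}

func newApplication() *application {
	return &application{
		tasks:    make(map[string]*task),
		newTasks: make(map[string]*task),
	}
}

// addTask registers a task, which starts in the New state.
func (a *application) addTask(id string) {
	a.lock.Lock()
	defer a.lock.Unlock()
	t := &task{id: id, state: stateNew}
	a.tasks[id] = t
	a.newTasks[id] = t
}

// setState updates a task and keeps the New-task index in sync.
func (a *application) setState(id, state string) {
	a.lock.Lock()
	defer a.lock.Unlock()
	t, ok := a.tasks[id]
	if !ok {
		return
	}
	t.state = state
	if state == stateNew {
		a.newTasks[id] = t
	} else {
		delete(a.newTasks, id)
	}
}

// removeTask drops a task from both maps.
func (a *application) removeTask(id string) {
	a.lock.Lock()
	defer a.lock.Unlock()
	delete(a.tasks, id)
	delete(a.newTasks, id)
}

// newTaskCount is O(1): no per-task state lookups are needed.
func (a *application) newTaskCount() int {
	a.lock.RLock()
	defer a.lock.RUnlock()
	return len(a.newTasks)
}

func main() {
	app := newApplication()
	app.addTask("t1")
	app.addTask("t2")
	app.addTask("t3")
	app.setState("t1", "Pending")
	app.removeTask("t2")
	fmt.Println(app.newTaskCount()) // 1
}
```

The trade-off is that every state transition must go through one place that updates both maps, otherwise the index drifts from the real task states.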






[jira] [Commented] (YUNIKORN-1642) Scheduler recovery failed due to listing operation timeout

2023-03-20 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703015#comment-17703015
 ] 

Weiwei Yang commented on YUNIKORN-1642:
---

hi [~wilfreds] 

We can't let this pass with a WARN. If we cannot get the results back from 
the api-server, we should let it fail to prevent more problems.
Glad this is configurable now; that helps us tune it on large clusters. Even 
so, we still need the fatal logic in place, otherwise the failure is bypassed 
and causes more problems.

> Scheduler recovery failed due to listing operation timeout
> --
>
> Key: YUNIKORN-1642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1642
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>
> The listing operation in the recovery phase: 
> https://github.com/apache/yunikorn-k8shim/blob/c25ac60ffbc175c4966f917da21d184f34dea7b4/pkg/client/apifactory.go#L225.
>  This can sometimes fail on large clusters, because the response time from 
> the API server is not guaranteed. We then see logs like this:
> {noformat}
> 2023-03-16T07:00:46.181Z  WARN  client/apifactory.go:218  Failed to sync informers  {"error": "timeout waiting for condition"}
> 2023-03-16T07:00:46.182Z  INFO  general/general.go:344  Pod list retrieved from api server  {"nr of pods": 0}
> 2023-03-16T07:00:46.182Z  INFO  general/general.go:365  Application recovery statistics  {"nr of recoverable apps": 0, "nr of total pods": 0, "nr of pods without application metadata": 0, "nr of pods to be recovered": 0}
> I0316 07:00:51.319100   1 trace.go:205] Trace[140954425]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.20.11/tools/cache/reflector.go:167 (16-Mar-2023 07:00:16.168) (total time: 35150ms):
> {noformat}
> Since it is only a WARN, the scheduler continues even though the informers 
> did not return anything. This misleads the scheduler into concluding that 
> nothing needs to be recovered, and it goes ahead with scheduling. This causes 
> subsequent scheduler failures, and eventually nothing can be scheduled 
> anymore.
> This should be a FATAL error, so that the scheduler can be restarted to retry 
> the recovery.





[jira] [Created] (YUNIKORN-1642) Scheduler recovery failed due to listing operation timeout

2023-03-20 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1642:
-

 Summary: Scheduler recovery failed due to listing operation timeout
 Key: YUNIKORN-1642
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1642
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Weiwei Yang


The listing operation in the recovery phase: 
https://github.com/apache/yunikorn-k8shim/blob/c25ac60ffbc175c4966f917da21d184f34dea7b4/pkg/client/apifactory.go#L225.
 This can sometimes fail on large clusters, because the response time from 
the API server is not guaranteed. We then see logs like this:

{noformat}
2023-03-16T07:00:46.181Z  WARN  client/apifactory.go:218  Failed to sync informers  {"error": "timeout waiting for condition"}
2023-03-16T07:00:46.182Z  INFO  general/general.go:344  Pod list retrieved from api server  {"nr of pods": 0}
2023-03-16T07:00:46.182Z  INFO  general/general.go:365  Application recovery statistics  {"nr of recoverable apps": 0, "nr of total pods": 0, "nr of pods without application metadata": 0, "nr of pods to be recovered": 0}
I0316 07:00:51.319100   1 trace.go:205] Trace[140954425]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.20.11/tools/cache/reflector.go:167 (16-Mar-2023 07:00:16.168) (total time: 35150ms):
{noformat}

Since it is only a WARN, the scheduler continues even though the informers 
did not return anything. This misleads the scheduler into concluding that 
nothing needs to be recovered, and it goes ahead with scheduling. This causes 
subsequent scheduler failures, and eventually nothing can be scheduled 
anymore.

This should be a FATAL error, so that the scheduler can be restarted to retry 
the recovery.






[jira] [Assigned] (YUNIKORN-1642) Scheduler recovery failed due to listing operation timeout

2023-03-20 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YUNIKORN-1642:
-

Assignee: Weiwei Yang

> Scheduler recovery failed due to listing operation timeout
> --
>
> Key: YUNIKORN-1642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1642
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>
> The listing operation in the recovery phase: 
> https://github.com/apache/yunikorn-k8shim/blob/c25ac60ffbc175c4966f917da21d184f34dea7b4/pkg/client/apifactory.go#L225.
>  This can sometimes fail on large clusters, because the response time from 
> the API server is not guaranteed. We then see logs like this:
> {noformat}
> 2023-03-16T07:00:46.181Z  WARN  client/apifactory.go:218  Failed to sync informers  {"error": "timeout waiting for condition"}
> 2023-03-16T07:00:46.182Z  INFO  general/general.go:344  Pod list retrieved from api server  {"nr of pods": 0}
> 2023-03-16T07:00:46.182Z  INFO  general/general.go:365  Application recovery statistics  {"nr of recoverable apps": 0, "nr of total pods": 0, "nr of pods without application metadata": 0, "nr of pods to be recovered": 0}
> I0316 07:00:51.319100   1 trace.go:205] Trace[140954425]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.20.11/tools/cache/reflector.go:167 (16-Mar-2023 07:00:16.168) (total time: 35150ms):
> {noformat}
> Since it is only a WARN, the scheduler continues even though the informers 
> did not return anything. This misleads the scheduler into concluding that 
> nothing needs to be recovered, and it goes ahead with scheduling. This causes 
> subsequent scheduler failures, and eventually nothing can be scheduled 
> anymore.
> This should be a FATAL error, so that the scheduler can be restarted to retry 
> the recovery.






[jira] [Resolved] (YUNIKORN-1414) Adding Chinese translations of Sorting Policies

2023-03-16 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1414.
---
Target Version: 1.3.0
Resolution: Fixed

> Adding Chinese translations of  Sorting Policies 
> -
>
> Key: YUNIKORN-1414
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1414
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Chen, Kai-Chun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.3.0
>
>







[jira] [Commented] (YUNIKORN-1414) Adding Chinese translations of Sorting Policies

2023-03-04 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696527#comment-17696527
 ] 

Weiwei Yang commented on YUNIKORN-1414:
---

Hi [~KatLantyss], I have reverted the PR because the commit failed: see 
https://github.com/apache/yunikorn-site/actions/runs/4334649473. It seems 
some broken links were introduced:

{quote}
Exhaustive list of all broken links found:

- On source page path = /zh-cn/docs/next/user_guide/resource_quota_management:
   -> linking to sorting_policies.md#StateAwarePolicy (resolved as: 
/zh-cn/docs/next/user_guide/sorting_policies.md)
{quote}

Could you please recreate the PR with this fixed? Thanks!

> Adding Chinese translations of  Sorting Policies 
> -
>
> Key: YUNIKORN-1414
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1414
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Chen, Kai-Chun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.3.0
>
>







[jira] [Reopened] (YUNIKORN-1414) Adding Chinese translations of Sorting Policies

2023-03-04 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reopened YUNIKORN-1414:
---

> Adding Chinese translations of  Sorting Policies 
> -
>
> Key: YUNIKORN-1414
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1414
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Chen, Kai-Chun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.3.0
>
>







[jira] [Resolved] (YUNIKORN-1414) Adding Chinese translations of Sorting Policies

2023-03-04 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1414.
---
Fix Version/s: 1.3.0
   Resolution: Fixed

> Adding Chinese translations of  Sorting Policies 
> -
>
> Key: YUNIKORN-1414
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1414
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Chen, Kai-Chun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.3.0
>
>







[jira] [Resolved] (YUNIKORN-1512) Adding Chinese translation of Translation

2023-02-12 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1512.
---
Fix Version/s: 1.3.0
   Resolution: Fixed

> Adding Chinese translation of Translation
> -
>
> Key: YUNIKORN-1512
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1512
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Chen Yu Teng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.3.0
>
>







[jira] [Resolved] (YUNIKORN-1539) Add KubeConf talk info to events page

2023-01-17 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1539.
---
Fix Version/s: 1.2.0
 Assignee: Weiwei Yang
   Resolution: Fixed

> Add KubeConf talk info to events page
> -
>
> Key: YUNIKORN-1539
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1539
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.0
>
>
> Add the recent KubeConf talk info to the events page






[jira] [Created] (YUNIKORN-1539) Add KubeConf talk info to events page

2023-01-17 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1539:
-

 Summary: Add KubeConf talk info to events page
 Key: YUNIKORN-1539
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1539
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: website
Reporter: Weiwei Yang


Add the recent KubeConf talk info to the events page






[jira] [Resolved] (YUNIKORN-1522) Add Chinese translation for release announcement 1.1

2023-01-06 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1522.
---
Fix Version/s: 1.2.0
   Resolution: Fixed

> Add Chinese translation for release announcement 1.1
> 
>
> Key: YUNIKORN-1522
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1522
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: documentation
>Reporter: Wilfred Spiegelenburg
>Assignee: Wu hsuang zong
>Priority: Critical
>  Labels: newbie, pull-request-available
> Fix For: 1.2.0
>
>
> Release announcement for the release 1.1 is missing from the zh-cn translated 
> documents






[jira] [Resolved] (YUNIKORN-1409) Adding Chinese translation of User and Group Resolution

2022-12-24 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1409.
---
Fix Version/s: 1.2.0
   Resolution: Fixed

> Adding Chinese translation of User and Group Resolution
> ---
>
> Key: YUNIKORN-1409
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1409
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Wu hsuang zong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.0
>
>







[jira] [Resolved] (YUNIKORN-1387) Adding Chinese translation of partition and queue configuration

2022-12-21 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1387.
---
Fix Version/s: 1.2.0
   Resolution: Fixed

> Adding Chinese translation of partition and queue configuration
> ---
>
> Key: YUNIKORN-1387
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1387
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Chen Yu Teng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.0
>
>







[jira] [Resolved] (YUNIKORN-1422) Adding Chinese translations of Run TensorFlow Jobs

2022-11-25 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1422.
---
Fix Version/s: 1.2.0
   Resolution: Fixed

> Adding Chinese translations of Run TensorFlow Jobs
> --
>
> Key: YUNIKORN-1422
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1422
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: documentation
>Reporter: Wu hsuang zong
>Assignee: Wu hsuang zong
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.2.0
>
>
> While YUNIKORN-1339 added a tutorial about time-slicing to Run TensorFlow Jobs, 
> the Chinese version isn't synced yet.






[jira] [Resolved] (YUNIKORN-1386) Adding Chinese translation of deployment modes

2022-11-17 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1386.
---
Fix Version/s: 1.2.0
   Resolution: Fixed

> Adding Chinese translation of deployment modes
> --
>
> Key: YUNIKORN-1386
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1386
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Chen Yu Teng
>Assignee: Chen Yu Teng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.0
>
>







[jira] [Resolved] (YUNIKORN-1379) update meeting links on the community page

2022-11-17 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1379.
---
Fix Version/s: 1.2.0
   Resolution: Fixed

> update meeting links on the community page
> --
>
> Key: YUNIKORN-1379
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1379
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Jagadeesan A S
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 1.2.0
>
>
> The community page is not in sync with the google doc for the community sync.
> The zoom links on the page are the old expired links and should be updated to 
> the new ones from the doc.






[jira] [Resolved] (YUNIKORN-1408) Adding Chinese translation of workload/Overview

2022-11-17 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1408.
---
Fix Version/s: 1.2.0
   Resolution: Fixed

> Adding Chinese translation of workload/Overview
> ---
>
> Key: YUNIKORN-1408
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1408
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Chen Yu Teng
>Assignee: Chen Yu Teng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.0
>
>







[jira] [Commented] (YUNIKORN-1385) To provide visibility to application's aggregated resource consumption

2022-11-10 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17631932#comment-17631932
 ] 

Weiwei Yang commented on YUNIKORN-1385:
---

BTW [~yzhangal], it would be great if you could put this in a Google doc so that 
it is easier for folks to comment. 


> To provide visibility to application's aggregated resource consumption
> --
>
> Key: YUNIKORN-1385
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1385
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - common, core - scheduler, shim - kubernetes
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Major
> Attachments: image-2022-11-09-11-06-20-479.png, 
> yunikornResourceUsageVisibility.pdf
>
>
> Currently the "Used Resource" for a given application reported in the Yunikorn 
> GUI is actually a snapshot of the resources currently allocated to the 
> application. We need to provide visibility into how many resources an 
> application has used so far, and it would be important to report the 
> total resources used by the application. 
> The unit of resources used can be memoryseconds/vcoreseconds, as Hadoop 
> Yarn does in its app summary report. However, there is more complexity with 
> K8s, which supports different instance types (Hadoop Yarn does support 
> multiple instance types, but its app summary report ignores instance types). 
> It would be nice to report how much resource is used for each instance type 
> used by an application.
> This is a top-level Jira for this new feature. Subtask Jiras will be created 
> when we work on it.






[jira] [Commented] (YUNIKORN-1385) To provide visibility to application's aggregated resource consumption

2022-11-10 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17631930#comment-17631930
 ] 

Weiwei Yang commented on YUNIKORN-1385:
---

Hi [~yzhangal], thanks for putting all of this together in a doc.
I want to confirm that we are on the same page about one thing: we only aggregate 
the "requested" resource seconds, and we do not track the actually utilized 
resources. For example, if an executor requests 3 cores and runs for 
10 seconds, then the resource seconds value is 30. On the K8s node, however, 
this pod may use only 1 core, or it may use 5 cores; that is the "actual" 
utilization, and we do not want to track that.
K8s allows you to set a limit that caps the actual usage per pod, and it has a 
metrics server (a separate component) that can track the "actual" usage. Some 
third-party tools, such as Datadog, can achieve a similar goal. These 
are outside the scope of what we are discussing here. 

> To provide visibility to application's aggregated resource consumption
> --
>
> Key: YUNIKORN-1385
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1385
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - common, core - scheduler, shim - kubernetes
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Major
> Attachments: image-2022-11-09-11-06-20-479.png, 
> yunikornResourceUsageVisibility.pdf
>
>
> Currently the "Used Resource" for a given application reported in the Yunikorn 
> GUI is actually a snapshot of the resources currently allocated to the 
> application. We need to provide visibility into how many resources an 
> application has used so far, and it would be important to report the 
> total resources used by the application. 
> The unit of resources used can be memoryseconds/vcoreseconds, as Hadoop 
> Yarn does in its app summary report. However, there is more complexity with 
> K8s, which supports different instance types (Hadoop Yarn does support 
> multiple instance types, but its app summary report ignores instance types). 
> It would be nice to report how much resource is used for each instance type 
> used by an application.
> This is a top-level Jira for this new feature. Subtask Jiras will be created 
> when we work on it.






[jira] [Assigned] (YUNIKORN-1385) To provide visibility to application's aggregated resource consumption

2022-11-04 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YUNIKORN-1385:
-

Assignee: Yongjun Zhang

> To provide visibility to application's aggregated resource consumption
> --
>
> Key: YUNIKORN-1385
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1385
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - common, core - scheduler, shim - kubernetes
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Major
>
> Currently the "Used Resource" for a given application reported in the Yunikorn 
> GUI is actually a snapshot of the resources currently allocated to the 
> application. We need to provide visibility into how many resources an 
> application has used so far, and it would be important to report the 
> total resources used by the application. 
> The unit of resources used can be memoryseconds/vcoreseconds, as Hadoop 
> Yarn does in its app summary report. However, there is more complexity with 
> K8s, which supports different instance types (Hadoop Yarn does support 
> multiple instance types, but its app summary report ignores instance types). 
> It would be nice to report how much resource is used for each instance type 
> used by an application.
> This is a top-level Jira for this new feature. Subtask Jiras will be created 
> when we work on it.






[jira] [Commented] (YUNIKORN-1385) To provide visibility to application's aggregated resource consumption

2022-11-04 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629224#comment-17629224
 ] 

Weiwei Yang commented on YUNIKORN-1385:
---

Thanks a lot [~yzhangal].
+ [~wilfreds], [~ccondit], we briefly discussed this in the last meetup.
[~yzhangal], would you please draft your idea in a Google doc and share it in this 
ticket? It does not need to be very thorough; covering the motivation, basic 
idea, and workflow would be great. I think we need some discussion before going 
into implementation details.

> To provide visibility to application's aggregated resource consumption
> --
>
> Key: YUNIKORN-1385
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1385
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - common, core - scheduler, shim - kubernetes
>Reporter: Yongjun Zhang
>Priority: Major
>
> Currently the "Used Resource" for a given application reported in the Yunikorn 
> GUI is actually a snapshot of the resources currently allocated to the 
> application. We need to provide visibility into how many resources an 
> application has used so far, and it would be important to report the 
> total resources used by the application. 
> The unit of resources used can be memoryseconds/vcoreseconds, as Hadoop 
> Yarn does in its app summary report. However, there is more complexity with 
> K8s, which supports different instance types (Hadoop Yarn does support 
> multiple instance types, but its app summary report ignores instance types). 
> It would be nice to report how much resource is used for each instance type 
> used by an application.
> This is a top-level Jira for this new feature. Subtask Jiras will be created 
> when we work on it.






[jira] [Commented] (YUNIKORN-1293) Add custom redirects to the current version doc

2022-08-25 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584640#comment-17584640
 ] 

Weiwei Yang commented on YUNIKORN-1293:
---

This is a very good point. Let's move the discussion to the Spark community: 
apache/spark#37622. This PR is not a sustainable solution; it is going to break 
when we release a new version. BTW, I have spent hours on this, and it is the 
best working solution I have found so far.

Let's keep this JIRA open; we can revert this once we reach a consensus there. 

> Add custom redirects to the current version doc
> ---
>
> Key: YUNIKORN-1293
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1293
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: website
>Reporter: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>
> Docusaurus hides the version number in the URL for the latest version, e.g. 
> right now our latest version is 1.0.0, but http://localhost:3000/docs/1.0.0 
> gives a 404. Ideally, we should have an accessible URL for this version 
> as well. 






[jira] [Updated] (YUNIKORN-1293) Add custom redirects to the current version doc

2022-08-24 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1293:
--
Issue Type: Improvement  (was: Bug)

> Add custom redirects to the current version doc
> ---
>
> Key: YUNIKORN-1293
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1293
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: website
>Reporter: Weiwei Yang
>Priority: Major
>
> Docusaurus hides the version number in the URL for the latest version, e.g. 
> right now our latest version is 1.0.0, but http://localhost:3000/docs/1.0.0 
> gives a 404. Ideally, we should have an accessible URL for this version 
> as well. 






[jira] [Created] (YUNIKORN-1293) Add custom redirects to the current version doc

2022-08-24 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1293:
-

 Summary: Add custom redirects to the current version doc
 Key: YUNIKORN-1293
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1293
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: website
Reporter: Weiwei Yang


Docusaurus hides the version number in the URL for the latest version, e.g. 
right now our latest version is 1.0.0, but http://localhost:3000/docs/1.0.0 
gives a 404. Ideally, we should have an accessible URL for this version as 
well. 






[jira] [Resolved] (YUNIKORN-1292) Fix github pages issues after moving to TLP

2022-08-23 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1292.
---
Fix Version/s: 1.1.0
 Assignee: Weiwei Yang
   Resolution: Fixed

> Fix github pages issues after moving to TLP
> ---
>
> Key: YUNIKORN-1292
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1292
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> https://apache.github.io/yunikorn-release/ is still out of date. 
> This was pointed out by the Spark community in this PR: 
> https://github.com/apache/spark/pull/37622#discussion_r952877147






[jira] [Created] (YUNIKORN-1292) Fix github pages issues after moving to TLP

2022-08-23 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1292:
-

 Summary: Fix github pages issues after moving to TLP
 Key: YUNIKORN-1292
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1292
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Weiwei Yang


https://apache.github.io/yunikorn-release/ is still out of date. 
This was pointed out by the Spark community in this PR: 
https://github.com/apache/spark/pull/37622#discussion_r952877147






[jira] [Commented] (YUNIKORN-1253) PVCs won't get past WaitForFirstConsumer with Apache Yunikorn

2022-07-12 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565930#comment-17565930
 ] 

Weiwei Yang commented on YUNIKORN-1253:
---

Hi [~Yukali], thanks a lot. Please keep us posted on updates.


> PVCs won't get past WaitForFirstConsumer with Apache Yunikorn
> -
>
> Key: YUNIKORN-1253
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1253
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Alex Kalenyuk
>Assignee: Chen Yu Teng
>Priority: Major
> Attachments: pv.yaml, storageclass.yaml
>
>
> It seems that with Apache Yunikorn, WaitForFirstConsumer volume binding 
> storage classes are not supported (not sure if this is intended or not).
> This makes it problematic to use storage that is not globally accessible from 
> all nodes:
> [https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode]
> To reproduce a simple failure to use a WaitForFirstConsumer binding-backed 
> PVC:
> ```
> kind: Pod
> apiVersion: v1
> metadata:
>   name: test
>   namespace: default
>   labels:
>     app: sleep
>     applicationId: "sleep0001-node-selector-test"
>     queue: "root.sandbox"
> spec:
>   schedulerName: yunikorn
>   nodeSelector:
>     storage/ssd: 'true'
>   containers:
>     - name: test
>       resources:
>         limits:
>           cpu: 1
>           memory: 1G
>         requests:
>           cpu: 1
>           memory: 1G
>       image: busybox
>       command:
>         - sleep
>         - '100'
>       volumeMounts:
>         - name: scratch-volume
>           mountPath: /data
>   volumes:
>     - name: scratch-volume
>       ephemeral:
>         volumeClaimTemplate:
>           spec:
>             accessModes:
>               - ReadWriteOnce
>             resources:
>               requests:
>                 storage: 1Gi
>             storageClassName: hostpath-provisioner
>             volumeMode: Filesystem
> ```
> Storage used:
> https://github.com/kubevirt/hostpath-provisioner-operator
> A similar issue was spotted in:
> [https://github.com/kubernetes/kubernetes/issues/86262]
> And this PR seems to introduce the VolumeBinding filter but comments it out:
> [https://github.com/apache/yunikorn-k8shim/pull/313]
>  
> I might be off with the "Bug" type here so feel free to correct me;
> My thinking was that if introducing support for WFFC is trivial, this may 
> make sense to exist in older versions too.






[jira] [Commented] (YUNIKORN-1253) PVCs won't get past WaitForFirstConsumer with Apache Yunikorn

2022-07-05 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562955#comment-17562955
 ] 

Weiwei Yang commented on YUNIKORN-1253:
---

Discussed this issue with [~Yukali], please help to investigate, thanks!

> PVCs won't get past WaitForFirstConsumer with Apache Yunikorn
> -
>
> Key: YUNIKORN-1253
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1253
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Alex Kalenyuk
>Assignee: Chen Yu Teng
>Priority: Major
>
> It seems that with Apache Yunikorn, WaitForFirstConsumer volume binding 
> storage classes are not supported (not sure if this is intended or not).
> This makes it problematic to use storage that is not globally accessible from 
> all nodes:
> [https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode]
> To reproduce a simple failure to use a WaitForFirstConsumer binding-backed 
> PVC:
> ```
> kind: Pod
> apiVersion: v1
> metadata:
>   name: test
>   namespace: default
>   labels:
>     app: sleep
>     applicationId: "sleep0001-node-selector-test"
>     queue: "root.sandbox"
> spec:
>   schedulerName: yunikorn
>   nodeSelector:
>     storage/ssd: 'true'
>   containers:
>     - name: test
>       resources:
>         limits:
>           cpu: 1
>           memory: 1G
>         requests:
>           cpu: 1
>           memory: 1G
>       image: busybox
>       command:
>         - sleep
>         - '100'
>       volumeMounts:
>         - name: scratch-volume
>           mountPath: /data
>   volumes:
>     - name: scratch-volume
>       ephemeral:
>         volumeClaimTemplate:
>           spec:
>             accessModes:
>               - ReadWriteOnce
>             resources:
>               requests:
>                 storage: 1Gi
>             storageClassName: hostpath-provisioner
>             volumeMode: Filesystem
> ```
> Storage used:
> https://github.com/kubevirt/hostpath-provisioner-operator
> A similar issue was spotted in:
> [https://github.com/kubernetes/kubernetes/issues/86262]
> And this PR seems to introduce the VolumeBinding filter but comments it out:
> [https://github.com/apache/yunikorn-k8shim/pull/313]
>  
> I might be off with the "Bug" type here so feel free to correct me;
> My thinking was that if introducing support for WFFC is trivial, this may 
> make sense to exist in older versions too.






[jira] [Commented] (YUNIKORN-1253) PVCs won't get past WaitForFirstConsumer with Apache Yunikorn

2022-07-05 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562783#comment-17562783
 ] 

Weiwei Yang commented on YUNIKORN-1253:
---

Thanks [~akalenyu], the description helps a lot.
As far as I know, we have not tested the "WaitForFirstConsumer" mode before, so 
it is probably not supported today.
If somebody has the bandwidth to take a look at this, that would be great. Cc 
[~wilfreds], [~ccondit], [~yuteng], [~tingyao]


> PVCs won't get past WaitForFirstConsumer with Apache Yunikorn
> -
>
> Key: YUNIKORN-1253
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1253
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Alex Kalenyuk
>Priority: Major
>
> It seems that with Apache Yunikorn, WaitForFirstConsumer volume binding 
> storage classes are not supported (not sure if this is intended or not).
> This makes it problematic to use storage that is not globally accessible from 
> all nodes:
> [https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode]
> To reproduce a simple failure to use a WaitForFirstConsumer binding-backed 
> PVC:
> ```
> kind: Pod
> apiVersion: v1
> metadata:
>   name: test
>   namespace: default
>   labels:
>     app: sleep
>     applicationId: "sleep0001-node-selector-test"
>     queue: "root.sandbox"
> spec:
>   schedulerName: yunikorn
>   nodeSelector:
>     storage/ssd: 'true'
>   containers:
>     - name: test
>       resources:
>         limits:
>           cpu: 1
>           memory: 1G
>         requests:
>           cpu: 1
>           memory: 1G
>       image: busybox
>       command:
>         - sleep
>         - '100'
>       volumeMounts:
>         - name: scratch-volume
>           mountPath: /data
>   volumes:
>     - name: scratch-volume
>       ephemeral:
>         volumeClaimTemplate:
>           spec:
>             accessModes:
>               - ReadWriteOnce
>             resources:
>               requests:
>                 storage: 1Gi
>             storageClassName: hostpath-provisioner
>             volumeMode: Filesystem
> ```
> Storage used:
> https://github.com/kubevirt/hostpath-provisioner-operator
> A similar issue was spotted in:
> [https://github.com/kubernetes/kubernetes/issues/86262]
> And this PR seems to introduce the VolumeBinding filter but comments it out:
> [https://github.com/apache/yunikorn-k8shim/pull/313]
>  
> I might be off with the "Bug" type here, so feel free to correct me.
> My thinking was that if introducing support for WFFC is trivial, it may 
> make sense to backport it to older versions too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-1237) Queue usage bar in web UI is broken

2022-06-15 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554823#comment-17554823
 ] 

Weiwei Yang edited comment on YUNIKORN-1237 at 6/16/22 12:31 AM:
-

cc [~wilfreds], [~ccondit], [~akhilpb]


was (Author: wwei):
cc [~wilfreds], [~ccondit]

> Queue usage bar in web UI is broken
> ---
>
> Key: YUNIKORN-1237
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1237
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Weiwei Yang
>Priority: Major
> Attachments: YuniKorn_UI.jpg
>
>
> After upgrading to 1.0, the queue usage bar is no longer working. Screenshot 
> attached.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-1237) Queue usage bar in web UI is broken

2022-06-15 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554823#comment-17554823
 ] 

Weiwei Yang commented on YUNIKORN-1237:
---

cc [~wilfreds], [~ccondit]

> Queue usage bar in web UI is broken
> ---
>
> Key: YUNIKORN-1237
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1237
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Weiwei Yang
>Priority: Major
> Attachments: YuniKorn_UI.jpg
>
>
> After upgrading to 1.0, the queue usage bar is no longer working. Screenshot 
> attached.






[jira] [Updated] (YUNIKORN-1237) Queue usage bar in web UI is broken

2022-06-14 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1237:
--
Attachment: YuniKorn_UI.jpg

> Queue usage bar in web UI is broken
> ---
>
> Key: YUNIKORN-1237
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1237
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Weiwei Yang
>Priority: Major
> Attachments: YuniKorn_UI.jpg
>
>
> After upgrading to 1.0, the queue usage bar is no longer working. Screenshot 
> attached.






[jira] [Created] (YUNIKORN-1237) Queue usage bar in web UI is broken

2022-06-14 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1237:
-

 Summary: Queue usage bar in web UI is broken
 Key: YUNIKORN-1237
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1237
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: webapp
Reporter: Weiwei Yang


After upgrading to 1.0, the queue usage bar is no longer working. Screenshot 
attached.






[jira] [Resolved] (YUNIKORN-1226) Deprecate maturity page in YuniKorn

2022-06-02 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1226.
---
Resolution: Fixed

> Deprecate maturity page in YuniKorn
> ---
>
> Key: YUNIKORN-1226
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1226
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Sunil G
>Assignee: Saad Ur Rahman
>Priority: Major
>  Labels: newbee, pull-request-available, trivial
> Fix For: 1.1.0
>
>
> https://yunikorn.apache.org/community/maturity needs to be removed.






[jira] [Commented] (YUNIKORN-1226) Deprecate maturity page in YuniKorn

2022-06-01 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545216#comment-17545216
 ] 

Weiwei Yang commented on YUNIKORN-1226:
---

The commit seems to have broken the build: 
https://github.com/apache/yunikorn-site/actions/runs/2425375564.
I have reverted the commit, but the build is still failing, and I am not sure 
why.
I wonder if this is because we have a Chinese version of the maturity doc that 
might also need to be removed. [~surahman] could you please take a look?
Before submitting the PR, you can verify your changes locally by building the 
website: https://github.com/apache/yunikorn-site#local-build

> Deprecate maturity page in YuniKorn
> ---
>
> Key: YUNIKORN-1226
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1226
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Sunil G
>Assignee: Saad Ur Rahman
>Priority: Major
>  Labels: newbee, pull-request-available, trivial
> Fix For: 1.1.0
>
>
> https://yunikorn.apache.org/community/maturity needs to be removed.






[jira] [Reopened] (YUNIKORN-1226) Deprecate maturity page in YuniKorn

2022-06-01 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reopened YUNIKORN-1226:
---

> Deprecate maturity page in YuniKorn
> ---
>
> Key: YUNIKORN-1226
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1226
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Sunil G
>Assignee: Saad Ur Rahman
>Priority: Major
>  Labels: newbee, pull-request-available, trivial
> Fix For: 1.1.0
>
>
> https://yunikorn.apache.org/community/maturity needs to be removed.






[jira] [Resolved] (YUNIKORN-1226) Deprecate maturity page in YuniKorn

2022-06-01 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1226.
---
Fix Version/s: 1.1.0
   Resolution: Fixed

Merged, thanks for taking care of this.

> Deprecate maturity page in YuniKorn
> ---
>
> Key: YUNIKORN-1226
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1226
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Sunil G
>Assignee: Saad Ur Rahman
>Priority: Major
>  Labels: newbee, pull-request-available, trivial
> Fix For: 1.1.0
>
>
> https://yunikorn.apache.org/community/maturity needs to be removed.






[jira] [Assigned] (YUNIKORN-1226) Deprecate maturity page in YuniKorn

2022-06-01 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YUNIKORN-1226:
-

Assignee: Saad Ur Rahman

> Deprecate maturity page in YuniKorn
> ---
>
> Key: YUNIKORN-1226
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1226
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Sunil G
>Assignee: Saad Ur Rahman
>Priority: Major
>  Labels: newbee, pull-request-available, trivial
>
> https://yunikorn.apache.org/community/maturity needs to be removed.






[jira] [Commented] (YUNIKORN-1228) Race condition when serializing K8s objects

2022-06-01 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545186#comment-17545186
 ] 

Weiwei Yang commented on YUNIKORN-1228:
---

Hi [~ccondit]

Could running predicates while the pod condition is being updated at the same 
time trigger this?
What is the fix for this?
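
To make the question concrete, here is a minimal Go sketch of the pattern the race report suggests: one goroutine hands an object to an encoder that stamps type information onto it, while another goroutine reads the same fields. The `pod`, `typeMeta`, `encode`, and `describe` names are hypothetical stand-ins, not client-go or YuniKorn code, and handing the encoder a copy is only one possible mitigation; it is an assumption, not necessarily the fix adopted for this issue.

```go
package main

import (
	"fmt"
	"sync"
)

// typeMeta is a hypothetical stand-in for the Kind/APIVersion fields
// carried by a Kubernetes object.
type typeMeta struct {
	Kind       string
	APIVersion string
}

// pod is a hypothetical, simplified object shared between goroutines.
type pod struct {
	typeMeta
	Name string
}

// deepCopy returns an independent copy that the caller may mutate freely.
// A plain struct copy is enough here because pod has no reference fields.
func (p *pod) deepCopy() *pod {
	c := *p
	return &c
}

// encode mimics an encoder that writes type information onto the object
// it is given before serializing it. The write is what makes sharing unsafe.
func encode(p *pod) string {
	p.Kind = "Pod"
	p.APIVersion = "v1"
	return p.APIVersion + "/" + p.Kind + "/" + p.Name
}

// describe only reads the type information, like an event recorder
// building an object reference.
func describe(p *pod) string {
	return p.Kind + ":" + p.Name
}

func main() {
	shared := &pod{typeMeta: typeMeta{Kind: "Pod", APIVersion: "v1"}, Name: "test"}

	var wg sync.WaitGroup
	wg.Add(2)
	// The writer works on a private copy, so the reader never observes
	// a concurrent write to the shared object.
	go func() { defer wg.Done(); _ = encode(shared.deepCopy()) }()
	go func() { defer wg.Done(); _ = describe(shared) }()
	wg.Wait()

	fmt.Println(describe(shared))
}
```

Passing `shared` itself to `encode` instead of a copy would reintroduce a read/write pair on the same fields, which is the kind of access `go run -race` flags.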

> Race condition when serializing K8s objects
> ---
>
> Key: YUNIKORN-1228
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1228
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>
> A race was recently uncovered during testing:
>  
> {noformat}
> ==
> WARNING: DATA RACE
> Read at 0x00c000581810 by goroutine 46:
>   k8s.io/apimachinery/pkg/apis/meta/v1.(*TypeMeta).GroupVersionKind()
>       
> /home/testuser/go/pkg/mod/k8s.io/apimachinery@v0.20.11/pkg/apis/meta/v1/meta.go:128
>  +0x64
>   k8s.io/client-go/tools/reference.GetReference()
>       
> /home/testuser/go/pkg/mod/k8s.io/client-go@v0.20.11/tools/reference/ref.go:59 
> +0x17d
>   k8s.io/client-go/tools/events.(*recorderImpl).Eventf()
>       
> /home/testuser/go/pkg/mod/k8s.io/client-go@v0.20.11/tools/events/event_recorder.go:46
>  +0x119
>   
> github.com/apache/yunikorn-k8shim/pkg/plugin/predicates.(*predicateManagerImpl).predicatesAllocate()
>       
> /home/testuser/repos/incubator-yunikorn-k8shim/pkg/plugin/predicates/predicate_manager.go:80
>  +0x36f
>   
> github.com/apache/yunikorn-k8shim/pkg/plugin/predicates.(*predicateManagerImpl).Predicates()
>       
> /home/testuser/repos/incubator-yunikorn-k8shim/pkg/plugin/predicates/predicate_manager.go:64
>  +0x52
>   github.com/apache/yunikorn-k8shim/pkg/cache.(*Context).IsPodFitNode()
>       /home/testuser/repos/incubator-yunikorn-k8shim/pkg/cache/context.go:341 
> +0x241
>   
> github.com/apache/yunikorn-k8shim/pkg/callback.(*AsyncRMCallback).Predicates()
>       
> /home/testuser/repos/incubator-yunikorn-k8shim/pkg/callback/scheduler_callback.go:187
>  +0xb7
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Node).preConditions()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/objects/node.go:386
>  +0x1c7
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Node).preAllocateConditions()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/objects/node.go:368
>  +0xe4
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1190
>  +0xe7
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodes()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1112
>  +0x7c4
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryAllocate()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:849
>  +0x7a4
>   github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryAllocate()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/objects/queue.go:1070
>  +0x18c
>   github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryAllocate()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/objects/queue.go:1082
>  +0xf7
>   
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).tryAllocate()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/partition.go:831 
> +0x15c
>   github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).schedule()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/context.go:137 
> +0x1b6
>   
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).internalSchedule()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/scheduler.go:77 
> +0x47
>   
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService.func2()
>       
> /home/testuser/repos/incubator-yunikorn-core/pkg/scheduler/scheduler.go:67 
> +0x39
> 
> Previous write at 0x00c000581810 by goroutine 47:
>   k8s.io/apimachinery/pkg/apis/meta/v1.(*TypeMeta).SetGroupVersionKind()
>       
> /home/testuser/go/pkg/mod/k8s.io/apimachinery@v0.20.11/pkg/apis/meta/v1/meta.go:123
>  +0x190
>   k8s.io/apimachinery/pkg/runtime.WithVersionEncoder.Encode()
>       
> /home/testuser/go/pkg/mod/k8s.io/apimachinery@v0.20.11/pkg/runtime/helper.go:241
>  +0x408
>   k8s.io/apimachinery/pkg/runtime.(*WithVersionEncoder).Encode()
>       :1 +0xfb
>   k8s.io/apimachinery/pkg/runtime.Encode()
>       
> /home/testuser/go/pkg/mod/k8s.io/apimachinery@v0.20.11/pkg/runtime/codec.go:50
>  +0xb3
>   k8s.io/client-go/rest.(*Request).Body()
>       

[jira] [Commented] (YUNIKORN-1226) Deprecate maturity page in YuniKorn

2022-06-01 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545195#comment-17545195
 ] 

Weiwei Yang commented on YUNIKORN-1226:
---

[~surahman] want to give a hand on this?
The code is in yunikorn-site repo: https://github.com/apache/yunikorn-site.

> Deprecate maturity page in YuniKorn
> ---
>
> Key: YUNIKORN-1226
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1226
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Sunil G
>Priority: Major
>  Labels: newbee, trivial
>
> https://yunikorn.apache.org/community/maturity needs to be removed.






[jira] [Resolved] (YUNIKORN-1224) Failed to publish website due to incompatible node version

2022-05-29 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1224.
---
Resolution: Fixed

Merged, thanks for the fix [~yuchaoran2011]

> Failed to publish website due to incompatible node version
> --
>
> Key: YUNIKORN-1224
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1224
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Chaoran Yu
>Assignee: Chaoran Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> Website publish no longer works. Example: 
> [https://github.com/apache/yunikorn-site/runs/6638395350?check_suite_focus=true#step:3:73.]
>  
> The error is:
>  
> yarn install v1.22.15
> info No lockfile found.
> [1/4] Resolving packages...
> [2/4] Fetching packages...
> error @docusaurus/core@2.0.0-beta.21: The engine "node" is incompatible with this 
> module. Expected version ">=16.14". Got "16.13.0"
> info Visit https://yarnpkg.com/en/docs/cli/install for documentation about this 
> command.
> error Found incompatible module.
> The command '/bin/sh -c yarn install' returned a non-zero code: 1
>  
> Need to change the base image to use node 16.14 or higher






[jira] [Commented] (YUNIKORN-1202) Add metrics to track partition resources

2022-05-25 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542373#comment-17542373
 ] 

Weiwei Yang commented on YUNIKORN-1202:
---

Hi [~wilfreds], [~surahman]

Could you please double-check this? 
It looks like we only expose the leaf queue resources today; see the code comments:
https://github.com/apache/yunikorn-core/blob/7f0ca094f04653f61ee6a369bfd8c3f352bf7c62/pkg/scheduler/objects/queue.go#L1266
https://github.com/apache/yunikorn-core/blob/7f0ca094f04653f61ee6a369bfd8c3f352bf7c62/pkg/scheduler/objects/queue.go#L1277

I want to set up some graphs to show queue usage, but I do not see where I can 
get the root queue or partition resources.
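
As an illustration of what is missing, the Go sketch below derives parent and root usage from per-leaf values by walking each dotted queue path. It is only a consumer-side sketch: the `rollUp` helper, the queue names, and the single integer resource are assumptions made for the example, and it says nothing about how yunikorn-core tracks or should export these values.

```go
package main

import (
	"fmt"
	"strings"
)

// rollUp takes the used resource of each leaf queue, keyed by its full
// dotted path (for example "root.prod.batch"), and returns the summed
// usage for every queue on those paths, including the root.
func rollUp(leafUsed map[string]int64) map[string]int64 {
	totals := make(map[string]int64)
	for path, used := range leafUsed {
		parts := strings.Split(path, ".")
		// Add the leaf's usage to each ancestor prefix and to the leaf itself.
		for i := 1; i <= len(parts); i++ {
			totals[strings.Join(parts[:i], ".")] += used
		}
	}
	return totals
}

func main() {
	totals := rollUp(map[string]int64{
		"root.sandbox":    4,
		"root.prod.batch": 10,
		"root.prod.web":   6,
	})
	fmt.Println(totals["root"], totals["root.prod"])
}
```

Even with such a roll-up, the total capacity of the partition cannot be recovered from used-resource values alone, which is why a dedicated partition metric is still needed.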

> Add metrics to track partition resources
> 
>
> Key: YUNIKORN-1202
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1202
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Priority: Major
>
> When we monitor the cluster resources, we need to track what is available vs 
> what is used. In the queue metrics, we currently have per-queue used 
> resource metrics, e.g. yunikorn_queue_root_xyz_used_resource. But we do not 
> have metrics to track the partition resources as a whole (both used and 
> total), so we need to add those too.






[jira] [Commented] (YUNIKORN-1221) [Umbrella] Service Configuration Design

2022-05-23 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541209#comment-17541209
 ] 

Weiwei Yang commented on YUNIKORN-1221:
---

Use Google Docs for the draft, and upload the final version to 
https://yunikorn.apache.org/docs/next/design/architecture


> [Umbrella] Service Configuration Design
> ---
>
> Key: YUNIKORN-1221
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1221
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler, shim - kubernetes
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>
> As YuniKorn has grown in complexity, we have created several configuration 
> options (and styles of configuration) that would benefit from being 
> standardized.
> This umbrella JIRA is to track design docs and work required to unify our 
> configuration story.






[jira] [Commented] (YUNIKORN-1213) The interval of the background health checker needs to be configurable

2022-05-23 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541191#comment-17541191
 ] 

Weiwei Yang commented on YUNIKORN-1213:
---

I also agree with [~wilfreds]: it is better to have a design doc and get this 
fully clarified.
[~surahman] if you are interested, please help us reach a consensus with a 
design doc once you have enough details for this issue.
Maybe we should have an umbrella JIRA for the scheduler config improvements. Making 
the health checker configurable can come after that.

> The interval of the background health checker needs to be configurable
> --
>
> Key: YUNIKORN-1213
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Saad Ur Rahman
>Priority: Major
>  Labels: pull-request-available
>
> YUNIKORN-1107 adds a health checker that runs in the background and verifies 
> the correctness of the scheduler data at a fixed 30s interval: 
> https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
> We need to make this configurable, so that the user can either set a 
> longer/shorter interval or disable it completely.






[jira] [Commented] (YUNIKORN-1213) The interval of the background health checker needs to be configurable

2022-05-23 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541190#comment-17541190
 ] 

Weiwei Yang commented on YUNIKORN-1213:
---

Thanks [~wilfreds], [~surahman]. Based on what has been discussed and the concerns 
raised, I can see two options:

1) Use one configmap for the scheduler configs and add a section for global 
scheduler configs. This section is *immutable*: if the user wants to 
change anything in it, they need to restart the scheduler. 
Hot-refresh keeps working as before; it just parses and updates the partitions 
section. 
2) Add a second configmap for the scheduler configs and make that configmap 
*immutable* (see: 
https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable).
Our configwatcher won't look at this configmap. The changes will be mainly on 
the deployment side, as we need to attach another configmap to the scheduler pod.

Are there any other solutions?

> The interval of the background health checker needs to be configurable
> --
>
> Key: YUNIKORN-1213
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Saad Ur Rahman
>Priority: Major
>  Labels: pull-request-available
>
> YUNIKORN-1107 adds a health checker that runs in the background and verifies 
> the correctness of the scheduler data at a fixed 30s interval: 
> https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
> We need to make this configurable, so that the user can either set a 
> longer/shorter interval or disable it completely.






[jira] [Commented] (YUNIKORN-1069) Make the state dump file's log rotation parameters configurable

2022-05-19 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539334#comment-17539334
 ] 

Weiwei Yang commented on YUNIKORN-1069:
---

Hi [~steinsgateted], could you please share your personal email address with me? 
I have a question for you. Feel free to email me via w...@apache.org. Thanks.

> Make the state dump file's log rotation parameters configurable
> ---
>
> Key: YUNIKORN-1069
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1069
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Anuraag Nalluri
>Assignee: ted
>Priority: Major
>
> We need to make the state dump file's log rotation parameters (max file size, 
> # of backups) configurable. 






[jira] [Commented] (YUNIKORN-1213) The interval of the background health checker needs to be configurable

2022-05-18 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539106#comment-17539106
 ] 

Weiwei Yang commented on YUNIKORN-1213:
---

Hi [~surahman]

This is a good one. Right now we do not have a global config section, which is 
why I suggested putting it under the partition section.
As far as I know, people only use one partition today, but that doesn't mean 
we should limit our use cases to a single partition.
Clearly, the config we are trying to add applies to all partitions. 
Given that, maybe we should discuss adding a global config section, something 
like:

{code}
scheduler:
  - healthcheck:
      enabled: true
      interval: 30s
partitions:
  - name: a
    xxx
  - name: b
    xxx
{code}

Let me raise this in the discussion channel and see if others have better 
ideas.

> The interval of the background health checker needs to be configurable
> --
>
> Key: YUNIKORN-1213
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Priority: Major
>
> YUNIKORN-1107 adds a health checker that runs in the background and verifies 
> the correctness of the scheduler data at a fixed 30s interval: 
> https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
> We need to make this configurable, so that the user can either set a 
> longer/shorter interval or disable it completely.






[jira] [Commented] (YUNIKORN-1218) Scheduler crashed with concurrent map access error in health checker

2022-05-18 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539093#comment-17539093
 ] 

Weiwei Yang commented on YUNIKORN-1218:
---

Thanks a lot for the quick review and merge [~ccondit]. 

> Scheduler crashed with concurrent map access error in health checker
> 
>
> Key: YUNIKORN-1218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1218
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
> Attachments: reproduce.patch, stacktrace.log
>
>
> After YUNIKORN-1107, the health checker runs as a background thread at a 30s 
> interval. We observed a few scheduler restarts in the past week that seem to 
> be caused by this thread, because it accesses the partition context unsafely, 
> without a proper read lock. I have uploaded a patch to reproduce this 
> locally, and a file with the stack trace from when the crash happens. 






[jira] [Updated] (YUNIKORN-1218) Scheduler crashed with concurrent map access error in health checker

2022-05-18 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1218:
--
Attachment: stacktrace.log

> Scheduler crashed with concurrent map access error in health checker
> 
>
> Key: YUNIKORN-1218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1218
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: reproduce.patch, stacktrace.log
>
>
> After YUNIKORN-1107, the health checker runs as a background thread at a 30s 
> interval. We observed a few scheduler restarts in the past week that seem to 
> be caused by this thread, because it accesses the partition context unsafely, 
> without a proper read lock. I have uploaded a patch to reproduce this 
> locally, and a file with the stack trace from when the crash happens. 






[jira] [Updated] (YUNIKORN-1218) Scheduler crashed with concurrent map access error in health checker

2022-05-18 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1218:
--
Attachment: reproduce.patch

> Scheduler crashed with concurrent map access error in health checker
> 
>
> Key: YUNIKORN-1218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1218
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: reproduce.patch, stacktrace.log
>
>
> After YUNIKORN-1107, the health checker runs as a background thread at a 30s 
> interval. We observed a few scheduler restarts in the past week that seem to 
> be caused by this thread, because it accesses the partition context unsafely, 
> without a proper read lock. I have uploaded a patch to reproduce this 
> locally, and a file with the stack trace from when the crash happens. 






[jira] [Created] (YUNIKORN-1218) Scheduler crashed with concurrent map access error in health checker

2022-05-18 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1218:
-

 Summary: Scheduler crashed with concurrent map access error in 
health checker
 Key: YUNIKORN-1218
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1218
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Weiwei Yang
Assignee: Weiwei Yang


After YUNIKORN-1107, the health checker runs as a background thread at a 30s 
interval. We observed a few scheduler restarts in the past week that seem to 
be caused by this thread, because it accesses the partition context unsafely, 
without a proper read lock. I have uploaded a patch to reproduce this 
locally, and a file with the stack trace from when the crash happens. 
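
For readers unfamiliar with this failure mode: Go aborts the process with a fatal error when it detects a map being read and written concurrently, so an unguarded read from a background checker can take the whole scheduler down. The sketch below shows the usual guard, a `sync.RWMutex` held for reading while iterating. The `partitionContext` type, its `nodes` map, and the method names are hypothetical simplifications, not the actual yunikorn-core structures or the attached patch.

```go
package main

import (
	"fmt"
	"sync"
)

// partitionContext is a hypothetical, simplified stand-in for the
// scheduler's partition state.
type partitionContext struct {
	sync.RWMutex
	nodes map[string]int // node name -> allocated resource (illustrative only)
}

// addNode is the write path, as used by the scheduling goroutines.
func (pc *partitionContext) addNode(name string, allocated int) {
	pc.Lock()
	defer pc.Unlock()
	pc.nodes[name] = allocated
}

// totalAllocated is the kind of read a background health checker performs.
// Holding the read lock keeps writers out while the map is iterated.
func (pc *partitionContext) totalAllocated() int {
	pc.RLock()
	defer pc.RUnlock()
	total := 0
	for _, allocated := range pc.nodes {
		total += allocated
	}
	return total
}

func main() {
	pc := &partitionContext{nodes: make(map[string]int)}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(i int) {
			defer wg.Done()
			pc.addNode(fmt.Sprintf("node-%d", i), 1)
		}(i)
		go func() {
			defer wg.Done()
			_ = pc.totalAllocated()
		}()
	}
	wg.Wait()

	fmt.Println(pc.totalAllocated())
}
```

Dropping the `RLock`/`RUnlock` pair from `totalAllocated` makes the same program racy, and it can then crash with a concurrent map access error depending on goroutine timing.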








[jira] [Commented] (YUNIKORN-1213) The interval of the background health checker needs to be configurable

2022-05-17 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538573#comment-17538573
 ] 

Weiwei Yang commented on YUNIKORN-1213:
---

hi [~surahman]

{quote}
For backward compatibility, we will likely have to retain the default check 
interval of 30 sec if no entries are found in the ConfigMap.
{quote}
Correct, if nothing is set, the default value is enabled, and interval is 30s.

For retrieving the configs, we should be able to access all configs via 
https://github.com/apache/yunikorn-core/blob/master/pkg/common/configs/configs.go.
 So we can get the configs by something like 
"configs.ConfigContext.Get(schedulerContext.GetPolicyGroup())" I guess.

When healthcheck is disabled, we shouldn't start the healthcheck thread. And 
the healthcheck endpoint can return either an empty result or something like 
you proposed. I think it should be fine.
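
A minimal sketch of the "don't start the thread when disabled" behaviour, assuming a hypothetical `healthCheckConfig` struct and `startHealthChecker` helper (neither name comes from yunikorn-core):

```go
package main

import (
	"fmt"
	"time"
)

// healthCheckConfig is a hypothetical holder for the two proposed settings.
type healthCheckConfig struct {
	Enabled  bool
	Interval time.Duration
}

// startHealthChecker launches the periodic check only when it is enabled.
// It returns a stop function, or nil when no goroutine was started.
func startHealthChecker(cfg healthCheckConfig, check func()) (stop func()) {
	if !cfg.Enabled || cfg.Interval <= 0 {
		return nil
	}
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(cfg.Interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				check()
			case <-done:
				return
			}
		}
	}()
	return func() { close(done) }
}

func main() {
	ran := make(chan struct{}, 1)
	stop := startHealthChecker(
		healthCheckConfig{Enabled: true, Interval: 10 * time.Millisecond},
		func() {
			select {
			case ran <- struct{}{}: // signal the first run, drop later ones
			default:
			}
		})
	<-ran // wait until the check has run at least once
	stop()

	disabled := startHealthChecker(healthCheckConfig{Enabled: false}, func() {})
	fmt.Println("disabled config starts nothing:", disabled == nil)
}
```

With something like this, the REST endpoint could check for the nil stop function (or the config flag) and return an empty result when the checker is off.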

> The interval of the background health checker needs to be configurable
> --
>
> Key: YUNIKORN-1213
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Priority: Major
>
> YUNIKORN-1107 adds a background health checker that verifies the 
> scheduler data correctness at a fixed 30s interval: 
> https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
>  We need to make this configurable, so that the user can either set a 
> longer/shorter interval or disable it completely.






[jira] [Assigned] (YUNIKORN-1214) Yunikorn does not honor queue acls when adding tasks to existing application

2022-05-17 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YUNIKORN-1214:
-

Assignee: Mit Desai

> Yunikorn does not honor queue acls when adding tasks to existing application
> 
>
> Key: YUNIKORN-1214
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1214
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Mit Desai
>Assignee: Mit Desai
>Priority: Major
>
> Currently when a pod is submitted without an application id, yunikorn 
> generates an application id from the namespace where the pod is submitted, 
> using this format: 'yunikorn-<namespace>-autogen'
> When another pod without an application id is submitted to the same 
> namespace, an application with the generated name already exists, 
> so the next pod gets added as a task to the existing application. I see that 
> the queue acls are also not taken into consideration when this happens, and 
> the new pod becomes part of the same queue as before.
> For Example:
> 1. A pod submitted to namespace 'a' will get the application id 
> yunikorn-a-autogen and, based on the acls, let's assume it lands in queue 
> 'queue-a'
> 2. If another pod (which does not have an app id) is submitted in the same 
> namespace 'a' by a test user who is not authorized to run apps in queue 
> 'queue-a', it will be grouped with the previous application and will 
> start running in the same queue.
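
The reported behaviour can be modelled with a small Go sketch. Everything here (`autogenAppID`, `app`, `submit`) is a hypothetical simplification and not the shim's real code; the only thing taken from the description is the ID pattern seen in the example (yunikorn-a-autogen for namespace 'a').

```go
package main

import "fmt"

// autogenAppID builds the ID in the pattern shown in the example above.
func autogenAppID(namespace string) string {
	return "yunikorn-" + namespace + "-autogen"
}

// app is a hypothetical, stripped-down application record.
type app struct {
	queue string
	tasks []string
}

// submit models the reported behaviour: the queue is resolved (and ACLs
// would be checked) only when the application is first created. A later
// pod in the same namespace is appended as a task without a new check.
func submit(apps map[string]*app, namespace, pod, queueForUser string) *app {
	id := autogenAppID(namespace)
	a, ok := apps[id]
	if !ok {
		a = &app{queue: queueForUser}
		apps[id] = a
	}
	a.tasks = append(a.tasks, pod)
	return a
}

func main() {
	apps := map[string]*app{}
	submit(apps, "a", "pod-1", "queue-a")
	// The second user would normally land in another queue, but the pod
	// is grouped into the existing application and inherits queue-a.
	a := submit(apps, "a", "pod-2", "queue-b")
	fmt.Println(autogenAppID("a"), a.queue, len(a.tasks))
}
```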






[jira] [Commented] (YUNIKORN-1213) The interval of the background health checker needs to be configurable

2022-05-17 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538479#comment-17538479
 ] 

Weiwei Yang commented on YUNIKORN-1213:
---

hi [~surahman]

Thanks for looking at this. Yes, I think we can load this from the config file. 
Currently, yunikorn loads its configuration from a configmap, and the format is 
documented here: 
https://yunikorn.apache.org/docs/user_guide/queue_config#configuration. As you 
can see, the configuration has 2 parts: partitions and queues. I think we can 
add a config property at the partition level, such as:

{code}
healthcheck:
  enabled: true/false
  interval: 30s
{code}

what do you think?
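
To make the proposal a bit more concrete, here is a hedged Go sketch of how the two values could be interpreted once they are read from the config. `parseHealthCheck` and `defaultInterval` are hypothetical names, and the "empty means default" rule is an assumption based on the current fixed 30s behaviour, not existing yunikorn-core code.

```go
package main

import (
	"fmt"
	"time"
)

// defaultInterval matches the currently hard-coded 30s interval.
const defaultInterval = 30 * time.Second

// parseHealthCheck interprets the raw string values of the proposed
// "enabled" and "interval" properties. An empty string means "not set",
// in which case the defaults (enabled, 30s) apply.
func parseHealthCheck(enabled, interval string) (bool, time.Duration, error) {
	on := true
	switch enabled {
	case "", "true":
		// keep the default
	case "false":
		on = false
	default:
		return false, 0, fmt.Errorf("invalid enabled value %q", enabled)
	}

	d := defaultInterval
	if interval != "" {
		parsed, err := time.ParseDuration(interval)
		if err != nil {
			return false, 0, fmt.Errorf("invalid interval %q: %w", interval, err)
		}
		if parsed <= 0 {
			return false, 0, fmt.Errorf("interval must be positive, got %q", interval)
		}
		d = parsed
	}
	return on, d, nil
}

func main() {
	on, d, err := parseHealthCheck("", "")
	fmt.Println(on, d, err)

	on, d, err = parseHealthCheck("false", "1m")
	fmt.Println(on, d, err)

	_, _, err = parseHealthCheck("true", "soon")
	fmt.Println(err != nil)
}
```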

> The interval of the background health checker needs to be configurable
> --
>
> Key: YUNIKORN-1213
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Priority: Major
>
> YUNIKORN-1107 adds a background health checker that verifies the 
> scheduler data correctness at a fixed 30s interval: 
> https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
>  We need to make this configurable, so that the user can either set a 
> longer/shorter interval or disable it completely.






[jira] [Created] (YUNIKORN-1213) The interval of the background health checker needs to be configurable

2022-05-17 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1213:
-

 Summary: The interval of the background health checker needs to be 
configurable
 Key: YUNIKORN-1213
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler
Reporter: Weiwei Yang


YUNIKORN-1107 adds a background health checker that verifies the scheduler 
data correctness at a fixed 30s interval: 
https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
 We need to make this configurable, so that the user can either set a 
longer/shorter interval or disable it completely.






[jira] [Resolved] (YUNIKORN-1203) Missing pending/available resource metrics for queues

2022-05-05 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1203.
---
Resolution: Invalid

> Missing pending/available resource metrics for queues
> -
>
> Key: YUNIKORN-1203
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1203
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Priority: Major
>
> Per the documentation here: 
> https://yunikorn.apache.org/docs/next/performance/metrics#queue-metrics. 
> Right now only usedResourceMetrics is available; the pending and available 
> metrics are both missing.






[jira] [Created] (YUNIKORN-1203) Missing pending/available resource metrics for queues

2022-05-05 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1203:
-

 Summary: Missing pending/available resource metrics for queues
 Key: YUNIKORN-1203
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1203
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Weiwei Yang


Per the documentation here: 
https://yunikorn.apache.org/docs/next/performance/metrics#queue-metrics. Right 
now only usedResourceMetrics is available; the pending and available metrics 
are both missing.






[jira] [Created] (YUNIKORN-1193) Release build failed on arm64

2022-04-27 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1193:
-

 Summary: Release build failed on arm64
 Key: YUNIKORN-1193
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1193
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: webapp
Affects Versions: 1.0.0
Reporter: Weiwei Yang


After downloading the released tarball and building the docker images from 
source with "make", I got the following error on an M1 Mac (arm64 arch).

{code}
 => ERROR [buildstage 5/6] RUN yarn install 

74.7s
--
 > [buildstage 5/6] RUN yarn install:
#11 0.467 yarn install v1.22.18
#11 0.602 [1/4] Resolving packages...
#11 0.963 [2/4] Fetching packages...
#11 54.52 [3/4] Linking dependencies...
#11 72.57 [4/4] Building fresh packages...
#11 73.85 info Visit https://yarnpkg.com/en/docs/cli/install for documentation 
about this command.
#11 73.85 error /usr/uiapp/node_modules/puppeteer: Command failed.
#11 73.85 Exit code: 1
#11 73.85 Command: node install.js
#11 73.85 Arguments:
#11 73.85 Directory: /usr/uiapp/node_modules/puppeteer
#11 73.85 Output:
#11 73.85 The chromium binary is not available for arm64.
#11 73.85 If you are on Ubuntu, you can install with:
#11 73.85
#11 73.85  sudo apt install chromium
#11 73.85
#11 73.85
#11 73.85  sudo apt install chromium-browser
#11 73.85
#11 73.85 
/usr/uiapp/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserFetcher.js:119
#11 73.85 throw new Error();
#11 73.85 ^
#11 73.85
#11 73.85 Error
#11 73.85 at 
/usr/uiapp/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserFetcher.js:119:27
#11 73.85 at FSReqCallback.oncomplete (node:fs:198:21)
#11 73.93 warning Error running install script for optional dependency: 
"/usr/uiapp/node_modules/nice-napi: Command failed.
#11 73.93 Exit code: 1
#11 73.93 Command: node-gyp-build
#11 73.93 Arguments:
#11 73.93 Directory: /usr/uiapp/node_modules/nice-napi
#11 73.93 Output:
#11 73.93 gyp info it worked if it ends with ok
#11 73.93 gyp info using node-gyp@8.4.1
#11 73.93 gyp info using node@16.14.2 | linux | arm64
#11 73.93 gyp ERR! find Python
#11 73.93 gyp ERR! find Python Python is not set from command line or npm 
configuration
#11 73.93 gyp ERR! find Python Python is not set from environment variable 
PYTHON
#11 73.93 gyp ERR! find Python checking if \"python3\" can be used
#11 73.93 gyp ERR! find Python - \"python3\" is not in PATH or produced an error
#11 73.93 gyp ERR! find Python checking if \"python\" can be used
#11 73.93 gyp ERR! find Python - \"python\" is not in PATH or produced an error
#11 73.93 gyp ERR! find Python
#11 73.93 gyp ERR! find Python 
**
#11 73.93 gyp ERR! find Python You need to install the latest version of Python.
#11 73.93 gyp ERR! find Python Node-gyp should be able to find and use Python. 
If not,
#11 73.93 gyp ERR! find Python you can try one of the following options:
#11 73.93 gyp ERR! find Python - Use the switch 
--python=\"/path/to/pythonexecutable\"
#11 73.93 gyp ERR! find Python   (accepted by both node-gyp and npm)
#11 73.93 gyp ERR! find Python - Set the environment variable PYTHON
#11 73.93 gyp ERR! find Python - Set the npm configuration variable python:
#11 73.93 gyp ERR! find Python   npm config set python 
\"/path/to/pythonexecutable\"
#11 73.93 gyp ERR! find Python For more information consult the documentation 
at:
#11 73.93 gyp ERR! find Python https://github.com/nodejs/node-gyp#installation
#11 73.93 gyp ERR! find Python 
**
#11 73.93 gyp ERR! find Python
#11 73.93 gyp ERR! configure error
#11 73.93 gyp ERR! stack Error: Could not find any Python installation to use
#11 73.93 gyp ERR! stack at PythonFinder.fail 
(/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/find-python.js:330:47)
#11 73.93 gyp ERR! stack at PythonFinder.runChecks 
(/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/find-python.js:159:21)
#11 73.93 gyp ERR! stack at PythonFinder. 
(/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/find-python.js:202:16)
#11 73.93 gyp ERR! stack at PythonFinder.execFileCallback 
(/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/find-python.js:294:16)
#11 73.93 gyp ERR! stack at exithandler (node:child_process:406:5)
#11 73.93 gyp ERR! stack at ChildProcess.errorhandler 
(node:child_process:418:5)
#11 73.93 gyp ERR! stack at ChildProcess.emit (node:events:526:28)
#11 73.93 gyp ERR! stack at Process.ChildProcess._handle.onexit 
(node:internal/child_process:289:12)
#11 73.93 gyp ERR! stack at onErrorNT (node:internal/child_process:478:16)
#11 73.93 gyp ERR! stack at 

[jira] [Commented] (YUNIKORN-1185) Small applications starve large ones in the same FIFO queue

2022-04-25 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527803#comment-17527803
 ] 

Weiwei Yang commented on YUNIKORN-1185:
---

+ [~wilfreds], [~ccondit]

The FIFO policy won't solve this problem. It simply schedules things in FIFO 
order; if one request doesn't fit, it moves on to the next one. That means if 
a job at the front of the queue has a large request that cannot be satisfied 
at this point, the scheduler will go ahead and schedule other jobs later in 
the queue. This can cause the problem you mentioned, and we knew that from the 
beginning.

To alleviate this issue, we created the reservation logic. Basically, it 
reserves resources on a certain node for a large pod, to give it a higher 
priority during the scheduling phase. We were striving for a balance between 
resource reservation and utilization. If we reserved the resources for a long 
time and didn't release them until the request was satisfied or the 
reservation expired, we could solve the "starving" issue mentioned in the 
description. The old behavior was like that, but it introduced a few 
problems: it hurts utilization, and it conflicts with cluster-autoscaler. The 
latter issue is the bigger one and probably the major reason why we relaxed 
the reservation semantics. As an example, if we reserve 20G of resources on a 
node for a large pod and do not allocate those resources to other pending 
pods, the cluster will be unable to scale up, because the autoscaler doesn't 
recognize the reserved resources and still thinks they are available for the 
pending pods.

The current logic still makes reservations, which means that in each scheduling 
cycle, we still try to allocate resources for the starving pods first. If A is 
submitted before B, A is currently starving and reserved on node X, both A 
and B are pending, and node X (or any other node) has available resources 
for A, we will still allocate resources to A first. Otherwise, the scheduler 
continues to look for the next candidate. This logic works with cluster 
autoscaler and solves the utilization issue, but it can't solve the "starving" 
problem in every scenario. 

Maybe there is another algorithm that can solve this problem; I would love to 
hear more ideas.
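
For discussion, here is a heavily simplified Go sketch of the relaxed reservation behaviour described above. It is not the yunikorn-core implementation: it assumes a single pool of free capacity instead of per-node reservations, and `request` and `scheduleCycle` are hypothetical names.

```go
package main

import "fmt"

// request is a hypothetical pending ask from an application.
type request struct {
	name     string
	size     int
	reserved bool // set once the request could not be placed in a cycle
}

// scheduleCycle runs one simplified scheduling cycle. Reserved requests
// are tried first; the rest follow in FIFO order. A request that does
// not fit is marked reserved and skipped, so smaller requests behind it
// can still use the free capacity.
func scheduleCycle(pending []*request, free int) (allocated []string, remaining []*request, left int) {
	done := map[*request]bool{}
	tryAlloc := func(r *request) bool {
		if r.size > free {
			return false
		}
		free -= r.size
		allocated = append(allocated, r.name)
		done[r] = true
		return true
	}
	for _, r := range pending { // pass 1: starving (reserved) requests
		if r.reserved {
			tryAlloc(r)
		}
	}
	for _, r := range pending { // pass 2: everything else, FIFO order
		if done[r] || r.reserved {
			continue
		}
		if !tryAlloc(r) {
			r.reserved = true
		}
	}
	for _, r := range pending {
		if !done[r] {
			remaining = append(remaining, r)
		}
	}
	return allocated, remaining, free
}

func main() {
	a := &request{name: "A", size: 8}
	b := &request{name: "B", size: 2}
	c := &request{name: "C", size: 2}

	// Cycle 1: only 4 units free, so A is skipped and becomes reserved.
	alloc, rest, _ := scheduleCycle([]*request{a, b, c}, 4)
	fmt.Println(alloc, len(rest))

	// Cycle 2: 8 units free and a new request D has arrived; A goes first.
	d := &request{name: "D", size: 2}
	alloc, _, _ = scheduleCycle(append(rest, d), 8)
	fmt.Println(alloc)
}
```

In this model A only gets placed if 8 units happen to be free at the start of some cycle. A steady stream of small requests can keep that from ever happening, which is the remaining "starving" case.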

> Small applications starve large ones in the same FIFO queue
> ---
>
> Key: YUNIKORN-1185
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1185
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Adam Novak
>Priority: Major
>
> Even when I set my queue to use a {{fifo}} application sort policy, 
> applications that enter the queue later are able to run before applications 
> that are submitted earlier; the queue does not behave like a first-in, 
> first-out queue.
> Specifically, this happens when the later applications are smaller than the 
> earlier ones. If enough small applications are available in the queue to 
> immediately fill any space that opens up, they will schedule as soon as space 
> is available. YuniKorn doesn't wait for enough space to become free to 
> schedule waiting large applications, no matter how much older they are than 
> the things that are passing them in the queue.
> The result of this is that a steady supply of small applications can keep a 
> larger application waiting indefinitely, causing starvation.
> The relevant code seems to be [in Queue's tryAllocate method|#L1069-L1070]. 
> YuniKorn goes through all the applications in the queue in order, and 
> greedily schedules work items until no more fit. If no space large enough to 
> fit any work from the first application currently exists, it will always fill 
> what space there is with work from applications later in the queue. It will 
> never wait to drain out space on a node to fit work from that first 
> application.
> How can I configure or modify YuniKorn to prevent starvation, and make the 
> applications in a queue execute in order, or at least not arbitrarily far out 
> of order?
> (I already tried the {{stateaware}} queue sort, but it doesn't seem to work 
> well with applications as small as mine. It appeared to run only one 
> application at a time, because my applications finish so fast.)
> h4. Replication
> First, have a Kubernetes cluster with a node {{k1.kube}} with 96 cores.
> Next, set up YuniKorn 0.12.2 with this {{values.yml}} for the Helm chart:
>  
> {code:java}
> embedAdmissionController: false
> configuration: |
>   partitions:
>     -
>       name: default
>       placementrules:
>         - name: tag
>           value: namespace
>           create: true
>       queues:
>         - name: root
>           submitacl: '*'
>           childtemplate:
>            properties:
>              application.sort.policy: 

[jira] [Resolved] (YUNIKORN-1160) Fix codecov after migration

2022-03-31 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1160.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Fix codecov after migration
> ---
>
> Key: YUNIKORN-1160
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1160
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Craig Condit
>Assignee: Weiwei Yang
>Priority: Major
> Fix For: 1.0.0
>
>
> After the rename of the repos, our code coverage reports are no longer 
> happening. This seems to be due to passing the new repo names into the 
> codecov API. Someone with credentials to that site will need to update to the 
> new structure.






[jira] [Commented] (YUNIKORN-1160) Fix codecov after migration

2022-03-31 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515146#comment-17515146
 ] 

Weiwei Yang commented on YUNIKORN-1160:
---

Codecov is working; all repos are set up correctly.

> Fix codecov after migration
> ---
>
> Key: YUNIKORN-1160
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1160
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Craig Condit
>Assignee: Weiwei Yang
>Priority: Major
>
> After the rename of the repos, our code coverage reports are no longer 
> happening. This seems to be due to passing the new repo names into the 
> codecov API. Someone with credentials to that site will need to update to the 
> new structure.






[jira] [Commented] (YUNIKORN-1160) Fix codecov after migration

2022-03-29 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513878#comment-17513878
 ] 

Weiwei Yang commented on YUNIKORN-1160:
---

I do not have extra credentials. Only the ASF infra team does, so I have 
created INFRA-23047 to track this issue. 

> Fix codecov after migration
> ---
>
> Key: YUNIKORN-1160
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1160
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Craig Condit
>Assignee: Weiwei Yang
>Priority: Major
>
> After the rename of the repos, our code coverage reports are no longer 
> happening. This seems to be due to passing the new repo names into the 
> codecov API. Someone with credentials to that site will need to update to the 
> new structure.






[jira] [Updated] (YUNIKORN-1141) [Umbrella] Post graduation tasks

2022-03-21 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1141:
--
Description: 
As of March 16th, 2022, YuniKorn has been officially graduated from the 
incubator and become an ASF top-level project, the roster has been established: 
https://whimsy.apache.org/roster/committee/yunikorn. This JIRA is an umbrella 
to track the post-graduation tasks, including:
# Press Releases for new TLPs
# Handover
# Transferring Resources
# Final Revision of Podling Incubation Records

related doc: 
https://svn.apache.org/repos/infra/websites/production/incubator/content/guides/graduation.html#top-level-board-proposal

  was:
As of March 16th, YuniKorn has been officially graduated from the incubator and 
become an ASF top-level project, the roster has been established: 
https://whimsy.apache.org/roster/committee/yunikorn. This JIRA is an umbrella 
to track the post-graduation tasks, including:
# Press Releases for new TLPs
# Handover
# Transferring Resources
# Final Revision of Podling Incubation Records

related doc: 
https://svn.apache.org/repos/infra/websites/production/incubator/content/guides/graduation.html#top-level-board-proposal


> [Umbrella] Post graduation tasks
> 
>
> Key: YUNIKORN-1141
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1141
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common, release, scheduler-interface, shim - 
> kubernetes, webapp, website
>Reporter: Weiwei Yang
>Priority: Major
>
> As of March 16th, 2022, YuniKorn has been officially graduated from the 
> incubator and become an ASF top-level project, the roster has been 
> established: https://whimsy.apache.org/roster/committee/yunikorn. This JIRA 
> is an umbrella to track the post-graduation tasks, including:
> # Press Releases for new TLPs
> # Handover
> # Transferring Resources
> # Final Revision of Podling Incubation Records
> related doc: 
> https://svn.apache.org/repos/infra/websites/production/incubator/content/guides/graduation.html#top-level-board-proposal






[jira] [Updated] (YUNIKORN-1141) [Umbrella] Post graduation tasks

2022-03-21 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1141:
--
Description: 
As of March 16th, YuniKorn has been officially graduated from the incubator and 
become an ASF top-level project, the roster has been established: 
https://whimsy.apache.org/roster/committee/yunikorn. This JIRA is an umbrella 
to track the post-graduation tasks, including:
# Press Releases for new TLPs
# Handover
# Transferring Resources
# Final Revision of Podling Incubation Records

related doc: 
https://svn.apache.org/repos/infra/websites/production/incubator/content/guides/graduation.html#top-level-board-proposal

  was:
As of March 16th, YuniKorn has been officially graduated from the incubator and 
become an ASF top-level project, the roster has been established: 
https://whimsy.apache.org/roster/committee/yunikorn. This JIRA is an umbrella 
to track the post-graduation tasks, including:
# Press Releases for new TLPs
# Handover
# Transferring Resources
# Final Revision of Podling Incubation Records
related doc: 
https://svn.apache.org/repos/infra/websites/production/incubator/content/guides/graduation.html#top-level-board-proposal


> [Umbrella] Post graduation tasks
> 
>
> Key: YUNIKORN-1141
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1141
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common, release, scheduler-interface, shim - 
> kubernetes, webapp, website
>Reporter: Weiwei Yang
>Priority: Major
>
> As of March 16th, YuniKorn has been officially graduated from the incubator 
> and become an ASF top-level project, the roster has been established: 
> https://whimsy.apache.org/roster/committee/yunikorn. This JIRA is an umbrella 
> to track the post-graduation tasks, including:
> # Press Releases for new TLPs
> # Handover
> # Transferring Resources
> # Final Revision of Podling Incubation Records
> related doc: 
> https://svn.apache.org/repos/infra/websites/production/incubator/content/guides/graduation.html#top-level-board-proposal






[jira] [Created] (YUNIKORN-1141) [Umbrella] Post graduation tasks

2022-03-21 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1141:
-

 Summary: [Umbrella] Post graduation tasks
 Key: YUNIKORN-1141
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1141
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common, release, scheduler-interface, shim - 
kubernetes, webapp, website
Reporter: Weiwei Yang


As of March 16th, YuniKorn has been officially graduated from the incubator and 
become an ASF top-level project, the roster has been established: 
https://whimsy.apache.org/roster/committee/yunikorn. This JIRA is an umbrella 
to track the post-graduation tasks, including:
# Press Releases for new TLPs
# Handover
# Transferring Resources
# Final Revision of Podling Incubation Records
related doc: 
https://svn.apache.org/repos/infra/websites/production/incubator/content/guides/graduation.html#top-level-board-proposal






[jira] [Updated] (YUNIKORN-1137) Fix typo in placement related messages

2022-03-18 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1137:
--
Summary: Fix typo in placement related messages  (was: Missing e in the 
word placment)

> Fix typo in placement related messages
> --
>
> Key: YUNIKORN-1137
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1137
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: ted
>Assignee: ted
>Priority: Minor
>  Labels: newbie, pull-request-available
>
> Missing "e" in the word "placment"






[jira] [Resolved] (YUNIKORN-1137) Fix typo in placement related messages

2022-03-18 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1137.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Fix typo in placement related messages
> --
>
> Key: YUNIKORN-1137
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1137
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: ted
>Assignee: ted
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 1.0.0
>
>
> Missing "e" in the word "placment"






[jira] [Resolved] (YUNIKORN-951) Add perf-tool description into benchmarking tutorial page

2022-03-16 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-951.
--
Fix Version/s: 1.0.0
   Resolution: Fixed

> Add perf-tool description into benchmarking tutorial page
> -
>
> Key: YUNIKORN-951
> URL: https://issues.apache.org/jira/browse/YUNIKORN-951
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: release, website
>Reporter: Chen Yu Teng
>Assignee: Chen Yu Teng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Describe the performance tool and how to use it.
> Add the perf tool doc to the yunikorn 
> website ([https://yunikorn.apache.org/docs/performance/performance_tutorial])
> Expected content:
>  #  Case settings in conf.yaml
>  ** Describe the perf cases with their default parameters, according to the 
> conf.yaml content
>  ** Parameter descriptions
>  #  How to start a test
>  ** commands 
>  #  Meaning of the outputs
>  ** Explain which diagrams are produced with the default conf.yaml






[jira] [Created] (YUNIKORN-1122) Move constants to scheduler interface

2022-03-15 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1122:
-

 Summary: Move constants to scheduler interface
 Key: YUNIKORN-1122
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1122
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common, scheduler-interface, shim - kubernetes
Reporter: Weiwei Yang
Assignee: TingYao Huang


While reviewing YUNIKORN-1103, I found that quite a few constants are still 
defined in the shim/core repos. Since we have the ability to define constants 
in SI, we should move all COMMON constants to SI.  






[jira] [Resolved] (YUNIKORN-1103) Support fetching queue name from pod annotation "yunikorn.apache.org/queue"

2022-03-15 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1103.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Support fetching queue name from pod annotation "yunikorn.apache.org/queue"
> ---
>
> Key: YUNIKORN-1103
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1103
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Today, when we submit a Spark job, in order to know which queue this job will 
> be submitted to, we fetch the queue name from pod's spec, under the label:
> Pod {
>  Label:
>  queue: "root.abc"
> }
> besides that, we also want to support fetching queue names from pod 
> annotation:
> Pod {
>  annotation:
>  yunikorn.apache.org/queue: "root.abc"
> }
> BTW, this is for the static queue case, not with the placement rule.






[jira] [Updated] (YUNIKORN-1103) Support fetching queue name from pod annotation "yunikorn.apache.org/queue"

2022-03-01 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-1103:
--
Target Version: 1.0.0

> Support fetching queue name from pod annotation "yunikorn.apache.org/queue"
> ---
>
> Key: YUNIKORN-1103
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1103
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>
> Today, when we submit a Spark job, in order to know which queue this job will 
> be submitted to, we fetch the queue name from pod's spec, under the label:
> Pod {
>  Label:
>  queue: "root.abc"
> }
> Besides that, we also want to support fetching the queue name from a pod 
> annotation:
> Pod {
>  annotation:
>  yunikorn.apache.org/queue: "root.abc"
> }
> BTW, this is for the static queue case, not with the placement rule.






[jira] [Commented] (YUNIKORN-596) document pod labels and annotations

2022-03-01 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499895#comment-17499895
 ] 

Weiwei Yang commented on YUNIKORN-596:
--

Please make sure YUNIKORN-1103 gets done first before adding the document.

> document pod labels and annotations
> ---
>
> Key: YUNIKORN-596
> URL: https://issues.apache.org/jira/browse/YUNIKORN-596
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Wilfred Spiegelenburg
>Assignee: Ting Yao,Huang
>Priority: Minor
>  Labels: newbie
>
> We have a range of labels and annotations that are set or can be set by the 
> end user on the pods. These labels and annotations are used and get checked 
> in the code but we have not defined and documented any of them.
> We should keep a list of labels and annotations we know and use and what they 
> are used for.
>  






[jira] [Created] (YUNIKORN-1103) Support fetching queue name from pod annotation "yunikorn.apache.org/queue"

2022-03-01 Thread Weiwei Yang (Jira)
Weiwei Yang created YUNIKORN-1103:
-

 Summary: Support fetching queue name from pod annotation 
"yunikorn.apache.org/queue"
 Key: YUNIKORN-1103
 URL: https://issues.apache.org/jira/browse/YUNIKORN-1103
 Project: Apache YuniKorn
  Issue Type: Improvement
Reporter: Weiwei Yang
Assignee: TingYao Huang


Today, when we submit a Spark job, in order to know which queue this job will 
be submitted to, we fetch the queue name from pod's spec, under the label:

Pod {
 Label:
 queue: "root.abc"
}

Besides that, we also want to support fetching the queue name from a pod annotation:

Pod {
 annotation:
 yunikorn.apache.org/queue: "root.abc"
}

BTW, this is for the static queue case, not with the placement rule.
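
The lookup described above could be sketched roughly as follows. This is only an illustration, not the shim's actual code: the helper name queueName is hypothetical, and the choice to let the label win when both are set is an assumption, since the issue does not state a precedence. Only the two keys come from the description.

```go
package main

import "fmt"

// Keys taken from the issue text: the existing "queue" label and the
// proposed "yunikorn.apache.org/queue" annotation.
const (
	queueLabel      = "queue"
	queueAnnotation = "yunikorn.apache.org/queue"
)

// queueName is a hypothetical helper, not a function from the shim.
// Assumption: the label takes precedence when both are present; the
// issue does not specify an order.
func queueName(labels, annotations map[string]string) (string, bool) {
	if q, ok := labels[queueLabel]; ok && q != "" {
		return q, true
	}
	if q, ok := annotations[queueAnnotation]; ok && q != "" {
		return q, true
	}
	return "", false
}

func main() {
	// A pod that only carries the annotation.
	q, ok := queueName(nil, map[string]string{queueAnnotation: "root.abc"})
	fmt.Println(q, ok)
}
```

Returning a second boolean lets the caller tell "no queue set" apart from an empty name, which matters if a default queue is applied elsewhere.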






[jira] [Assigned] (YUNIKORN-1103) Support fetching queue name from pod annotation "yunikorn.apache.org/queue"

2022-03-01 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YUNIKORN-1103:
-

Assignee: Ting Yao,Huang  (was: TingYao Huang)

> Support fetching queue name from pod annotation "yunikorn.apache.org/queue"
> ---
>
> Key: YUNIKORN-1103
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1103
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>
> Today, when we submit a Spark job, in order to know which queue this job will 
> be submitted to, we fetch the queue name from pod's spec, under the label:
> Pod {
>  Label:
>  queue: "root.abc"
> }
> Besides that, we also want to support fetching the queue name from a pod 
> annotation:
> Pod {
>  annotation:
>  yunikorn.apache.org/queue: "root.abc"
> }
> BTW, this is for the static queue case, not with the placement rule.






[jira] [Commented] (YUNIKORN-1042) "Why YuniKorn" is not rendered properly on mobile

2022-02-26 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498535#comment-17498535
 ] 

Weiwei Yang commented on YUNIKORN-1042:
---

Sure, [~sankeerthan.kasilingam], thanks for offering the help!
Assigned the jira to you, pls go ahead. Thx

> "Why YuniKorn" is not rendered properly on mobile
> -
>
> Key: YUNIKORN-1042
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1042
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Chaoran Yu
>Assignee: Sankeerthan Kasilingam
>Priority: Major
>  Labels: newbie
> Attachments: IMG_499E342B4680-1.jpeg
>
>
> Likely due to YUNIKORN-1036, now the Why YuniKorn section is not rendered 
> properly when viewed on mobile. See the attached screenshot






[jira] [Assigned] (YUNIKORN-1042) "Why YuniKorn" is not rendered properly on mobile

2022-02-26 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YUNIKORN-1042:
-

Assignee: Sankeerthan Kasilingam

> "Why YuniKorn" is not rendered properly on mobile
> -
>
> Key: YUNIKORN-1042
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1042
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: website
>Reporter: Chaoran Yu
>Assignee: Sankeerthan Kasilingam
>Priority: Major
>  Labels: newbie
> Attachments: IMG_499E342B4680-1.jpeg
>
>
> Likely due to YUNIKORN-1036, now the Why YuniKorn section is not rendered 
> properly when viewed on mobile. See the attached screenshot






[jira] [Commented] (YUNIKORN-725) Support arm64

2022-02-25 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498359#comment-17498359
 ] 

Weiwei Yang commented on YUNIKORN-725:
--

That's fine, thanks [~srisco]. Thanks for the remark as well : )


> Support arm64
> -
>
> Key: YUNIKORN-725
> URL: https://issues.apache.org/jira/browse/YUNIKORN-725
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: release
>Reporter: Holden Karau
>Priority: Major
>
> It would be good to support arm64. This is probably not too painful and can 
> be done by swapping docker build with docker buildx, but there are often edge 
> cases where some code changes are needed.






[jira] [Commented] (YUNIKORN-725) Support arm64

2022-02-23 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497183#comment-17497183
 ] 

Weiwei Yang commented on YUNIKORN-725:
--

Hi [~wilfreds], all good points, thanks. I agree we can go with the docker 
manifest option. [~srisco], if you agree with this approach, would you like to 
take this and move on to the next step?

> Support arm64
> -
>
> Key: YUNIKORN-725
> URL: https://issues.apache.org/jira/browse/YUNIKORN-725
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: release
>Reporter: Holden Karau
>Priority: Major
>
> It would be good to support arm64. This is probably not too painful and can 
> be done by swapping docker build with docker buildx, but there are often edge 
> cases where some code changes are needed.






[jira] [Commented] (YUNIKORN-990) DocSearch Migration

2022-02-23 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497049#comment-17497049
 ] 

Weiwei Yang commented on YUNIKORN-990:
--

Reopening this issue, as the search no longer seems to work after the migration.
[~Yukali] please work with the Algolia community to get this solved. 
Thanks!

> DocSearch Migration
> ---
>
> Key: YUNIKORN-990
> URL: https://issues.apache.org/jira/browse/YUNIKORN-990
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Weiwei Yang
>Assignee: Chen Yu Teng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> h1. DocSearch is migrating!
> With DocSearch, we put developers' needs first, so we always try to improve 
> their experience with our tools. We teamed up with our beloved Algolia 
> Crawler to provide a better platform for you to:
> * Access to the full Algolia Platform - Explore additional products and 
> features for free
> * Better collaboration - Invite and manage additional team members
> * Power and flexibility of the Algolia Crawler - Previously only available to 
> paid enterprise customers, it provides the ability to customize and refine 
> your indexing like never before
> * Work on your schedule - Schedule or trigger crawls based on your demands 
> from the Crawler interface or the GitHub action
> * Advanced tooling - The Crawler interface includes a live editor to maintain 
> your config and allow you to test your search results with DocSearch v3
> h1. What do I need to do?
> We tried making the migration as smooth as possible, so all you have to do 
> to migrate your yunikorn index to our new infra is:
> # Create an Algolia account with your email address (the one that received 
> this email)
> # Join your own Algolia application (the invite is valid for 7 days; please 
> contact us if you need a new one)
> # Update your DocSearch frontend integration with your new credentials: (ASK 
> PPMC)
> You can also read more about the migration in our documentation or contact us 
> at docsea...@algolia.com or on Discord
> h1. Can I still use my old credentials? When do I need to migrate by?
> Old credentials and indices will still be available, but crawl jobs will be 
> stopped 3 months after you've received this email.
> h1. Can I still use the legacy DocSearch scraper locally?
> Yes, the configs and the scraper will still be available, but not maintained.
> h1. How can I configure/update my new Crawler?
> After you've created your Algolia account and joined your Algolia 
> application, you can visit the Crawler interface and configure your crawlers!
> For any information regarding your DocSearch configuration, please visit our 
> new documentation. If you were familiar with the legacy DocSearch scraper and 
> configs, please read our key parity page. You can also read more about the 
> Algolia Crawler in the Algolia Documentation.
> h1. Do I need to trigger my crawls manually?
> No, your crawls are already scheduled to run once a week, but you are now 
> able to trigger a new one whenever you want!
> h1. I'm not using this index/these credentials anymore
> Keep in mind that DocSearch is a community project, please let us know if 
> that's the case so we can disable it.
> Read more in our migration guide






[jira] [Updated] (YUNIKORN-990) DocSearch Migration

2022-02-23 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-990:
-
Priority: Blocker  (was: Major)

> DocSearch Migration
> ---
>
> Key: YUNIKORN-990
> URL: https://issues.apache.org/jira/browse/YUNIKORN-990
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Weiwei Yang
>Assignee: Chen Yu Teng
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> h1. DocSearch is migrating!
> With DocSearch, we put developers' needs first, so we always try to improve 
> their experience with our tools. We teamed up with our beloved Algolia 
> Crawler to provide a better platform for you to:
> * Access to the full Algolia Platform - Explore additional products and 
> features for free
> * Better collaboration - Invite and manage additional team members
> * Power and flexibility of the Algolia Crawler - Previously only available to 
> paid enterprise customers, it provides the ability to customize and refine 
> your indexing like never before
> * Work on your schedule - Schedule or trigger crawls based on your demands 
> from the Crawler interface or the GitHub action
> * Advanced tooling - The Crawler interface includes a live editor to maintain 
> your config and allow you to test your search results with DocSearch v3
> h1. What do I need to do?
> We tried making the migration as smooth as possible, so all you have to do 
> to migrate your yunikorn index to our new infra is:
> # Create an Algolia account with your email address (the one that received 
> this email)
> # Join your own Algolia application (the invite is valid for 7 days; please 
> contact us if you need a new one)
> # Update your DocSearch frontend integration with your new credentials: (ASK 
> PPMC)
> You can also read more about the migration in our documentation or contact us 
> at docsea...@algolia.com or on Discord
> h1. Can I still use my old credentials? When do I need to migrate by?
> Old credentials and indices will still be available, but crawl jobs will be 
> stopped 3 months after you've received this email.
> h1. Can I still use the legacy DocSearch scraper locally?
> Yes, the configs and the scraper will still be available, but not maintained.
> h1. How can I configure/update my new Crawler?
> After you've created your Algolia account and joined your Algolia 
> application, you can visit the Crawler interface and configure your crawlers!
> For any information regarding your DocSearch configuration, please visit our 
> new documentation. If you were familiar with the legacy DocSearch scraper and 
> configs, please read our key parity page. You can also read more about the 
> Algolia Crawler in the Algolia Documentation.
> h1. Do I need to trigger my crawls manually?
> No, your crawls are already scheduled to run once a week, but you are now 
> able to trigger a new one whenever you want!
> h1. I'm not using this index/these credentials anymore
> Keep in mind that DocSearch is a community project, please let us know if 
> that's the case so we can disable it.
> Read more in our migration guide






[jira] [Reopened] (YUNIKORN-990) DocSearch Migration

2022-02-23 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reopened YUNIKORN-990:
--

> DocSearch Migration
> ---
>
> Key: YUNIKORN-990
> URL: https://issues.apache.org/jira/browse/YUNIKORN-990
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Weiwei Yang
>Assignee: Chen Yu Teng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> h1. DocSearch is migrating!
> With DocSearch, we put developers' needs first, so we always try to improve 
> their experience with our tools. We teamed up with our beloved Algolia 
> Crawler to provide a better platform for you to:
> * Access to the full Algolia Platform - Explore additional products and 
> features for free
> * Better collaboration - Invite and manage additional team members
> * Power and flexibility of the Algolia Crawler - Previously only available to 
> paid enterprise customers, it provides the ability to customize and refine 
> your indexing like never before
> * Work on your schedule - Schedule or trigger crawls based on your demands 
> from the Crawler interface or the GitHub action
> * Advanced tooling - The Crawler interface includes a live editor to maintain 
> your config and allow you to test your search results with DocSearch v3
> h1. What do I need to do?
> We tried making the migration as smooth as possible, so all you have to do 
> to migrate your yunikorn index to our new infra is:
> # Create an Algolia account with your email address (the one that received 
> this email)
> # Join your own Algolia application (the invite is valid for 7 days; please 
> contact us if you need a new one)
> # Update your DocSearch frontend integration with your new credentials: (ASK 
> PPMC)
> You can also read more about the migration in our documentation or contact us 
> at docsea...@algolia.com or on Discord
> h1. Can I still use my old credentials? When do I need to migrate by?
> Old credentials and indices will still be available, but crawl jobs will be 
> stopped 3 months after you've received this email.
> h1. Can I still use the legacy DocSearch scraper locally?
> Yes, the configs and the scraper will still be available, but not maintained.
> h1. How can I configure/update my new Crawler?
> After you've created your Algolia account and joined your Algolia 
> application, you can visit the Crawler interface and configure your crawlers!
> For any information regarding your DocSearch configuration, please visit our 
> new documentation. If you were familiar with the legacy DocSearch scraper and 
> configs, please read our key parity page. You can also read more about the 
> Algolia Crawler in the Algolia Documentation.
> h1. Do I need to trigger my crawls manually?
> No, your crawls are already scheduled to run once a week, but you are now 
> able to trigger a new one whenever you want!
> h1. I'm not using this index/these credentials anymore
> Keep in mind that DocSearch is a community project, please let us know if 
> that's the case so we can disable it.
> Read more in our migration guide






[jira] [Commented] (YUNIKORN-725) Support arm64

2022-02-21 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495871#comment-17495871
 ] 

Weiwei Yang commented on YUNIKORN-725:
--

Thanks [~srisco], [~wilfreds]. Sorry for the late response, I was on vacation.
I don't have much experience with cross-compiling. It looks like there are 
two options: docker buildx and docker manifest. The buildx approach looks much 
easier, so why do you prefer the manifest approach, [~wilfreds]? 
Apart from the build script changes, how can we make sure the image we build 
works well on arm64 hosts? As far as I can tell, GitHub Actions doesn't 
support arm64, which means we won't be able to have automated tests for that. I 
am not sure what the best practice is for this. Any suggestions?
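
One low-cost check for the question above, offered only as a sketch and not as something the project ships: a tiny Go program that reports the platform it was compiled for and fails if that differs from the expected architecture. It could be run by hand inside the image on an arm64 host. The names buildPlatform and checkArch are hypothetical; runtime.GOOS and runtime.GOARCH are standard library constants.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// buildPlatform returns the OS/architecture pair this binary was
// compiled for, for example "linux/arm64".
func buildPlatform() string {
	return runtime.GOOS + "/" + runtime.GOARCH
}

// checkArch is a hypothetical smoke check: it returns an error when the
// binary was not compiled for the wanted architecture.
func checkArch(want string) error {
	if runtime.GOARCH != want {
		return fmt.Errorf("built for %s, want architecture %s", buildPlatform(), want)
	}
	return nil
}

func main() {
	// Assumption: the expected architecture is passed as the first
	// argument; without one, the current architecture is accepted.
	want := runtime.GOARCH
	if len(os.Args) > 1 {
		want = os.Args[1]
	}
	if err := checkArch(want); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("ok:", buildPlatform())
}
```

This only confirms that the binary starts and was built for the right target; it does not replace functional tests on arm64.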

> Support arm64
> -
>
> Key: YUNIKORN-725
> URL: https://issues.apache.org/jira/browse/YUNIKORN-725
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: release
>Reporter: Holden Karau
>Priority: Major
>
> It would be good to support arm64. This is probably not too painful and can 
> be done by swapping docker build with docker buildx, but there are often edge 
> cases where some code changes are needed.






[jira] [Commented] (YUNIKORN-725) Support arm64

2022-02-16 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493470#comment-17493470
 ] 

Weiwei Yang commented on YUNIKORN-725:
--

Hi [~srisco], I do not know anyone who has an environment to create or test 
arm64 images. Do you want to give it a try?
We build our docker images with the Makefile: 
https://github.com/apache/incubator-yunikorn-k8shim/blob/1966efe93dcc1eddadf45a35a01830cedaff6a35/Makefile#L270-L271
All docker files are in the repo. Thanks

> Support arm64
> -
>
> Key: YUNIKORN-725
> URL: https://issues.apache.org/jira/browse/YUNIKORN-725
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Holden Karau
>Priority: Major
>
> It would be good to support arm64. This is probably not too painful and can 
> be done by swapping docker build with docker buildx, but there are often edge 
> cases where some code changes are needed.






[jira] [Commented] (YUNIKORN-979) Add e2e test coverage for the admission controller

2022-02-13 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491795#comment-17491795
 ] 

Weiwei Yang commented on YUNIKORN-979:
--

hi [~lowc1012]

Thanks for working on this. I think we can start by covering a few basic things:
# Submit a pod without setting schedulerName, and verify the schedulerName gets 
automatically updated to yunikorn. (We do not need to verify all generated 
labels/annotations; those should be covered by UT, and if they are not, add 
them to UT.)
# Verify the blacklist of namespaces: make sure YK skips updating pod 
labels/annotations/schedulerName for pods in such namespaces.
# Post an invalid yunikorn-configs config-map update and make sure it gets 
denied by the admission-controller.
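
The first two checks in the list above can be pictured with a small stand-alone model of the behaviour under test. This is a sketch under stated assumptions, not the admission controller's code: the pod struct, the mutate function, and the skip-list map are all hypothetical, and only the scheduler name "yunikorn" comes from the comment.

```go
package main

import "fmt"

// Scheduler name the comment expects pods to end up with.
const yunikornSchedulerName = "yunikorn"

// pod is a stand-in holding only the fields this sketch needs; it is
// not the Kubernetes Pod type.
type pod struct {
	Namespace     string
	SchedulerName string
}

// mutate is a hypothetical model of the expected behaviour: pods in
// skipped namespaces are returned unchanged, and pods elsewhere that
// have no scheduler name set get the yunikorn scheduler name.
func mutate(p pod, skipNamespaces map[string]bool) pod {
	if skipNamespaces[p.Namespace] {
		return p
	}
	if p.SchedulerName == "" {
		p.SchedulerName = yunikornSchedulerName
	}
	return p
}

func main() {
	// Assumption: kube-system is on the skip list in this example.
	skip := map[string]bool{"kube-system": true}
	for _, p := range []pod{
		{Namespace: "default"},
		{Namespace: "kube-system"},
	} {
		fmt.Printf("%s -> %q\n", p.Namespace, mutate(p, skip).SchedulerName)
	}
}
```

An e2e test would assert the same two outcomes against a live cluster instead of this in-memory model.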

> Add e2e test coverage for the admission controller
> --
>
> Key: YUNIKORN-979
> URL: https://issues.apache.org/jira/browse/YUNIKORN-979
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: test - e2e
>Reporter: Weiwei Yang
>Assignee: Ryan Lo
>Priority: Major
>  Labels: pull-request-available
>
> We need to add coverage for the admission controller in order to prevent 
> regressions.






[jira] [Resolved] (YUNIKORN-990) DocSearch Migration

2022-02-12 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-990.
--
Fix Version/s: 1.0.0
   Resolution: Fixed

> DocSearch Migration
> ---
>
> Key: YUNIKORN-990
> URL: https://issues.apache.org/jira/browse/YUNIKORN-990
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Weiwei Yang
>Assignee: Chen Yu Teng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> h1. DocSearch is migrating!
> With DocSearch, we put developers' needs first, so we always try to improve 
> their experience with our tools. We teamed up with our beloved Algolia 
> Crawler to provide a better platform for you to:
> * Access to the full Algolia Platform - Explore additional products and 
> features for free
> * Better collaboration - Invite and manage additional team members
> * Power and flexibility of the Algolia Crawler - Previously only available to 
> paid enterprise customers, it provides the ability to customize and refine 
> your indexing like never before
> * Work on your schedule - Schedule or trigger crawls based on your demands 
> from the Crawler interface or the GitHub action
> * Advanced tooling - The Crawler interface includes a live editor to maintain 
> your config and allow you to test your search results with DocSearch v3
> h1. What do I need to do?
> We tried making the migration as smooth as possible, so all you have to do 
> to migrate your yunikorn index to our new infra is:
> # Create an Algolia account with your email address (the one that received 
> this email)
> # Join your own Algolia application (the invite is valid for 7 days; please 
> contact us if you need a new one)
> # Update your DocSearch frontend integration with your new credentials: (ASK 
> PPMC)
> You can also read more about the migration in our documentation or contact us 
> at docsea...@algolia.com or on Discord
> h1. Can I still use my old credentials? When do I need to migrate by?
> Old credentials and indices will still be available, but crawl jobs will be 
> stopped 3 months after you've received this email.
> h1. Can I still use the legacy DocSearch scraper locally?
> Yes, the configs and the scraper will still be available, but not maintained.
> h1. How can I configure/update my new Crawler?
> After you've created your Algolia account and joined your Algolia 
> application, you can visit the Crawler interface and configure your crawlers!
> For any information regarding your DocSearch configuration, please visit our 
> new documentation. If you were familiar with the legacy DocSearch scraper and 
> configs, please read our key parity page. You can also read more about the 
> Algolia Crawler in the Algolia Documentation.
> h1. Do I need to trigger my crawls manually?
> No, your crawls are already scheduled to run once a week, but you are now 
> able to trigger a new one whenever you want!
> h1. I'm not using this index/these credentials anymore
> Keep in mind that DocSearch is a community project, please let us know if 
> that's the case so we can disable it.
> Read more in our migration guide






[jira] [Resolved] (YUNIKORN-1073) AllocationAskRelease field allocationkey should be allocationKey

2022-02-12 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-1073.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> AllocationAskRelease field allocationkey should be allocationKey
> 
>
> Key: YUNIKORN-1073
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1073
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: core - common, scheduler-interface, shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: ted
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 1.0.0
>
>
> The case of the allocationkey field in AllocationAskRelease is incorrect and 
> should be fixed. The proper capitalisation is allocationKey, as used 
> in all other places in the interface.
> This has a flow-on effect from the interface to the core and shim, as the 
> field name in the message changes. It is a simple change that can almost be 
> done by a search and replace plus two go.mod file updates.
>  





