[jira] [Updated] (YUNIKORN-317) [Design] Cache/Scheduler modules need to be combined to a single module

2020-08-05 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YUNIKORN-317:

Summary: [Design] Cache/Scheduler modules need to be combined to a single 
module  (was: [CORE] Cache/Scheduler modules need to be combined to a single 
module)

> [Design] Cache/Scheduler modules need to be combined to a single module
> ---
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - cache, core - scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> Now we have two modules inside core:
>  * Cache
>  * Scheduler
> The Cache and the Scheduler communicate with each other using async 
> events, which causes many inconsistency issues (such as YUNIKORN-169). 
> This also complicates the logic a lot: 10 out of 19 events are exchanged 
> only between cache and scheduler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-317) [Design] Cache/Scheduler modules need to be combined to a single module

2020-08-05 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YUNIKORN-317:

Component/s: documentation

> [Design] Cache/Scheduler modules need to be combined to a single module
> ---
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - cache, core - scheduler, documentation
>Reporter: Wangda Tan
>Priority: Major
>
> Now we have two modules inside core:
>  * Cache
>  * Scheduler
> The Cache and the Scheduler communicate with each other using async 
> events, which causes many inconsistency issues (such as YUNIKORN-169). 
> This also complicates the logic a lot: 10 out of 19 events are exchanged 
> only between cache and scheduler.






[jira] [Commented] (YUNIKORN-317) [CORE] Cache/Scheduler modules need to be combined to a single module

2020-07-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165354#comment-17165354
 ] 

Wangda Tan commented on YUNIKORN-317:
-

Created the [https://github.com/wangdatan/yunikorn-core/tree/YUNIKORN-317] 
branch and uploaded a WIP patch (not ready for review).

> [CORE] Cache/Scheduler modules need to be combined to a single module
> -
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - cache, core - scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> Now we have two modules inside core:
>  * Cache
>  * Scheduler
> The Cache and the Scheduler communicate with each other using async 
> events, which causes many inconsistency issues (such as YUNIKORN-169). 
> This also complicates the logic a lot: 10 out of 19 events are exchanged 
> only between cache and scheduler.






[jira] [Commented] (YUNIKORN-317) [CORE] Cache/Scheduler modules need to be combined to a single module

2020-07-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165353#comment-17165353
 ] 

Wangda Tan commented on YUNIKORN-317:
-

Had a brief discussion with [~cheersyang] and [~wilfreds]; we agree that 
combining the two modules is a good idea. 

Since this will involve a lot of changes, we need a better design and plan 
before making them. I'm writing a design/analysis doc for the proposal, which 
will follow.

> [CORE] Cache/Scheduler modules need to be combined to a single module
> -
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - cache, core - scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> Now we have two modules inside core:
>  * Cache
>  * Scheduler
> The Cache and the Scheduler communicate with each other using async 
> events, which causes many inconsistency issues (such as YUNIKORN-169). 
> This also complicates the logic a lot: 10 out of 19 events are exchanged 
> only between cache and scheduler.






[jira] [Created] (YUNIKORN-317) [CORE] Cache/Scheduler modules need to be combined to a single module

2020-07-26 Thread Wangda Tan (Jira)
Wangda Tan created YUNIKORN-317:
---

 Summary: [CORE] Cache/Scheduler modules need to be combined to a 
single module
 Key: YUNIKORN-317
 URL: https://issues.apache.org/jira/browse/YUNIKORN-317
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - cache, core - scheduler
Reporter: Wangda Tan


Now we have two modules inside core:
 * Cache
 * Scheduler

The Cache and the Scheduler communicate with each other using async 
events, which causes many inconsistency issues (such as YUNIKORN-169). 

This also complicates the logic a lot: 10 out of 19 events are exchanged 
only between cache and scheduler.
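
To illustrate the kind of inconsistency that async events between two modules 
can produce, here is a toy Go model. It is not YuniKorn code: the types Split 
and Combined and the methods TryAllocate and Drain are hypothetical. The sketch 
assumes the scheduler decides from its own copy of the available resource and 
that this copy is refreshed only when queued allocation events are processed. 
In the combined variant the check and the update happen in one step.

```go
package main

import (
	"fmt"
	"sync"
)

// Split is a hypothetical two-module layout: the scheduler checks its own
// view, and the cache is updated later when pending events are drained.
type Split struct {
	schedulerView  int   // scheduler's (possibly stale) copy of available
	cacheAvailable int   // cache's authoritative value
	pending        []int // allocation events not yet applied to the cache
}

// TryAllocate checks the ask against the stale view and queues an event.
func (s *Split) TryAllocate(ask int) bool {
	if ask > s.schedulerView {
		return false
	}
	s.pending = append(s.pending, ask)
	return true
}

// Drain applies all queued events to the cache and refreshes the view.
func (s *Split) Drain() {
	for _, ask := range s.pending {
		s.cacheAvailable -= ask
	}
	s.pending = nil
	s.schedulerView = s.cacheAvailable
}

// Combined is a hypothetical merged module: check and update share one lock.
type Combined struct {
	mu        sync.Mutex
	available int
}

// TryAllocate checks and updates the available resource atomically.
func (c *Combined) TryAllocate(ask int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ask > c.available {
		return false
	}
	c.available -= ask
	return true
}

func main() {
	split := &Split{schedulerView: 10, cacheAvailable: 10}
	fmt.Println(split.TryAllocate(8), split.TryAllocate(8))
	split.Drain()
	fmt.Println("split available after drain:", split.cacheAvailable)

	combined := &Combined{available: 10}
	fmt.Println(combined.TryAllocate(8), combined.TryAllocate(8))
	fmt.Println("combined available:", combined.available)
}
```

In the split model both asks of 8 pass against a capacity of 10 and the cache 
ends up over-allocated; in the combined model the second ask is rejected.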






[jira] [Commented] (YUNIKORN-223) build website v2 by docker

2020-06-12 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134337#comment-17134337
 ] 

Wangda Tan commented on YUNIKORN-223:
-

Thanks [~lamber-ken] for filing this ticket. I just moved it under the 
YUNIKORN-212 umbrella.

> build website v2 by docker
> --
>
> Key: YUNIKORN-223
> URL: https://issues.apache.org/jira/browse/YUNIKORN-223
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: website
>Reporter: lamber-ken
>Priority: Major
>
> build website v2 by docker






[jira] [Updated] (YUNIKORN-223) build website v2 by docker

2020-06-12 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YUNIKORN-223:

Parent: YUNIKORN-212
Issue Type: Sub-task  (was: Improvement)

> build website v2 by docker
> --
>
> Key: YUNIKORN-223
> URL: https://issues.apache.org/jira/browse/YUNIKORN-223
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: website
>Reporter: lamber-ken
>Priority: Major
>
> build website v2 by docker






[jira] [Updated] (YUNIKORN-212) [Umbrella] YuniKorn web-site v2

2020-06-12 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YUNIKORN-212:

Summary: [Umbrella] YuniKorn web-site v2  (was: YuniKorn web-site v2)

> [Umbrella] YuniKorn web-site v2
> ---
>
> Key: YUNIKORN-212
> URL: https://issues.apache.org/jira/browse/YUNIKORN-212
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: website
>Reporter: Weiwei Yang
>Priority: Major
>
> Current web site: [http://yunikorn.apache.org/] 
> Plan to build a new version of web-site based on 
> [https://github.com/facebook/docusaurus]
> Planned goals:
>  * automated doc pages build
>  * multi-version doc support
>  * auto build and publish to ASF






[jira] [Commented] (YUNIKORN-218) Introduce Chaos Monkey tests to YuniKorn

2020-06-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133749#comment-17133749
 ] 

Wangda Tan commented on YUNIKORN-218:
-

*For phase #1, we can cover the following scenarios:* 

1) Randomly kill scheduler PODs.

2) Randomly kill other PODs which require the scheduler to relaunch them. 

3) Randomly bring down nodes. 

4) If this is a common setup: schedule some pods using the default scheduler 
on the same cluster, and make sure that YuniKorn reacts correctly.

*We need to make sure to check for the following abnormalities:* 

1) Are there any abnormal scheduler states, such as negative available 
resources, total used > total available, etc.? 

2) Is there any violation of scheduling constraints (like allocated > 
maximum-allowed, anti-affinity, etc.)? 

3) If we run applications with a finite lifetime (like batch), we should expect 
such applications to complete (end-state check). 

4) Do we see nodes with sufficient available resources while some pods stay in 
a pending state?

[https://github.com/asobti/kube-monkey] looks like a good tool to create chaos 
within a cluster.
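
The first abnormality check in this comment (negative or over-committed 
resources) could be expressed as a small invariant checker that a chaos test 
runs after each disruption. This is only a sketch in Go: NodeSnapshot and 
CheckNode are hypothetical names, and how a test would obtain such a snapshot 
from the scheduler is not covered here.

```go
package main

import "fmt"

// NodeSnapshot is a hypothetical, simplified view of one node's accounting.
type NodeSnapshot struct {
	Name      string
	Total     map[string]int64 // capacity per resource type
	Allocated map[string]int64 // currently allocated per resource type
}

// CheckNode returns one message per accounting invariant the node violates:
// an allocation must never be negative and must never exceed the total.
func CheckNode(n NodeSnapshot) []string {
	var violations []string
	for res, used := range n.Allocated {
		total := n.Total[res]
		if used < 0 {
			violations = append(violations,
				fmt.Sprintf("%s: negative allocation of %s (%d)", n.Name, res, used))
		}
		if used > total {
			violations = append(violations,
				fmt.Sprintf("%s: %s allocated %d > total %d", n.Name, res, used, total))
		}
	}
	return violations
}

func main() {
	healthy := NodeSnapshot{
		Name:      "node-1",
		Total:     map[string]int64{"vcore": 32},
		Allocated: map[string]int64{"vcore": 16},
	}
	broken := NodeSnapshot{
		Name:      "node-2",
		Total:     map[string]int64{"vcore": 32},
		Allocated: map[string]int64{"vcore": 40},
	}
	fmt.Println(CheckNode(healthy))
	fmt.Println(CheckNode(broken))
}
```

A chaos run would fail as soon as CheckNode returns a non-empty list for any 
node, which turns "abnormal state" into something a test can assert on.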

> Introduce Chaos Monkey tests to YuniKorn
> 
>
> Key: YUNIKORN-218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-218
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: test - e2e
>Reporter: Wangda Tan
>Priority: Major
>
> We're seeing reports of corner cases which cause the scheduler to 
> malfunction, fail to allocate PODs, etc. 
> To systematically prevent such issues from happening, I think we need to 
> introduce ChaosMonkey tests.






[jira] [Created] (YUNIKORN-218) Introduce Chaos Monkey tests to YuniKorn

2020-06-11 Thread Wangda Tan (Jira)
Wangda Tan created YUNIKORN-218:
---

 Summary: Introduce Chaos Monkey tests to YuniKorn
 Key: YUNIKORN-218
 URL: https://issues.apache.org/jira/browse/YUNIKORN-218
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: test - e2e
Reporter: Wangda Tan


We're seeing reports of corner cases which cause the scheduler to malfunction, 
fail to allocate PODs, etc. 

To systematically prevent such issues from happening, I think we need to 
introduce ChaosMonkey tests.






[jira] [Commented] (YUNIKORN-42) High efficient scheduling events framework phase 1

2020-06-05 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126965#comment-17126965
 ] 

Wangda Tan commented on YUNIKORN-42:


[~adam.antal], can you make it open to the public? (You may need to create it 
in your own Google Drive and set it to be publicly viewable.)

> High efficient scheduling events framework phase 1
> --
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Adam Antal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently it is tricky to troubleshoot pod allocation; we need to better 
> expose this information in the POD description.






[jira] [Created] (YUNIKORN-184) Update YuniKorn-Core Design Documentation

2020-05-22 Thread Wangda Tan (Jira)
Wangda Tan created YUNIKORN-184:
---

 Summary: Update YuniKorn-Core Design Documentation
 Key: YUNIKORN-184
 URL: https://issues.apache.org/jira/browse/YUNIKORN-184
 Project: Apache YuniKorn
  Issue Type: Task
  Components: documentation
Reporter: Wangda Tan


The original design doc is quite outdated; it was written before yunikorn-core 
was completed. A lot has changed since then, so we need to refresh the design 
doc so that new members can more easily understand the structure of 
yunikorn-core.






[jira] [Commented] (YUNIKORN-173) Generates one default application ID per namespace in the admission controller

2020-05-22 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114422#comment-17114422
 ] 

Wangda Tan commented on YUNIKORN-173:
-

[~wwei], [~wilfreds],  

What is the behavior when namespace -> queue creation/mapping is disabled? 

> Generates one default application ID per namespace in the admission controller
> --
>
> Key: YUNIKORN-173
> URL: https://issues.apache.org/jira/browse/YUNIKORN-173
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9
>
>
> If an app doesn't explicitly specify an application ID, let's group such 
> pods into one single app per namespace.






[jira] [Commented] (YUNIKORN-2) Support Gang Scheduling

2020-05-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108784#comment-17108784
 ] 

Wangda Tan commented on YUNIKORN-2:
---

Added design doc, still largely WIP.

[~wwei], [~wilfreds], [~Tao Yang], [~sunil.gov...@gmail.com] please review and 
give some feedback when you get a chance. 

> Support Gang Scheduling
> ---
>
> Key: YUNIKORN-2
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>
> Gang scheduling is one of the most important features for schedulers. It is 
> very useful for machine learning workloads such as tensorflow, pytorch, etc. 
> Since yunikorn has the notion of an application, it is not very hard for us 
> to support gang scheduling.
> This umbrella tracks the efforts to support the gang scheduling feature in 
> yunikorn.






[jira] [Commented] (YUNIKORN-21) Revisit node sorting algorithm for fairness

2020-04-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092781#comment-17092781
 ] 

Wangda Tan commented on YUNIKORN-21:


Thanks [~Tao Yang], the answer looks reasonable to me, and the numbers also 
look very promising.

bq. The scheduling throughput can be improved from 450 to 5000+ in a mock 
cluster with 1000 nodes according to the benchmark results of 
scheduler_perf_test.go in my local test.

I reread the design doc and think I understand it better now. At a high level, 
the class design makes sense, and if you can provide a PoC patch, I can help 
review it and give more detailed suggestions.

On second thought, I think the weighted sort policy may make sense if the 
number of scorers involved is small (no more than 3). I want to avoid having 20 
scorers involved and producing a weighted score which we cannot explain at all.
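
As a rough illustration of why a small number of scorers stays explainable, 
here is a minimal weighted-sort sketch in Go. The Node fields, the Scorer type 
and the weights are all assumptions made for this example and are not taken 
from the design doc.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical node with free capacity expressed as fractions (0..1).
type Node struct {
	Name      string
	FreeVcore float64
	FreeMem   float64
}

// Scorer is one named, weighted scoring function.
type Scorer struct {
	Name   string
	Weight float64
	Score  func(Node) float64
}

// WeightedScore sums weight*score over all scorers for one node.
func WeightedScore(n Node, scorers []Scorer) float64 {
	total := 0.0
	for _, s := range scorers {
		total += s.Weight * s.Score(n)
	}
	return total
}

// SortByWeightedScore orders nodes in place, highest weighted score first.
func SortByWeightedScore(nodes []Node, scorers []Scorer) {
	sort.SliceStable(nodes, func(i, j int) bool {
		return WeightedScore(nodes[i], scorers) > WeightedScore(nodes[j], scorers)
	})
}

func main() {
	scorers := []Scorer{
		{Name: "free-vcore", Weight: 2, Score: func(n Node) float64 { return n.FreeVcore }},
		{Name: "free-mem", Weight: 1, Score: func(n Node) float64 { return n.FreeMem }},
	}
	nodes := []Node{
		{Name: "a", FreeVcore: 0.5, FreeMem: 0.5},
		{Name: "b", FreeVcore: 0.9, FreeMem: 0.2},
	}
	SortByWeightedScore(nodes, scorers)
	for _, n := range nodes {
		fmt.Printf("%s %.2f\n", n.Name, WeightedScore(n, scorers))
	}
}
```

With two scorers the ranking can still be read off by hand (node b wins because 
vcores count double); with many scorers that per-term reasoning is lost.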

> Revisit node sorting algorithm for fairness
> ---
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Wangda Tan
>Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node 
> sorting algorithm v2.pdf
>
>
> Currently, we're using DominantRatio for the node sorting algorithm:
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>  lshares := getShares(left, nil)
>  rshares := getShares(right, nil)
>  return compareShares(lshares, rshares)
> }{code}
> This is not good, for two reasons:
>  # Dominant resource comparison is about 8X more expensive than a single 
> float comparison for two resource types.
>  # Dominant resource comparison is not stable when we have scarce resource 
> types like GPU. Compare a node with 192GB mem, 32 vcores, and 1 GPU available 
> to a node with 168GB mem, 64 vcores, and 8 GPUs available; the former can go 
> first because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>  // negative share is logged
>  if v < 0 {
>   log.Logger().Debug("usage is negative no total, share is also negative",
>    zap.Int64("resource quantity", int64(v)))
>  }
>  shares[idx] = float64(v)
>  idx++
>  continue
> }{code}
> I think we should discard dominant resource comparison for node resources. 
> Instead, we can just use one resource type (like vcores) to compare available 
> resources.





[jira] [Commented] (YUNIKORN-21) Revisit node sorting algorithm for fairness

2020-04-25 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092340#comment-17092340
 ] 

Wangda Tan commented on YUNIKORN-21:


Several examples for the node sorting policy: 

1) Bin-packing policy: Get sorted node list based on most-used nodes. 

2) Peanut-buttering policy: Get sorted node list based on least-used nodes. 

3) Best-fit policy: Get a sorted node list based on how similar each node's 
available resource vector is to the requested resource. 
For example, if the requested resource is cpu=2,mem=3, a node with available 
resource cpu=4,mem=6 is a better "fit" than another node with available 
resource cpu=6,mem=4. (Reference paper: 
https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf) 

Policies 1 and 2 are not request-related, while 3 is request-related. I'm 
wondering how we deal with these different use cases based on the proposal.

Also, it will be important to make sure the node sorting policy can be used by 
the preemption logic. 
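
The three example policies in this comment can be sketched as three sort 
functions in Go. Everything here is illustrative: the Node and Resource types 
are hypothetical, UsedRatio stands in for "how used a node is", and cosine 
similarity is just one possible reading of "most similar" for best-fit; the 
referenced paper may define alignment differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Resource is a hypothetical two-dimensional resource vector.
type Resource struct{ CPU, Mem float64 }

// Node is a hypothetical node with a usage ratio and available resources.
type Node struct {
	Name      string
	UsedRatio float64 // 0 = empty, 1 = full
	Available Resource
}

// cosine returns the cosine similarity of two vectors (0 if either is zero).
func cosine(a, b Resource) float64 {
	norm := math.Hypot(a.CPU, a.Mem) * math.Hypot(b.CPU, b.Mem)
	if norm == 0 {
		return 0
	}
	return (a.CPU*b.CPU + a.Mem*b.Mem) / norm
}

// SortBinPacking puts the most-used nodes first (request-independent).
func SortBinPacking(nodes []Node) {
	sort.SliceStable(nodes, func(i, j int) bool { return nodes[i].UsedRatio > nodes[j].UsedRatio })
}

// SortSpread puts the least-used nodes first (request-independent).
func SortSpread(nodes []Node) {
	sort.SliceStable(nodes, func(i, j int) bool { return nodes[i].UsedRatio < nodes[j].UsedRatio })
}

// SortBestFit puts nodes whose available vector best matches req first.
func SortBestFit(nodes []Node, req Resource) {
	sort.SliceStable(nodes, func(i, j int) bool {
		return cosine(nodes[i].Available, req) > cosine(nodes[j].Available, req)
	})
}

func main() {
	nodes := []Node{
		{Name: "n1", UsedRatio: 0.2, Available: Resource{CPU: 6, Mem: 4}},
		{Name: "n2", UsedRatio: 0.7, Available: Resource{CPU: 4, Mem: 6}},
	}
	SortBinPacking(nodes)
	fmt.Println("bin-packing first:", nodes[0].Name)
	SortSpread(nodes)
	fmt.Println("spread first:", nodes[0].Name)
	SortBestFit(nodes, Resource{CPU: 2, Mem: 3})
	fmt.Println("best-fit first:", nodes[0].Name)
}
```

Note the difference in signatures: the first two policies need only the node 
list, while best-fit also takes the request, which is exactly the 
request-related versus request-independent split raised above.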

> Revisit node sorting algorithm for fairness
> ---
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Wangda Tan
>Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node 
> sorting algorithm v2.pdf
>
>
> Currently, we're using DominantRatio for the node sorting algorithm:
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>  lshares := getShares(left, nil)
>  rshares := getShares(right, nil)
>  return compareShares(lshares, rshares)
> }{code}
> This is not good, for two reasons:
>  # Dominant resource comparison is about 8X more expensive than a single 
> float comparison for two resource types.
>  # Dominant resource comparison is not stable when we have scarce resource 
> types like GPU. Compare a node with 192GB mem, 32 vcores, and 1 GPU available 
> to a node with 168GB mem, 64 vcores, and 8 GPUs available; the former can go 
> first because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>  // negative share is logged
>  if v < 0 {
>   log.Logger().Debug("usage is negative no total, share is also negative",
>    zap.Int64("resource quantity", int64(v)))
>  }
>  shares[idx] = float64(v)
>  idx++
>  continue
> }{code}
> I think we should discard dominant resource comparison for node resources. 
> Instead, we can just use one resource type (like vcores) to compare available 
> resources.





[jira] [Commented] (YUNIKORN-21) Revisit node sorting algorithm for fairness

2020-04-25 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092325#comment-17092325
 ] 

Wangda Tan commented on YUNIKORN-21:


Thanks [~Tao Yang], 

I took a quick look at the doc, and I'm not sure I fully understand the 
following details: 

1) What is the difference between the node-sorting policy and the evaluator? 
If the node order is solely based on the evaluator, can we use one single 
component to replace the two? (Or we only expose the sorter interface and the 
evaluator becomes an implementation detail.) 

2) The scope of the incremental sorting algorithm is not very clear to me: are 
we going to maintain a sorted list for every request? It might be too much if 
we want to do it on a per-request basis (we could have many different requests).

3) I'm not quite sure about the "Weight" concept: are we going to support a 
multi-node scorer like the K8s default scheduler? I personally don't prefer 
that way, since the behavior of a weighted result is not easy to explain.

4) How fast can we do the re-sorting? Since the node list keeps changing, and 
node status also changes fast, are we going to keep an always up-to-date 
sorting result, or will we have some latency? (If we need pre-sorted node 
lists on a per-request basis, there are too many sorted node lists we need to 
maintain.)
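
For questions 2 and 4, one possible reading of "incremental sorting" is a 
single request-independent sorted list in which only the node that changed is 
repositioned, instead of re-sorting everything. The Go sketch below shows that 
idea under this assumption; SortedNodes, Insert and Update are hypothetical 
names and may not match what the design doc proposes.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical node keyed by name with one comparable value.
type Node struct {
	Name      string
	Available int64
}

// SortedNodes keeps nodes ordered by ascending Available.
type SortedNodes struct {
	nodes []Node
}

// Insert places n at its sorted position using binary search.
func (s *SortedNodes) Insert(n Node) {
	i := sort.Search(len(s.nodes), func(i int) bool {
		return s.nodes[i].Available >= n.Available
	})
	s.nodes = append(s.nodes, Node{})
	copy(s.nodes[i+1:], s.nodes[i:])
	s.nodes[i] = n
}

// Update removes the named node (linear scan) and re-inserts it with the
// new value, leaving all other nodes in place.
func (s *SortedNodes) Update(name string, available int64) {
	for i, n := range s.nodes {
		if n.Name == name {
			s.nodes = append(s.nodes[:i], s.nodes[i+1:]...)
			break
		}
	}
	s.Insert(Node{Name: name, Available: available})
}

// Names returns node names in sorted order.
func (s *SortedNodes) Names() []string {
	names := make([]string, 0, len(s.nodes))
	for _, n := range s.nodes {
		names = append(names, n.Name)
	}
	return names
}

func main() {
	var s SortedNodes
	s.Insert(Node{Name: "a", Available: 5})
	s.Insert(Node{Name: "b", Available: 3})
	s.Insert(Node{Name: "c", Available: 8})
	fmt.Println(s.Names())
	s.Update("b", 9) // only b moves
	fmt.Println(s.Names())
}
```

A per-request variant would need one such structure per distinct request, which 
is the maintenance cost the questions above are worried about.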

> Revisit node sorting algorithm for fairness
> ---
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Wangda Tan
>Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node 
> sorting algorithm v2.pdf
>
>
> Currently, we're using DominantRatio for the node sorting algorithm:
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>  lshares := getShares(left, nil)
>  rshares := getShares(right, nil)
>  return compareShares(lshares, rshares)
> }{code}
> This is not good, for two reasons:
>  # Dominant resource comparison is about 8X more expensive than a single 
> float comparison for two resource types.
>  # Dominant resource comparison is not stable when we have scarce resource 
> types like GPU. Compare a node with 192GB mem, 32 vcores, and 1 GPU available 
> to a node with 168GB mem, 64 vcores, and 8 GPUs available; the former can go 
> first because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>  // negative share is logged
>  if v < 0 {
>   log.Logger().Debug("usage is negative no total, share is also negative",
>    zap.Int64("resource quantity", int64(v)))
>  }
>  shares[idx] = float64(v)
>  idx++
>  continue
> }{code}
> I think we should discard dominant resource comparison for node resources. 
> Instead, we can just use one resource type (like vcores) to compare available 
> resources.





[jira] [Assigned] (YUNIKORN-77) De-dup event queue methods inside yunikorn-core, and improve logs

2020-04-04 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-77?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YUNIKORN-77:
--

Assignee: Wangda Tan

> De-dup event queue methods inside yunikorn-core, and improve logs
> -
>
> Key: YUNIKORN-77
> URL: https://issues.apache.org/jira/browse/YUNIKORN-77
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
>
> Propose to improve the following areas: 
> 1) Under cache/scheduler/rmproxy, we have duplicated enqueueAndCheckFull 
> methods. 
> 2) The existing enqueueAndCheckFull doesn't block when the event queue is 
> full; we should block the write. Otherwise, we will miss events and cause 
> weird bugs. 
> 3) We lack logs when events are accumulating in the queue.






[jira] [Updated] (YUNIKORN-77) De-dup event queue methods inside yunikorn-core, and improve logs

2020-04-04 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-77?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YUNIKORN-77:
---
Labels: pull-request-available  (was: )

> De-dup event queue methods inside yunikorn-core, and improve logs
> -
>
> Key: YUNIKORN-77
> URL: https://issues.apache.org/jira/browse/YUNIKORN-77
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
>
> Propose to improve the following areas: 
> 1) Under cache/scheduler/rmproxy, we have duplicated enqueueAndCheckFull 
> methods. 
> 2) The existing enqueueAndCheckFull doesn't block when the event queue is 
> full; we should block the write. Otherwise, we will miss events and cause 
> weird bugs. 
> 3) We lack logs when events are accumulating in the queue.






[jira] [Created] (YUNIKORN-77) De-dup event queue methods inside yunikorn-core, and improve logs

2020-04-04 Thread Wangda Tan (Jira)
Wangda Tan created YUNIKORN-77:
--

 Summary: De-dup event queue methods inside yunikorn-core, and 
improve logs
 Key: YUNIKORN-77
 URL: https://issues.apache.org/jira/browse/YUNIKORN-77
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common
Reporter: Wangda Tan


Propose to improve the following areas: 

1) Under cache/scheduler/rmproxy, we have duplicated enqueueAndCheckFull 
methods. 

2) The existing enqueueAndCheckFull doesn't block when the event queue is full; 
we should block the write. Otherwise, we will miss events and cause weird bugs. 

3) We lack logs when events are accumulating in the queue.
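
The blocking behaviour proposed in item 2 could look roughly like the Go 
sketch below. It is not the yunikorn-core implementation: Event and 
EnqueueBlocking are hypothetical names, and a buffered channel stands in for 
the event queue. The function first tries a non-blocking send; if the queue is 
full it logs the backlog and then blocks until a slot is free, so no event is 
dropped.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// Event is a hypothetical queued event.
type Event struct{ Name string }

// EnqueueBlocking sends ev to queue. It reports whether the queue was full
// at the time of the call, in which case it logged and waited for a slot.
func EnqueueBlocking(queue chan Event, ev Event) (waited bool) {
	select {
	case queue <- ev:
		return false
	default:
		// Queue is full: make the backlog visible, then block instead of dropping.
		log.Printf("event queue full (%d/%d), blocking on %q", len(queue), cap(queue), ev.Name)
		queue <- ev
		return true
	}
}

func main() {
	queue := make(chan Event, 1)
	EnqueueBlocking(queue, Event{Name: "first"})

	// A slow consumer frees one slot after a short delay.
	go func() {
		time.Sleep(10 * time.Millisecond)
		<-queue
	}()

	waited := EnqueueBlocking(queue, Event{Name: "second"})
	fmt.Println("second event had to wait:", waited)
	fmt.Println("events still queued:", len(queue))
}
```

A single shared helper like this would also address item 1, since the three 
packages could call it instead of each keeping its own copy.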







[jira] [Commented] (YUNIKORN-69) Design YUNIKORN logo

2020-04-02 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074136#comment-17074136
 ] 

Wangda Tan commented on YUNIKORN-69:


Thanks for the contribution from Yu-En! We have a better-than-average logo 
among all Apache projects. ;)

> Design YUNIKORN logo
> 
>
> Key: YUNIKORN-69
> URL: https://issues.apache.org/jira/browse/YUNIKORN-69
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Sunil G
>Priority: Major
> Fix For: 0.8
>
> Attachments: yunikorn-log-gray.png, yunikorn-logo-black.png, 
> yunikorn-logo-blue (1).png, yunikorn-logo-main.png
>
>
> Create a new YuniKorn logo and use it in all projects.






[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-30 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071340#comment-17071340
 ] 

Wangda Tan commented on YUNIKORN-42:


[~wilfreds], thanks for sharing your feedback. 

I agree that pushing all events to POD/namespace description is not 
comprehensive. We may need something like YARN-9050 to get all the necessary 
information which is customized to scheduler hierarchies, etc. 

However, I think we need to push more information to the POD/namespace if 
possible. We found that early YuniKorn users are used to running 
{{describe pod}} to understand why an allocation failed. At least for the 
default scheduler, users can get pretty good insight into why an allocation 
failed. 

I think we should try to make the experience more native for users running 
YuniKorn on K8s; lacking this could be a show-stopper for many users. 

bq. The queue is out of resources, the user limit has been reached or even the 
maximum number of applications that can be run for a user is reached. That kind 
of information does not translate or help on the K8s side.

I think it is possible: we just need to go into the queue/user and populate 
this information into the event recorder. It is not straightforward, but it 
should be doable. 

bq. Every event that we push back will be stored in etcd. I am worried about 
the amount of data we will push back to etcd if we are not careful.

We definitely need to test this more. It should not be significantly more than 
what the existing K8s event system produces, since we have a throttling 
mechanism built in. 
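
To make the throttling idea concrete, here is a minimal per-key throttle in 
Go. It is a standalone sketch and does not reproduce the built-in mechanism 
mentioned above: the Throttle type, the key format and the one-minute interval 
are assumptions, timestamps are plain Unix seconds to keep the example 
self-contained, and the type is not safe for concurrent use.

```go
package main

import "fmt"

// Throttle suppresses repeats of the same key within a minimum interval.
type Throttle struct {
	intervalSeconds int64
	lastSent        map[string]int64 // key -> Unix seconds of last published event
}

// NewThrottle creates a throttle allowing one event per key per interval.
func NewThrottle(intervalSeconds int64) *Throttle {
	return &Throttle{intervalSeconds: intervalSeconds, lastSent: make(map[string]int64)}
}

// Allow reports whether an event for key may be published at time now,
// and records the publication time when it may.
func (t *Throttle) Allow(key string, now int64) bool {
	if last, ok := t.lastSent[key]; ok && now-last < t.intervalSeconds {
		return false
	}
	t.lastSent[key] = now
	return true
}

func main() {
	th := NewThrottle(60)
	key := "default/pod-1:QueueOutOfResources" // hypothetical pod + reason key

	fmt.Println(th.Allow(key, 0))   // first event for this key
	fmt.Println(th.Allow(key, 10))  // repeat within the interval
	fmt.Println(th.Allow(key, 120)) // repeat after the interval
}
```

Keying on pod plus reason means a pod that stays pending for the same reason 
produces at most one event per interval, which bounds what ends up in etcd.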

bq. Can we also guarantee that the YuniKorn admin can always describe all the 
pods and namespaces? That would require the admin to have high level access to 
the K8s cluster which he might not have. Something we need to keep in mind.

A lot of this should be done by the end user of the K8s cluster instead of the 
admin. The admin should ideally access our UI to get more details if pod 
describe doesn't have enough information. (Of course, this needs additional UI 
work.)

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently it is tricky to troubleshoot pod allocation; we need to better 
> expose this information in the POD description.






[jira] [Resolved] (YUNIKORN-52) Update queue/app screenshots in README

2020-03-25 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YUNIKORN-52.

Fix Version/s: 0.8
   Resolution: Fixed

Merged, thanks [~wwei]

> Update queue/app screenshots in README
> --
>
> Key: YUNIKORN-52
> URL: https://issues.apache.org/jira/browse/YUNIKORN-52
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have some updates in UI, let's refresh the screenshots as well.






[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-23 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064892#comment-17064892
 ] 

Wangda Tan commented on YUNIKORN-42:


Thanks [~Tao Yang], you're absolutely right. 

There are two requirements in the doc: 
{code}
- Events and diagnostics can be retrieved from YuniKorn UI / REST API. (P2) 
- Be able to filter events/diagnostics based on queue/app/node/pod. (P2)
{code}

I marked them as P2 for now because retrieving them from the YuniKorn web UI 
needs a lot of front-end work to complete and to make sure that access control 
for the UI can be enforced. 

Meanwhile, many early users of YuniKorn are already familiar with existing K8s 
tools, such as POD events, for understanding what's going on in the cluster. So 
I think we can publish POD events first, to satisfy 70-80% of troubleshooting 
needs without requiring admins to learn anything specific to YuniKorn.

After we finish the basic work, we should definitely consider more powerful and 
integrated tools for admins to use, similar to YARN-9050.

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>
> Troubleshooting pod allocation is currently tricky; we need to better expose 
> this information in the POD description.






[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063652#comment-17063652
 ] 

Wangda Tan commented on YUNIKORN-42:


I may not have the bandwidth to start or finish the end-to-end work in the next 
two weeks. If someone is interested in working on this immediately, please feel 
free to comment on the JIRA and take over the tasks. Otherwise, I can spend 
some cycles pushing this forward, but I can't make any commitment.

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>
> Troubleshooting pod allocation is currently tricky; we need to better expose 
> this information in the POD description.






[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063649#comment-17063649
 ] 

Wangda Tan commented on YUNIKORN-42:


I spent some time thinking about the problem and put the ideas into a design 
doc: 
https://docs.google.com/document/d/19iMkLJGVwTSq9OfV9p75wOobJXAxR_CRo4YzSRG_Pzw/edit#

cc: [~cheersyang], [~taoyang], [~wilfreds] to review the ideas.

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>
> Troubleshooting pod allocation is currently tricky; we need to better expose 
> this information in the POD description.






[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063635#comment-17063635
 ] 

Wangda Tan commented on YUNIKORN-42:


Attached the design doc (WIP).

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>
> Troubleshooting pod allocation is currently tricky; we need to better expose 
> this information in the POD description.






[jira] [Created] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-19 Thread Wangda Tan (Jira)
Wangda Tan created YUNIKORN-42:
--

 Summary: Better to support POD events for YuniKorn to troubleshoot 
allocation failures
 Key: YUNIKORN-42
 URL: https://issues.apache.org/jira/browse/YUNIKORN-42
 Project: Apache YuniKorn
  Issue Type: Task
Reporter: Wangda Tan


Troubleshooting pod allocation is currently tricky; we need to better expose 
this information in the POD description.






[jira] [Commented] (YUNIKORN-40) Add doc for YuniKorn Community Sync Up

2020-03-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-40?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063013#comment-17063013
 ] 

Wangda Tan commented on YUNIKORN-40:


Thanks [~wwei] for committing the patch.

[~wilfreds]: [~wwei] is correct; I'm working on YUNIKORN-41 to link to the PR. 
The patch is already available.

> Add doc for YuniKorn Community Sync Up
> --
>
> Key: YUNIKORN-40
> URL: https://issues.apache.org/jira/browse/YUNIKORN-40
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: community, documentation
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Assigned] (YUNIKORN-41) Update YuniKorn main README.md to make user can better get focus of the project

2020-03-19 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YUNIKORN-41:
--

Assignee: Wangda Tan

> Update YuniKorn main README.md to make user can better get focus of the 
> project
> ---
>
> Key: YUNIKORN-41
> URL: https://issues.apache.org/jira/browse/YUNIKORN-41
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The message on the Apache YuniKorn GitHub page is a bit confusing because it 
> puts YARN and “general scheduler” together. 
> Most of the companies we spoke to were initially confused: is it a K8s 
> scheduler, or is it a YARN RM hooked up to K8s? I think we need to do 
> substantial work to get the marketing message right.
> To me, the #1 use case YuniKorn is trying to solve now is to be the best 
> scheduler in K8s for batch workloads. 
> It is nicely designed to allow yunikorn-core to hook up with other RMs, but 
> that is the next step.
> We can also include other items, such as highlights of the K8s integration, 
> perf tests, and community sync-ups, in the same README.






[jira] [Created] (YUNIKORN-41) Update YuniKorn main README.md to make user can better get focus of the project

2020-03-19 Thread Wangda Tan (Jira)
Wangda Tan created YUNIKORN-41:
--

 Summary: Update YuniKorn main README.md to make user can better 
get focus of the project
 Key: YUNIKORN-41
 URL: https://issues.apache.org/jira/browse/YUNIKORN-41
 Project: Apache YuniKorn
  Issue Type: Task
Reporter: Wangda Tan


The message on the Apache YuniKorn GitHub page is a bit confusing because it 
puts YARN and “general scheduler” together. 

Most of the companies we spoke to were initially confused: is it a K8s 
scheduler, or is it a YARN RM hooked up to K8s? I think we need to do 
substantial work to get the marketing message right.

To me, the #1 use case YuniKorn is trying to solve now is to be the best 
scheduler in K8s for batch workloads. 

It is nicely designed to allow yunikorn-core to hook up with other RMs, but 
that is the next step.

We can also include other items, such as highlights of the K8s integration, 
perf tests, and community sync-ups, in the same README.






[jira] [Assigned] (YUNIKORN-40) Add doc for YuniKorn Community Sync Up

2020-03-19 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YUNIKORN-40:
--

Assignee: Wangda Tan

> Add doc for YuniKorn Community Sync Up
> --
>
> Key: YUNIKORN-40
> URL: https://issues.apache.org/jira/browse/YUNIKORN-40
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: community, documentation
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>



