[jira] [Updated] (YUNIKORN-317) [Design] Cache/Scheduler modules need to be combined to a single module
[ https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YUNIKORN-317:
---
Summary: [Design] Cache/Scheduler modules need to be combined to a single module (was: [CORE] Cache/Scheduler modules need to be combined to a single module)

> [Design] Cache/Scheduler modules need to be combined to a single module
> ---
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - cache, core - scheduler
> Reporter: Wangda Tan
> Priority: Major
>
> Now we have two modules inside core:
> * Cache
> * Scheduler
> Cache and Scheduler communicate with each other using async events, which causes a lot of inconsistency issues (such as YUNIKORN-169).
> It also complicates the logic a lot: 10 out of 19 events are exchanged only between cache and scheduler.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-317) [Design] Cache/Scheduler modules need to be combined to a single module
[ https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YUNIKORN-317:
---
Component/s: documentation

> [Design] Cache/Scheduler modules need to be combined to a single module
> ---
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - cache, core - scheduler, documentation
> Reporter: Wangda Tan
> Priority: Major
>
> Now we have two modules inside core:
> * Cache
> * Scheduler
> Cache and Scheduler communicate with each other using async events, which causes a lot of inconsistency issues (such as YUNIKORN-169).
> It also complicates the logic a lot: 10 out of 19 events are exchanged only between cache and scheduler.
[jira] [Commented] (YUNIKORN-317) [CORE] Cache/Scheduler modules need to be combined to a single module
[ https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165354#comment-17165354 ]

Wangda Tan commented on YUNIKORN-317:
---
Created the [https://github.com/wangdatan/yunikorn-core/tree/YUNIKORN-317] branch and uploaded a WIP patch (not ready for review).

> [CORE] Cache/Scheduler modules need to be combined to a single module
> ---
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - cache, core - scheduler
> Reporter: Wangda Tan
> Priority: Major
>
> Now we have two modules inside core:
> * Cache
> * Scheduler
> Cache and Scheduler communicate with each other using async events, which causes a lot of inconsistency issues (such as YUNIKORN-169).
> It also complicates the logic a lot: 10 out of 19 events are exchanged only between cache and scheduler.
[jira] [Commented] (YUNIKORN-317) [CORE] Cache/Scheduler modules need to be combined to a single module
[ https://issues.apache.org/jira/browse/YUNIKORN-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165353#comment-17165353 ]

Wangda Tan commented on YUNIKORN-317:
---
Had a brief discussion with [~cheersyang] and [~wilfreds]; we agree that combining the two modules is a good idea. Since it will involve lots of changes, we need a better design and plan before making them. I'm writing a design/analysis doc for the proposal, which will follow.

> [CORE] Cache/Scheduler modules need to be combined to a single module
> ---
>
> Key: YUNIKORN-317
> URL: https://issues.apache.org/jira/browse/YUNIKORN-317
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - cache, core - scheduler
> Reporter: Wangda Tan
> Priority: Major
>
> Now we have two modules inside core:
> * Cache
> * Scheduler
> Cache and Scheduler communicate with each other using async events, which causes a lot of inconsistency issues (such as YUNIKORN-169).
> It also complicates the logic a lot: 10 out of 19 events are exchanged only between cache and scheduler.
[jira] [Created] (YUNIKORN-317) [CORE] Cache/Scheduler modules need to be combined to a single module
Wangda Tan created YUNIKORN-317:
---
Summary: [CORE] Cache/Scheduler modules need to be combined to a single module
Key: YUNIKORN-317
URL: https://issues.apache.org/jira/browse/YUNIKORN-317
Project: Apache YuniKorn
Issue Type: Improvement
Components: core - cache, core - scheduler
Reporter: Wangda Tan

Now we have two modules inside core:
* Cache
* Scheduler
Cache and Scheduler communicate with each other using async events, which causes a lot of inconsistency issues (such as YUNIKORN-169).
It also complicates the logic a lot: 10 out of 19 events are exchanged only between cache and scheduler.
[jira] [Commented] (YUNIKORN-223) build website v2 by docker
[ https://issues.apache.org/jira/browse/YUNIKORN-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134337#comment-17134337 ]

Wangda Tan commented on YUNIKORN-223:
---
Thanks [~lamber-ken] for filing this ticket. I just moved it under the YUNIKORN-212 umbrella.

> build website v2 by docker
> ---
>
> Key: YUNIKORN-223
> URL: https://issues.apache.org/jira/browse/YUNIKORN-223
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: website
> Reporter: lamber-ken
> Priority: Major
>
> build website v2 by docker
[jira] [Updated] (YUNIKORN-223) build website v2 by docker
[ https://issues.apache.org/jira/browse/YUNIKORN-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YUNIKORN-223:
---
Parent: YUNIKORN-212
Issue Type: Sub-task (was: Improvement)

> build website v2 by docker
> ---
>
> Key: YUNIKORN-223
> URL: https://issues.apache.org/jira/browse/YUNIKORN-223
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: website
> Reporter: lamber-ken
> Priority: Major
>
> build website v2 by docker
[jira] [Updated] (YUNIKORN-212) [Umbrella] YuniKorn web-site v2
[ https://issues.apache.org/jira/browse/YUNIKORN-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YUNIKORN-212:
---
Summary: [Umbrella] YuniKorn web-site v2 (was: YuniKorn web-site v2)

> [Umbrella] YuniKorn web-site v2
> ---
>
> Key: YUNIKORN-212
> URL: https://issues.apache.org/jira/browse/YUNIKORN-212
> Project: Apache YuniKorn
> Issue Type: New Feature
> Components: website
> Reporter: Weiwei Yang
> Priority: Major
>
> Current web site: [http://yunikorn.apache.org/]
> Plan to build a new version of web-site based on [https://github.com/facebook/docusaurus]
> Planned goals:
> * automated doc pages build
> * multi-version doc support
> * auto build and publish to ASF
[jira] [Commented] (YUNIKORN-218) Introduce Chaos Monkey tests to YuniKorn
[ https://issues.apache.org/jira/browse/YUNIKORN-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133749#comment-17133749 ]

Wangda Tan commented on YUNIKORN-218:
---
*For phase #1, we can monitor the following items:*
1) Randomly kill scheduler PODs.
2) Randomly kill other PODs which require the scheduler to relaunch them.
3) Randomly bring down nodes.
4) If it happens often: schedule some pods using the default scheduler on the same cluster, and make sure that YuniKorn reacts correctly.

*We need to make sure to verify the following abnormalities:*
1) Are there any abnormal scheduler states, such as negative available resource, total used > total available, etc.?
2) Is there any violation of scheduling constraints (like allocated > maximum-allowed, anti-affinity, etc.)?
3) If we run applications with a finite lifetime (like batch), we should expect such applications to complete. (End states check.)
4) Do we see nodes with sufficient available resources while some pods stay in the pending state?

[https://github.com/asobti/kube-monkey] looks like a good tool to create chaos within a cluster.

> Introduce Chaos Monkey tests to YuniKorn
> ---
>
> Key: YUNIKORN-218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-218
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: test - e2e
> Reporter: Wangda Tan
> Priority: Major
>
> We're seeing reported corner cases which cause the scheduler to malfunction, fail to allocate PODs, etc.
> To systematically prevent such issues from happening, I think we need to introduce ChaosMonkey tests.
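The first two verification items in the comment above (abnormal scheduler states and constraint violations) could be checked mechanically after each chaos round. Below is a minimal sketch of such a check, not YuniKorn code: the `Snapshot` type, its fields, and `CheckInvariants` are hypothetical names, and "total available" is assumed here to mean the total capacity of the cluster for a single resource type.

```go
package main

import "fmt"

// Snapshot is a hypothetical, simplified view of scheduler state for one
// resource type (for example vcores). It is not a YuniKorn type.
type Snapshot struct {
	Capacity       int64 // total resource registered with the scheduler
	Used           int64 // total resource currently allocated
	Available      int64 // available resource as tracked by the scheduler
	QueueAllocated int64 // resource allocated in one queue
	QueueMax       int64 // maximum resource allowed for that queue
}

// CheckInvariants returns one message per violated invariant.
// An empty result means no abnormality was detected.
func CheckInvariants(s Snapshot) []string {
	var problems []string
	if s.Available < 0 {
		problems = append(problems, fmt.Sprintf("negative available resource: %d", s.Available))
	}
	if s.Used > s.Capacity {
		problems = append(problems, fmt.Sprintf("total used %d > total capacity %d", s.Used, s.Capacity))
	}
	if s.QueueAllocated > s.QueueMax {
		problems = append(problems, fmt.Sprintf("queue allocated %d > maximum allowed %d", s.QueueAllocated, s.QueueMax))
	}
	return problems
}

func main() {
	// A deliberately broken snapshot, as might be observed after a chaos round.
	broken := Snapshot{Capacity: 100, Used: 120, Available: -20, QueueAllocated: 120, QueueMax: 50}
	for _, p := range CheckInvariants(broken) {
		fmt.Println("violation:", p)
	}
}
```

In a real test the snapshot would have to be collected from the scheduler under test; how to obtain it is left open here.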
[jira] [Created] (YUNIKORN-218) Introduce Chaos Monkey tests to YuniKorn
Wangda Tan created YUNIKORN-218:
---
Summary: Introduce Chaos Monkey tests to YuniKorn
Key: YUNIKORN-218
URL: https://issues.apache.org/jira/browse/YUNIKORN-218
Project: Apache YuniKorn
Issue Type: Bug
Components: test - e2e
Reporter: Wangda Tan

We're seeing reported corner cases which cause the scheduler to malfunction, fail to allocate PODs, etc.
To systematically prevent such issues from happening, I think we need to introduce ChaosMonkey tests.
[jira] [Commented] (YUNIKORN-42) High efficient scheduling events framework phase 1
[ https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126965#comment-17126965 ]

Wangda Tan commented on YUNIKORN-42:
---
[~adam.antal], can you make it public? (You may need to create it in your own Google Drive and set it to be viewable by the public.)

> High efficient scheduling events framework phase 1
> ---
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
> Issue Type: Task
> Reporter: Wangda Tan
> Assignee: Adam Antal
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently it is tricky to troubleshoot pod allocation; we need to better expose this information in the POD description.
[jira] [Created] (YUNIKORN-184) Update YuniKorn-Core Design Documentation
Wangda Tan created YUNIKORN-184:
---
Summary: Update YuniKorn-Core Design Documentation
Key: YUNIKORN-184
URL: https://issues.apache.org/jira/browse/YUNIKORN-184
Project: Apache YuniKorn
Issue Type: Task
Components: documentation
Reporter: Wangda Tan

The original design doc is pretty outdated; it was written before yunikorn-core was completed. A lot of things have changed since then, so we need to refresh the design doc so that new members can more easily understand the structure inside yunikorn-core.
[jira] [Commented] (YUNIKORN-173) Generates one default application ID per namespace in the admission controller
[ https://issues.apache.org/jira/browse/YUNIKORN-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114422#comment-17114422 ]

Wangda Tan commented on YUNIKORN-173:
---
[~wwei], [~wilfreds], what is the behavior when namespace -> queue creation/mapping is disabled?

> Generates one default application ID per namespace in the admission controller
> ---
>
> Key: YUNIKORN-173
> URL: https://issues.apache.org/jira/browse/YUNIKORN-173
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9
>
> If an app doesn't explicitly specify an application ID, let's group such pods into one single app per namespace.
[jira] [Commented] (YUNIKORN-2) Support Gang Scheduling
[ https://issues.apache.org/jira/browse/YUNIKORN-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108784#comment-17108784 ]

Wangda Tan commented on YUNIKORN-2:
---
Added a design doc, still largely WIP. [~wwei], [~wilfreds], [~Tao Yang], [~sunil.gov...@gmail.com] please review and give some feedback when you get a chance.

> Support Gang Scheduling
> ---
>
> Key: YUNIKORN-2
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2
> Project: Apache YuniKorn
> Issue Type: New Feature
> Components: core - scheduler
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
>
> Gang scheduling is one of the most important features for schedulers; it is very useful for machine learning workloads such as tensorflow, pytorch, etc.
> Since yunikorn has the notion of an application, it is not very hard for us to support gang scheduling.
> This umbrella tracks the efforts to support the gang scheduling feature in yunikorn.
[jira] [Commented] (YUNIKORN-21) Revisit node sorting algorithm for fairness
[ https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092781#comment-17092781 ]

Wangda Tan commented on YUNIKORN-21:
---
Thanks [~Tao Yang], the answer looks reasonable to me, and the number also looks very promising.

bq. The scheduling throughput can be improved from 450 to 5000+ in a mock cluster with 1000 nodes according to the benchmark results of scheduler_perf_test.go in my local test.

I reread the design doc, and I think I understand it better now. At a high level, the class design makes sense, and if you can provide a PoC patch, I can help review it and give more detailed suggestions.

After another thought, I think the weighted sort policy may make sense if the number of scorers involved is small (no more than 3). I want to avoid having 20 scorers involved and producing a weighted score which we cannot explain at all.

> Revisit node sorting algorithm for fairness
> ---
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Wangda Tan
> Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node sorting algorithm v2.pdf
>
> Currently, we're using DominantRatio for the node sorting algorithm:
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>     lshares := getShares(left, nil)
>     rshares := getShares(right, nil)
>     return compareShares(lshares, rshares)
> }{code}
> Which is not good, for two reasons:
> # Dominant resource comparison is about 8X more expensive than single float comparisons for two resource types.
> # Dominant resource is not stable when we have scarce resource types like GPU. Take a node with 192GB mem, 32 vcores, and 1 GPU available, compared to one with 168GB mem, 64 vcores and 8 GPUs available; the former can go first because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>     // negative share is logged
>     if v < 0 {
>         log.Logger().Debug("usage is negative no total, share is also negative",
>             zap.Int64("resource quantity", int64(v)))
>     }
>     shares[idx] = float64(v)
>     idx++
>     continue
> }{code}
> I think we should discard the dominant resource comparison for node resources. Instead, we just use one resource type (like vcores) to compare available resources.
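The proposal at the end of the issue description above (compare nodes on one resource type instead of the dominant share) can be illustrated with a small sketch. This is an assumption-laden example, not the yunikorn-core implementation: `Node`, `SortByAvailable`, and the resource names are hypothetical, and the two nodes are the ones from the description.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified node with its available resources
// keyed by resource type. It is not the yunikorn-core node type.
type Node struct {
	Name      string
	Available map[string]int64
}

// SortByAvailable sorts nodes in place so that the node with the most
// available quantity of the given resource type comes first. Only a single
// integer comparison per pair is needed, regardless of how many resource
// types a node reports.
func SortByAvailable(nodes []Node, resource string) {
	sort.SliceStable(nodes, func(i, j int) bool {
		return nodes[i].Available[resource] > nodes[j].Available[resource]
	})
}

func main() {
	nodes := []Node{
		{Name: "node-a", Available: map[string]int64{"memory-gb": 192, "vcore": 32, "gpu": 1}},
		{Name: "node-b", Available: map[string]int64{"memory-gb": 168, "vcore": 64, "gpu": 8}},
	}
	SortByAvailable(nodes, "vcore")
	for _, n := range nodes {
		fmt.Println(n.Name, n.Available["vcore"])
	}
}
```

With vcores as the sort key, the node with 64 vcores available is ordered ahead of the node with 32, independent of the scarce GPU resource.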
[jira] [Commented] (YUNIKORN-21) Revisit node sorting algorithm for fairness
[ https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092340#comment-17092340 ]

Wangda Tan commented on YUNIKORN-21:
---
Several examples of node sorting policies:
1) Bin-packing policy: get a sorted node list based on most-used nodes.
2) Peanut-buttering policy: get a sorted node list based on least-used nodes.
3) Best-fit policy: get a sorted node list based on how similar each node's available resource vector is to the requested resource. For example, if the requested resource is cpu=2,mem=3, a node with available resource cpu=4,mem=6 is a better "fit" than another node with available resource cpu=6,mem=4. (Reference paper: https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf)

1 and 2 are not request-related, while 3 is request-related. I'm wondering how we deal with these different use cases based on the proposal.

Also, it will be important to make sure the node sorting policy can be used by the preemption logic.

> Revisit node sorting algorithm for fairness
> ---
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Wangda Tan
> Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node sorting algorithm v2.pdf
>
> Currently, we're using DominantRatio for the node sorting algorithm:
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>     lshares := getShares(left, nil)
>     rshares := getShares(right, nil)
>     return compareShares(lshares, rshares)
> }{code}
> Which is not good, for two reasons:
> # Dominant resource comparison is about 8X more expensive than single float comparisons for two resource types.
> # Dominant resource is not stable when we have scarce resource types like GPU. Take a node with 192GB mem, 32 vcores, and 1 GPU available, compared to one with 168GB mem, 64 vcores and 8 GPUs available; the former can go first because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>     // negative share is logged
>     if v < 0 {
>         log.Logger().Debug("usage is negative no total, share is also negative",
>             zap.Int64("resource quantity", int64(v)))
>     }
>     shares[idx] = float64(v)
>     idx++
>     continue
> }{code}
> I think we should discard the dominant resource comparison for node resources. Instead, we just use one resource type (like vcores) to compare available resources.
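The best-fit example in the comment above (request cpu=2,mem=3 against nodes with cpu=4,mem=6 and cpu=6,mem=4) can be worked through in code. The sketch below uses cosine similarity between the request vector and the available vector as the "fit" measure; that choice is an assumption made for illustration, and neither the comment nor this example claims it is the scoring used by YuniKorn or by the referenced paper. The function name `Fit` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Fit returns the cosine similarity between a requested resource vector and
// a node's available resource vector. Both slices are assumed to list the
// same resource types in the same order. A value of 1 means the two vectors
// point in exactly the same direction. This measure is an assumption of
// this sketch, not a YuniKorn API.
func Fit(request, available []float64) float64 {
	var dot, reqNorm, availNorm float64
	for i := range request {
		dot += request[i] * available[i]
		reqNorm += request[i] * request[i]
		availNorm += available[i] * available[i]
	}
	if reqNorm == 0 || availNorm == 0 {
		return 0 // no meaningful direction to compare
	}
	return dot / (math.Sqrt(reqNorm) * math.Sqrt(availNorm))
}

func main() {
	request := []float64{2, 3} // cpu=2, mem=3
	nodeA := []float64{4, 6}   // cpu=4, mem=6
	nodeB := []float64{6, 4}   // cpu=6, mem=4
	fmt.Printf("node A fit: %.2f\n", Fit(request, nodeA)) // prints "node A fit: 1.00"
	fmt.Printf("node B fit: %.2f\n", Fit(request, nodeB)) // prints "node B fit: 0.92"
}
```

Because the score depends on the request, a pre-sorted node list would be specific to each distinct request shape, which is the concern raised about request-related policies.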
[jira] [Commented] (YUNIKORN-21) Revisit node sorting algorithm for fairness
[ https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092325#comment-17092325 ]

Wangda Tan commented on YUNIKORN-21:
---
Thanks [~Tao Yang], I took a quick look at the doc. I'm not sure I fully understand the following details:
1) What is the difference between the node-sorting policy and the evaluator? If the node order is solely based on the evaluator, can we use one single component to replace the two? (Or we only expose the sorter interface and the evaluator becomes an implementation detail.)
2) The scope of the incremental sorting algorithm is not very clear to me. Are we going to maintain a sorted list for every request? It might be too much if we want to do it on a per-request basis (we could have many different requests).
3) I'm not quite sure about the "Weight" concept. Are we going to support multiple node scorers like the K8s default scheduler? I personally don't prefer that approach, since the behavior behind a weighted result is not easy to explain.
4) How fast can we do the re-sorting? Since the node list keeps changing, and node status also changes fast, are we going to keep an always up-to-date sorting result, or will we have some latency? (If we need pre-sorted node lists on a per-request basis, there are too many sorted node lists to maintain.)

> Revisit node sorting algorithm for fairness
> ---
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Wangda Tan
> Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node sorting algorithm v2.pdf
>
> Currently, we're using DominantRatio for the node sorting algorithm:
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>     lshares := getShares(left, nil)
>     rshares := getShares(right, nil)
>     return compareShares(lshares, rshares)
> }{code}
> Which is not good, for two reasons:
> # Dominant resource comparison is about 8X more expensive than single float comparisons for two resource types.
> # Dominant resource is not stable when we have scarce resource types like GPU. Take a node with 192GB mem, 32 vcores, and 1 GPU available, compared to one with 168GB mem, 64 vcores and 8 GPUs available; the former can go first because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>     // negative share is logged
>     if v < 0 {
>         log.Logger().Debug("usage is negative no total, share is also negative",
>             zap.Int64("resource quantity", int64(v)))
>     }
>     shares[idx] = float64(v)
>     idx++
>     continue
> }{code}
> I think we should discard the dominant resource comparison for node resources. Instead, we just use one resource type (like vcores) to compare available resources.
[jira] [Assigned] (YUNIKORN-77) De-dup event queue methods inside yunikorn-core, and improve logs
[ https://issues.apache.org/jira/browse/YUNIKORN-77?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan reassigned YUNIKORN-77:
---
Assignee: Wangda Tan

> De-dup event queue methods inside yunikorn-core, and improve logs
> ---
>
> Key: YUNIKORN-77
> URL: https://issues.apache.org/jira/browse/YUNIKORN-77
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Major
> Labels: pull-request-available
>
> Propose to improve the following areas:
> 1) Under cache/scheduler/rmproxy, we have duplicated enqueueAndCheckFull methods.
> 2) The existing enqueueAndCheckFull won't block when the event queue is full; we should block the write. Otherwise, we will miss events and cause weird bugs.
> 3) We lack logs when events are accumulating in the queue.
[jira] [Updated] (YUNIKORN-77) De-dup event queue methods inside yunikorn-core, and improve logs
[ https://issues.apache.org/jira/browse/YUNIKORN-77?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YUNIKORN-77:
---
Labels: pull-request-available (was: )

> De-dup event queue methods inside yunikorn-core, and improve logs
> ---
>
> Key: YUNIKORN-77
> URL: https://issues.apache.org/jira/browse/YUNIKORN-77
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common
> Reporter: Wangda Tan
> Priority: Major
> Labels: pull-request-available
>
> Propose to improve the following areas:
> 1) Under cache/scheduler/rmproxy, we have duplicated enqueueAndCheckFull methods.
> 2) The existing enqueueAndCheckFull won't block when the event queue is full; we should block the write. Otherwise, we will miss events and cause weird bugs.
> 3) We lack logs when events are accumulating in the queue.
[jira] [Created] (YUNIKORN-77) De-dup event queue methods inside yunikorn-core, and improve logs
Wangda Tan created YUNIKORN-77:
---
Summary: De-dup event queue methods inside yunikorn-core, and improve logs
Key: YUNIKORN-77
URL: https://issues.apache.org/jira/browse/YUNIKORN-77
Project: Apache YuniKorn
Issue Type: Improvement
Components: core - common
Reporter: Wangda Tan

Propose to improve the following areas:
1) Under cache/scheduler/rmproxy, we have duplicated enqueueAndCheckFull methods.
2) The existing enqueueAndCheckFull won't block when the event queue is full; we should block the write. Otherwise, we will miss events and cause weird bugs.
3) We lack logs when events are accumulating in the queue.
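The behavior proposed in item 2 above (block the writer instead of dropping the event when the queue is full) could look roughly like the following. This is a sketch under assumptions rather than the actual yunikorn-core code: `EnqueueBlocking` is a hypothetical helper, the event queue is assumed to be a buffered Go channel, and the log message is invented for illustration.

```go
package main

import (
	"fmt"
	"log"
)

// EnqueueBlocking writes ev to the queue. If the queue is full it logs a
// warning and then blocks until a consumer makes room, so the event is never
// dropped. It reports whether the slow (blocking) path was taken.
// Hypothetical helper, not the existing enqueueAndCheckFull.
func EnqueueBlocking(queue chan interface{}, ev interface{}) bool {
	select {
	case queue <- ev:
		return false // fast path: there was room in the queue
	default:
		log.Printf("event queue is full (capacity %d), blocking until space is available", cap(queue))
		queue <- ev // blocks until a consumer receives an event
		return true
	}
}

func main() {
	queue := make(chan interface{}, 1)
	done := make(chan struct{})

	// Consumer: handles events until the queue is closed.
	go func() {
		defer close(done)
		for ev := range queue {
			fmt.Println("handled", ev)
		}
	}()

	// Producer: with a capacity of 1, some writes may have to wait.
	for i := 1; i <= 3; i++ {
		EnqueueBlocking(queue, fmt.Sprintf("event-%d", i))
	}
	close(queue)
	<-done
}
```

Keeping this in one shared helper would also address item 1, since cache, scheduler, and rmproxy could all call the same function.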
[jira] [Commented] (YUNIKORN-69) Design YUNIKORN logo
[ https://issues.apache.org/jira/browse/YUNIKORN-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074136#comment-17074136 ]

Wangda Tan commented on YUNIKORN-69:
---
Thanks for the contribution from Yu-En! We have a better-than-average logo among all Apache projects. ;)

> Design YUNIKORN logo
> ---
>
> Key: YUNIKORN-69
> URL: https://issues.apache.org/jira/browse/YUNIKORN-69
> Project: Apache YuniKorn
> Issue Type: Task
> Reporter: Sunil G
> Priority: Major
> Fix For: 0.8
> Attachments: yunikorn-log-gray.png, yunikorn-logo-black.png, yunikorn-logo-blue (1).png, yunikorn-logo-main.png
>
> Create a new YuniKorn logo and use it in all projects.
[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures
[ https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071340#comment-17071340 ]

Wangda Tan commented on YUNIKORN-42:
---
[~wilfreds], thanks for sharing your feedback.

I agree that pushing all events to the POD/namespace description is not comprehensive. We may need something like YARN-9050 to get all the necessary information, customized to scheduler hierarchies, etc.

However, I think we need to push more information to the POD/namespace if possible. From early users of YuniKorn, we found that they are used to running {{describe pod}} to understand why an allocation failed. At least for the default scheduler, users can get pretty good insight into why an allocation failed. I think we should try to make the experience more native for users running YuniKorn on K8s; otherwise it could be a show-stopper for many users.

bq. The queue is out of resources, the user limit has been reached or even the maximum number of applications that can be run for a user is reached. That kind of information does not translate or help on the K8s side.

I think it is possible; we just need to go into the queue/user and populate this information to the event recorder. It is not straightforward, but it should be doable.

bq. Every event that we push back will be stored in etcd. I am worried about the amount of data we will push back to etcd if we are not careful.

We definitely need to test this more. It should not be significantly more than what the existing K8s event system does, since we have a throttling mechanism built in.

bq. Can we also guarantee that the YuniKorn admin can always describe all the pods and namespaces? That would require the admin to have high level access to the K8s cluster which he might not have. Something we need to keep in mind.

A lot of things should be done by the end user of the K8s cluster instead of the admin. The admin should ideally access our UI to understand more details if pod describe doesn't have enough information. (Of course, we need additional UI work.)

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> ---
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
> Issue Type: Task
> Reporter: Wangda Tan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently it is tricky to troubleshoot pod allocation; we need to better expose this information in the POD description.
[jira] [Resolved] (YUNIKORN-52) Update queue/app screenshots in README
[ https://issues.apache.org/jira/browse/YUNIKORN-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YUNIKORN-52.
---
Fix Version/s: 0.8
Resolution: Fixed

Merged, thanks [~wwei]

> Update queue/app screenshots in README
> ---
>
> Key: YUNIKORN-52
> URL: https://issues.apache.org/jira/browse/YUNIKORN-52
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: documentation
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.8
> Time Spent: 20m
> Remaining Estimate: 0h
>
> We have some updates in UI, let's refresh the screenshots as well.
[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures
[ https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064892#comment-17064892 ]

Wangda Tan commented on YUNIKORN-42:
---
Thanks [~Tao Yang], you're absolutely right. There are two requirements in the doc:
{code}
- Events and diagnostics can be retrieved from YuniKorn UI / REST API. (P2)
- Be able to filter events/diagnostics based on queue/app/node/pod. (P2)
{code}
I marked them as P2 for now because retrieving them from the YuniKorn web UI needs a lot of front-end work to make it complete and to make sure secure access to the UI can be enforced. Meanwhile, many early users of YuniKorn are already familiar with existing K8s tools like POD events for understanding what's going on in the cluster. So I think we can publish POD events first, to satisfy 70-80% of troubleshooting needs without requiring admins to learn anything specific to YuniKorn.

We should definitely consider some more powerful and integrated tools for admins to use, similar to YARN-9050, after we finish the basic work.

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> ---
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
> Issue Type: Task
> Reporter: Wangda Tan
> Priority: Major
>
> Currently it is tricky to troubleshoot pod allocation; we need to better expose this information in the POD description.
[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures
[ https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063652#comment-17063652 ] Wangda Tan commented on YUNIKORN-42: I may not have the bandwidth to start or finish the end-to-end work in the next two weeks. If someone is interested in working on this immediately, please feel free to comment on the JIRA and take over the tasks. Otherwise, I can spend some cycles pushing this forward, but I cannot make any commitment. > Better to support POD events for YuniKorn to troubleshoot allocation failures > - > > Key: YUNIKORN-42 > URL: https://issues.apache.org/jira/browse/YUNIKORN-42 > Project: Apache YuniKorn > Issue Type: Task >Reporter: Wangda Tan >Priority: Major > > Troubleshooting pod allocation is currently tricky; we need to better expose this information in the POD description.
[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures
[ https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063649#comment-17063649 ] Wangda Tan commented on YUNIKORN-42: Spent some time thinking about the problem and put the ideas into a design doc: https://docs.google.com/document/d/19iMkLJGVwTSq9OfV9p75wOobJXAxR_CRo4YzSRG_Pzw/edit# cc: [~cheersyang], [~taoyang], [~wilfreds] to check the ideas. > Better to support POD events for YuniKorn to troubleshoot allocation failures > - > > Key: YUNIKORN-42 > URL: https://issues.apache.org/jira/browse/YUNIKORN-42 > Project: Apache YuniKorn > Issue Type: Task >Reporter: Wangda Tan >Priority: Major > > Troubleshooting pod allocation is currently tricky; we need to better expose this information in the POD description.
[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures
[ https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063635#comment-17063635 ] Wangda Tan commented on YUNIKORN-42: Attached the design doc (WIP). > Better to support POD events for YuniKorn to troubleshoot allocation failures > - > > Key: YUNIKORN-42 > URL: https://issues.apache.org/jira/browse/YUNIKORN-42 > Project: Apache YuniKorn > Issue Type: Task >Reporter: Wangda Tan >Priority: Major > > Troubleshooting pod allocation is currently tricky; we need to better expose this information in the POD description.
[jira] [Created] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures
Wangda Tan created YUNIKORN-42: -- Summary: Better to support POD events for YuniKorn to troubleshoot allocation failures Key: YUNIKORN-42 URL: https://issues.apache.org/jira/browse/YUNIKORN-42 Project: Apache YuniKorn Issue Type: Task Reporter: Wangda Tan Troubleshooting pod allocation is currently tricky; we need to better expose this information in the POD description.
[jira] [Commented] (YUNIKORN-40) Add doc for YuniKorn Community Sync Up
[ https://issues.apache.org/jira/browse/YUNIKORN-40?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063013#comment-17063013 ] Wangda Tan commented on YUNIKORN-40: Thanks [~wwei] for committing the patch. [~wilfreds]: [~wwei] is correct, I'm working on YUNIKORN-41 to link to the PR. Patch is already available. > Add doc for YuniKorn Community Sync Up > -- > > Key: YUNIKORN-40 > URL: https://issues.apache.org/jira/browse/YUNIKORN-40 > Project: Apache YuniKorn > Issue Type: Task > Components: community, documentation >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Labels: pull-request-available > Fix For: 0.8 > > Time Spent: 20m > Remaining Estimate: 0h >
[jira] [Assigned] (YUNIKORN-41) Update YuniKorn main README.md to make user can better get focus of the project
[ https://issues.apache.org/jira/browse/YUNIKORN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YUNIKORN-41: -- Assignee: Wangda Tan > Update YuniKorn main README.md to make user can better get focus of the > project > --- > > Key: YUNIKORN-41 > URL: https://issues.apache.org/jira/browse/YUNIKORN-41 > Project: Apache YuniKorn > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The message on the Apache YuniKorn GitHub page is a bit confusing because it puts YARN and “general scheduler” together. > Most of the companies we spoke to were initially confused: is it a K8s scheduler, or is it a YARN RM hooked up to K8s? I think we need to do substantial work to get the marketing message right. > To me, the #1 use case YuniKorn is trying to solve now is to be the best scheduler in K8s for batch workloads. > It is nicely designed to allow yunikorn-core to hook up with other RMs, but that is the next step. > We can also include other items, such as highlights of the K8s integration, perf tests, and community sync-ups, in the same README.
[jira] [Created] (YUNIKORN-41) Update YuniKorn main README.md to make user can better get focus of the project
Wangda Tan created YUNIKORN-41: -- Summary: Update YuniKorn main README.md to make user can better get focus of the project Key: YUNIKORN-41 URL: https://issues.apache.org/jira/browse/YUNIKORN-41 Project: Apache YuniKorn Issue Type: Task Reporter: Wangda Tan The message on the Apache YuniKorn GitHub page is a bit confusing because it puts YARN and “general scheduler” together. Most of the companies we spoke to were initially confused: is it a K8s scheduler, or is it a YARN RM hooked up to K8s? I think we need to do substantial work to get the marketing message right. To me, the #1 use case YuniKorn is trying to solve now is to be the best scheduler in K8s for batch workloads. It is nicely designed to allow yunikorn-core to hook up with other RMs, but that is the next step. We can also include other items, such as highlights of the K8s integration, perf tests, and community sync-ups, in the same README.
[jira] [Assigned] (YUNIKORN-40) Add doc for YuniKorn Community Sync Up
[ https://issues.apache.org/jira/browse/YUNIKORN-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YUNIKORN-40: -- Assignee: Wangda Tan > Add doc for YuniKorn Community Sync Up > -- > > Key: YUNIKORN-40 > URL: https://issues.apache.org/jira/browse/YUNIKORN-40 > Project: Apache YuniKorn > Issue Type: Task > Components: community, documentation >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >