[jira] [Updated] (YUNIKORN-54) admission-controller: the generated applicationID should not exceed 64 chars

2020-03-30 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-54:
--
Target Version: 0.8

> admission-controller: the generated applicationID should not exceed 64 chars
> 
>
> Key: YUNIKORN-54
> URL: https://issues.apache.org/jira/browse/YUNIKORN-54
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a label exceeds 64 chars, it will be rejected by some K8s clients.
> Currently, we compose the generated name as -- 
> which might exceed 64 chars.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Closed] (YUNIKORN-63) Report occupied resources while registering a node

2020-03-30 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang closed YUNIKORN-63.
---
Fix Version/s: 0.8
   Resolution: Fixed

> Report occupied resources while registering a node
> --
>
> Key: YUNIKORN-63
> URL: https://issues.apache.org/jira/browse/YUNIKORN-63
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler, scheduler-interface, shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We need to support this while registering a node; otherwise there could be a 
> race condition in which the shim reports an update before the node is 
> registered.






[jira] [Created] (YUNIKORN-65) Remove job references from the web app

2020-03-30 Thread Wilfred Spiegelenburg (Jira)

Wilfred Spiegelenburg created YUNIKORN-65:
-

 Summary: Remove job references from the web app
 Key: YUNIKORN-65
 URL: https://issues.apache.org/jira/browse/YUNIKORN-65
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: webapp
Reporter: Wilfred Spiegelenburg


In the shim and core we have moved away from job and are using application as a 
generic term to talk about sets of related allocations.

The web code still uses job everywhere for internal objects. We should change 
it to application in the webapp for consistency.






[jira] [Commented] (YUNIKORN-63) Report occupied resources while registering a node

2020-03-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071431#comment-17071431
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-63:
---

I merged the changes in both repos: core and shim.

We should be OK for this one. [~wwei], please confirm that this was everything 
so we can close this.

> Report occupied resources while registering a node
> --
>
> Key: YUNIKORN-63
> URL: https://issues.apache.org/jira/browse/YUNIKORN-63
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler, scheduler-interface, shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We need to support this while registering a node; otherwise there could be a 
> race condition in which the shim reports an update before the node is 
> registered.






[jira] [Commented] (YUNIKORN-63) Report occupied resources while registering a node

2020-03-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071407#comment-17071407
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-63:
---

Tracking why this is being added [from the pr 
comment|https://github.com/apache/incubator-yunikorn-k8shim/pull/91#issuecomment-606200513]:
{quote}{quote}See my comment in the jira: why this change from not needing it 
in apache/incubator-yunikorn-core#108 to needing it now?
{quote}
This was found while testing the recovery on a real cluster. When it tries to 
report the occupied resources to the scheduler-core, the node is not yet 
registered. Therefore, the latest patch ensures that the initial occupied 
resources are counted correctly when we register the node.

My previous fix (only updating the node capacity) might have the same issue, 
but I guess I missed that during the local test (which did not cover the 
recovery test).
{quote}

> Report occupied resources while registering a node
> --
>
> Key: YUNIKORN-63
> URL: https://issues.apache.org/jira/browse/YUNIKORN-63
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler, scheduler-interface, shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to support this while registering a node; otherwise there could be a 
> race condition in which the shim reports an update before the node is 
> registered.






[jira] [Commented] (YUNIKORN-64) Add zap encoders for time and duration

2020-03-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071402#comment-17071402
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-64:
---

The problem is in the shim code and the way the logger is initialised for 
production cases.

I will open a PR to fix the shim code and we will revert the code change that 
was made in the core via a second PR.

> Add zap encoders for time and duration
> --
>
> Key: YUNIKORN-64
> URL: https://issues.apache.org/jira/browse/YUNIKORN-64
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Priority: Major
>
> In YUNIKORN-62 a nil pointer caused a panic. The fix was to change the log 
> call from a Duration to a String.
> The real issue is the fact that the encoders for certain types are not set in 
> the logger. This is most likely an OS or internationalisation configuration 
> related issue. In the code we fail here:
> {code}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11ac568]
>
> goroutine 51 [running]:
> go.uber.org/zap/zapcore.(*jsonEncoder).AppendDuration(0xc007eca390, 0x77370eba)
>     /Users/wyang/go/pkg/mod/go.uber.org/zap@v1.13.0/zapcore/json_encoder.go:239
> {code}
> That links to the source code here:
> {code}
> 237  func (enc *jsonEncoder) AppendDuration(val time.Duration) {
> 238   cur := enc.buf.Len()
> 239   enc.EncodeDuration(val, enc)
> {code}
> That can only happen if the {{EncodeDuration}} function is nil.
> This links back to a similar [zap issue 
> #645|https://github.com/uber-go/zap/issues/645] for time stamps. In the 
> comments it says that the encoders are not optional for certain parts. We 
> should make sure that the encoders are set correctly as that looks like it is 
> the correct solution.






[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-30 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071340#comment-17071340
 ] 

Wangda Tan commented on YUNIKORN-42:


[~wilfreds], thanks for sharing your feedback. 

I agree that pushing all events to POD/namespace description is not 
comprehensive. We may need something like YARN-9050 to get all the necessary 
information which is customized to scheduler hierarchies, etc. 

However, I think we need to push more information to POD/namespace if possible. 
From early users of YuniKorn, we found that they are used to running 
{{describe pod}} to understand why an allocation failed. At least for the 
default scheduler, users can get pretty good insight into why an allocation 
failed. 

I think we should try to make the experience more native for users running 
YuniKorn on K8s; without that, it could be a show-stopper for many users. 

bq. The queue is out of resources, the user limit has been reached or even the 
maximum number of applications that can be run for a user is reached. That kind 
of information does not translate or help on the K8s side.

I think it is possible; we just need to go into the queue/user and populate 
this information into the event recorder. It is not straightforward, but it 
should be doable. 

bq. Every event that we push back will be stored in etcd. I am worried about 
the amount of data we will push back to etcd if we are not careful.

We definitely need to test this more. It should not be significantly more than 
what the existing K8s event system does, since we have a throttling mechanism 
built in. 

bq. Can we also guarantee that the YuniKorn admin can always describe all the 
pods and namespaces? That would require the admin to have high level access to 
the K8s cluster which he might not have. Something we need to keep in mind.

A lot of things should be done by the end user of the K8s cluster instead of 
the admin. The admin should ideally access our UI to understand more details if 
pod describe doesn't have enough information. (Of course, we need additional UI 
work.)

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It is currently tricky to troubleshoot pod allocation; we need to better 
> expose this information in the POD description.






[jira] [Updated] (YUNIKORN-63) Report occupied resources while registering a node

2020-03-30 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YUNIKORN-63:

Target Version: 0.8

> Report occupied resources while registering a node
> --
>
> Key: YUNIKORN-63
> URL: https://issues.apache.org/jira/browse/YUNIKORN-63
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler, scheduler-interface, shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to support this while registering a node; otherwise there could be a 
> race condition in which the shim reports an update before the node is 
> registered.






[jira] [Resolved] (YUNIKORN-58) Create yunikorn website

2020-03-30 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-58.
-
Fix Version/s: 0.8
   Resolution: Fixed

> Create yunikorn website 
> 
>
> Key: YUNIKORN-58
> URL: https://issues.apache.org/jira/browse/YUNIKORN-58
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Sunil G
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A new repo has been created for the web site
> Need to setup build process and content for the site
> the website is needed for the first release of yunikorn






[jira] [Commented] (YUNIKORN-58) Create yunikorn website

2020-03-30 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071116#comment-17071116
 ] 

Weiwei Yang commented on YUNIKORN-58:
-

Thanks for getting this done, [~wilfreds], [~sunilg]. (y)

> Create yunikorn website 
> 
>
> Key: YUNIKORN-58
> URL: https://issues.apache.org/jira/browse/YUNIKORN-58
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Sunil G
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A new repo has been created for the web site
> Need to setup build process and content for the site
> the website is needed for the first release of yunikorn






[jira] [Commented] (YUNIKORN-64) Add zap encoders for time and duration

2020-03-30 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071114#comment-17071114
 ] 

Weiwei Yang commented on YUNIKORN-64:
-

Hi [~wilfreds], thanks for finding the root cause of the issue. I agree we 
need to fix this and get back to using time.Duration for better logging.

> Add zap encoders for time and duration
> --
>
> Key: YUNIKORN-64
> URL: https://issues.apache.org/jira/browse/YUNIKORN-64
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Priority: Major
>
> In YUNIKORN-62 a nil pointer caused a panic. The fix was to change the log 
> call from a Duration to a String.
> The real issue is the fact that the encoders for certain types are not set in 
> the logger. This is most likely an OS or internationalisation configuration 
> related issue. In the code we fail here:
> {code}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11ac568]
>
> goroutine 51 [running]:
> go.uber.org/zap/zapcore.(*jsonEncoder).AppendDuration(0xc007eca390, 0x77370eba)
>     /Users/wyang/go/pkg/mod/go.uber.org/zap@v1.13.0/zapcore/json_encoder.go:239
> {code}
> That links to the source code here:
> {code}
> 237  func (enc *jsonEncoder) AppendDuration(val time.Duration) {
> 238   cur := enc.buf.Len()
> 239   enc.EncodeDuration(val, enc)
> {code}
> That can only happen if the {{EncodeDuration}} function is nil.
> This links back to a similar [zap issue 
> #645|https://github.com/uber-go/zap/issues/645] for time stamps. In the 
> comments it says that the encoders are not optional for certain parts. We 
> should make sure that the encoders are set correctly as that looks like it is 
> the correct solution.






[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

2020-03-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070924#comment-17070924
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-42:
---

I had not looked at this before but I agree with Tao's remark: we need this on 
the scheduler side. A pod is just a simple allocation in that view. Some of the 
data might make sense to flow back to the events on the K8s side but not all of 
them will.

The fact that the pod is not scheduled is a scheduler internal thing. Pushing 
the information back to K8s does not make it any clearer in most cases. How do 
you explain YuniKorn scheduling limits in K8s? For example: the queue is out of 
resources, the user limit has been reached or even the maximum number of 
applications that can be run for a user is reached. That kind of information 
does not translate or help on the K8s side. We need to be able to show that on 
our web UI or via metrics that we publish. Saying that publishing YuniKorn 
internal events in a K8s way solves the problem of how to troubleshoot YuniKorn 
is oversimplifying the issue at best and incorrect at worst.

Every event that we push back will be stored in etcd. I am worried about the 
amount of data we will push back to etcd if we are not careful.
Can we also guarantee that the YuniKorn admin can always describe all the pods 
and namespaces? That would require the admin to have high level access to the 
K8s cluster which he might not have. Something we need to keep in mind.

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It is currently tricky to troubleshoot pod allocation; we need to better 
> expose this information in the POD description.






[jira] [Commented] (YUNIKORN-58) Create yunikorn website

2020-03-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070711#comment-17070711
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-58:
---

Main site committed via PR#1

> Create yunikorn website 
> 
>
> Key: YUNIKORN-58
> URL: https://issues.apache.org/jira/browse/YUNIKORN-58
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Sunil G
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A new repo has been created for the web site
> Need to setup build process and content for the site
> the website is needed for the first release of yunikorn


