Guys,
Update: the TLAB and the AWS CD systems were in agreement today.
Note the AWS system is now running 4 x 64G VMs (32 cores and 256G RAM in total) – I am bringing up a 2nd 9 x 16G VM mirror system as well.
For those deploying to AWS: the script for configuring the only cloud-native part of the cluster – the EFS wrapper around the NFS file share (hosted outside all of the VMs as a service) – is detailed below. You can run the EFS script after the normal cloud-agnostic oom_rancher_install.sh script in
https://jira.onap.org/browse/LOG-325
https://wiki.onap.org/display/DW/Cloud+Native+Deployment#CloudNativeDeployment-EFS/NFSProvisioningScriptforAWS
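As a rough sketch, the run order on the AWS host would be something like the following – the EFS script name is a placeholder on my part (take the actual script and its arguments from LOG-325 / the wiki page above):

   # normal cloud-agnostic install first (Rancher/Kubernetes/OOM)
   sudo ./oom_rancher_install.sh
   # then the AWS-only step: wrap the EFS mount as the NFS share for the cluster
   sudo ./efs_nfs_provision.sh   # hypothetical name - use the script from the wiki page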
There were only 2 failures in a good build earlier today on both systems (41/43) – one likely due to intermittent timing. Both systems failed at the same time and on the same 2 pods, which is rare.
4 x 64G – AWS
http://jenkins.onap.info/job/oom-cd-master/2971/consoleFull
16:00:18 43 critical tests, 41 passed, 2 failed
9 x 16G - TLAB
https://jenkins.onap.org/view/External%20Labs/job/lab-tlab-beijing-oom-deploy/321/console
21:37:46 14:19:40-0700 43 critical tests, 41 passed, 2 failed
http://kibana.onap.info:5601/app/kibana#/dashboard/AWAtvpS63NTXK5mX2kuS?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2018-05-19T14:46:38.216Z',mode:absolute,to:'2018-05-19T17:22:18.756Z'))&_a=(description:'',filters:!(),options:(darkTheme:!f),panels:!((col:1,id:AWAts77k3NTXK5mX2kuM,panelIndex:1,row:1,size_x:8,size_y:3,type:visualization),(col:9,id:AWAtuTVI3NTXK5mX2kuP,panelIndex:2,row:1,size_x:4,size_y:3,type:visualization),(col:1,id:AWAtuBTY3NTXK5mX2kuO,panelIndex:3,row:7,size_x:6,size_y:3,type:visualization),(col:1,id:AWAttmqB3NTXK5mX2kuN,panelIndex:4,row:4,size_x:6,size_y:3,type:visualization),(col:7,id:AWAtvHtY3NTXK5mX2kuR,panelIndex:6,row:4,size_x:6,size_y:6,type:visualization)),query:(match_all:()),timeRestore:!f,title:'CD%20Health%20Check',uiState:(),viewMode:view)
ONAP-wide resource allocations:
The healthcheck, even on this system (and the TLAB system), is very sensitive to un-optimized rogue containers (logstash being one), the order of pods, and readiness/liveness timing. As we fine-tune and prioritize container resources we should get better. I am scheduling a performance meeting for 11:30 Thu to go over a couple of the containers causing issues (the ELK stack under a 30 logs/sec idle load) – such as the high indexing behavior on the logstash DaemonSet, and whether/how elasticsearch should also be a DaemonSet. Mike and Mandeep have work scheduled to do most of this for all of OOM in general – the meeting will be public and will start with the logging pods.
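For anyone who wants to poke at this before Thu – a minimal sketch for spotting the rogue containers, assuming the cluster has metrics wired up for kubectl top:

   # per-node and per-container cpu/ram consumption
   kubectl top nodes
   kubectl top pods -n onap --containers | grep -i logstash   # e.g. the logstash daemonset pods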
A lot of this will be things like ReplicaSet sizes: which pods to switch to DaemonSets (1 container per VM), which to put under autoscalers, CPU limits (in cores – sorry, no %), RAM limits, collocation rules (which pods get affected by others on the same VM), and the cluster VM granularity sweet spot (32/16/8G VMs and the tradeoff of limiting VM-local effects against reduced collocation). For example, a rogue container hogging 6 of 32 cores on a 128G VM in a cluster is less disruptive than one on an 8-core VM, but an 8G VM may only have room for 1 or 2 pods that need huge heaps. However, all of these optimizations should be done together with all the PTLs, because if we arbitrarily set priorities via RAM/CPU limits on some pods, others will be effectively downgraded – we need a hierarchy, as Mike mentioned.
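As a rough illustration of the knobs involved (the deployment name and numbers below are placeholders, not agreed values – the real ones need the PTL hierarchy above):

   # cap a single deployment - cores and Gi, no %
   kubectl set resources deployment dep-config-binding-service -n onap \
     --limits=cpu=2,memory=4Gi --requests=cpu=500m,memory=1Gi
   # or hand replica count to an autoscaler instead of a fixed ReplicaSet size
   kubectl autoscale deployment dep-config-binding-service -n onap \
     --min=1 --max=3 --cpu-percent=80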
I recommend the Windriver/TLAB system report the pod list just before the final healthcheck, like below, so we can split out healthchecks failing due to failed pods from HC failing on running pods – as well as HC passing on failing pods (false positives). If we use -o wide we can also determine the deployment distribution architecture of that particular install – i.e., which cluster VM each pod (especially the non-DaemonSet ReplicaSet ones) is running on.
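A minimal sketch of what that report step could run (standard kubectl – the grep is just to pull out the not-fully-running pods, and will also drop the header line):

   # full placement dump just before the final healthcheck
   kubectl get pods --all-namespaces -o wide
   # only the pods that are not Running/Completed, for correlating HC failures
   kubectl get pods --all-namespaces -o wide | grep -vE 'Running|Completed'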
Example of a couple of running pods – the full list is 125 of 150+:
16:00:01 List of ONAP Modules
16:00:01 NAMESPACE  NAME                                          READY  STATUS   RESTARTS  AGE  IP             NODE
16:00:01 onap       dep-config-binding-service-68b4695cb4-l4tst   2/2    Running  0         3h   10.42.233.237  ip-10-0-0-80.us-east-2.compute.internal
16:00:01 onap       dep-dcae-tca-analytics-68d749cb4c-7mzjg       2/2    Running  0         3h   10.42.174.250  ip-10-0-0-111.us-east-2.compute.internal
…..
For robot logs – I think it will be beneficial to add a filebeat sidecar to the robot pod, so we can query the ELK stack on 30253 for any robot healthcheck and ete logs as well.
https://jira.onap.org/browse/LOG-414
This was mentioned in a previous request:
https://lists.onap.org/pipermail/onap-discuss/2018-April/009199.html
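Once the sidecar is shipping, a query like this should pull the robot entries back out – assuming 30253 is the elasticsearch NodePort and a logstash-* index (both assumptions on my part):

   # pull the last few robot healthcheck log lines from the per-cluster ES
   curl -s 'http://<cluster-host>:30253/logstash-*/_search?q=message:healthcheck&size=5&pretty'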
thank you
/michael