[libreoffice-website] Minutes from the Tue Feb 20 infra call

Guilhem Moulin Tue, 20 Feb 2018 11:03:07 -0800

Participants
 1. davido
 2. guilhem
 3. Brett
 4. Christian

Agenda
 * Upgrade Gerrit to 2.13.10
   + Issue filed in Redmine
   + Who is working on it
   + https://redmine.documentfoundation.org/issues/2463
     - G. Do we need to go through that again?  The issue was filed and
       assigned, we'll follow up from there (no update there = no news)
   + Next steps
   + Set up staging gerrit instance and sync production data
     - Assigned target to Q1

 * Monitoring update
   + Brett: I believe Prometheus to be the better solution for infra
     monitoring.
     - Prometheus is actively maintained by the community, whereas TICK
       relies on InfluxDB.  Exporters (data collection binaries) are
       already available in Debian stable and Debian backports (even the
       shiny, new 2.0 release).  TICK requires external repositories and
       would require some auditing (see: Chronograf phoning home by
       default).
       . G. yay, agreed :-)
       . What's up with the docker containers on vm213?  Brett: I had
         just installed a bunch of throwaway services for metrics
         testing.
       . grafana: http://localhost:3000/dashboard/db/node-exporter-full
         (forward 3000/TCP to vm213 first)
       . prometheus: http://localhost:9090/consoles/node-overview.html
         (forward 9090/TCP to vm213 first)
     - Prometheus does not have a useful built-in dashboard, only ad-hoc
       query input; they recommend using Grafana for that.  Presently,
       Debian only has a Grafana package in Sid.
       . Package was removed from testing during the freeze due to two RC
         bugs, and was subsequently orphaned by its maintainer
         <https://bugs.debian.org/876648>
       . G. Not a blocker for us: I'd refrain from using third party
         repos when possible, but we only need a single installation of
         that package (installing prometheus/telegraf from a third party
         repo on every single host would be another story…)  Might even
         step in and adopt the package if its maintenance is not a
         burden :-)
     - https://prometheus.io/docs/introduction/faq/#how-does-prometheus-compare-against-other-monitoring-systems?
     - Debian only has a small number of exporters available in the
       repos; we'd have to manually install/configure any additional
       exporters from https://prometheus.io/docs/instrumenting/exporters/
       . Not a blocker, we can build them ourselves and tell salt to
         install the .deb
     - G. Confidentiality and integrity protection
       . exporters installed on vm191 and vm213 for now; they currently
         communicate via intranet (private IP range)
       . monitoring server needs to be offsite (e.g., reuse
         monitoring.tdf), and metrics need to be protected: either with
         SSL/TLS, or with IPsec/VPN/…
         → go for TLS tunnels, most hosts have an nginx instance anyway
       . client auth: HTTP digest or client certs (ECDSA for minimum
         overhead)
       . server (exporter) auth: server certs (ECDSA for minimum
         overhead)
       . SSL/TLS protection needs an extra SSL termination proxy (e.g.,
         nginx or stunnel4); unfortunately both client and server insist
         on verifying the chain for mutual auth, instead of pinning the
         key material…
     - Dashboards (prometheus & grafana) protection: refrain from using
       SSO here so admins are not locked out if the SAML IdP or LDAP
       server is down.
     - Each exporter listens on a different INET port (91XY) and speaks
       HTTP with ‘/metrics’ as entry path.  Do we want to open a
       gazillion ports, or a single port with multiple entry paths
       behind the reverse proxy?  (e.g., ‘/MySQL/metrics’)
     - Status of monitoring.documentfoundation.org
       . Ubuntu 14.04.5 LTS root server, hosted at filoo
       . guilhem: Would prefer a Debian (stretch) box instead for the
         sake of uniformity, ok to wipe and recycle?
       . AI guilhem: get in touch with filoo if there is no rescue boot
     - Users have requested a public dashboard / status page, with basic
       info (no graph) such as service/host up/down and a custom field
       we (admin team) can fill to tell them we're aware of the problem
       and are working on it
       . Probably doable with prometheus API(?)
       . Exposing blackbox exporter metrics from our various services
         would probably be enough (HTTP return code and timing)
       . Example: https://status.lineageos.org/ , powered by
         https://hund.io/
         → Sponsored service?
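
A minimal sketch of the status-page idea above, assuming the blackbox
exporter's `probe_success` metric is scraped and that Prometheus answers
on localhost:9090 as in the forwarding notes.  The host names, the
`notes` mapping (standing in for the admin-filled custom field) and both
function names are made up for illustration; nothing here was decided on
the call.

```python
import json
import urllib.parse
import urllib.request


def fetch_probe_success(base_url="http://localhost:9090"):
    """Run an instant query for probe_success via the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": "probe_success"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{query}", timeout=10) as resp:
        return json.load(resp)


def summarize_probes(api_response, notes=None):
    """Turn an instant-query response into sorted (target, state, note) rows."""
    notes = notes or {}
    if api_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    rows = []
    for sample in api_response["data"]["result"]:
        target = sample["metric"].get("instance", "unknown")
        # Each sample value is [timestamp, "<value as string>"];
        # probe_success is 1 when the probe succeeded, 0 otherwise.
        up = float(sample["value"][1]) == 1.0
        rows.append((target, "up" if up else "down", notes.get(target, "")))
    return sorted(rows)


# Hand-written response in the shape the API returns (hypothetical targets).
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "probe_success",
                        "instance": "https://wiki.example.org"},
             "value": [1519150000.0, "1"]},
            {"metric": {"__name__": "probe_success",
                        "instance": "https://gerrit.example.org"},
             "value": [1519150000.0, "0"]},
        ],
    },
}

if __name__ == "__main__":
    admin_notes = {"https://gerrit.example.org": "aware, working on it"}
    for target, state, note in summarize_probes(SAMPLE, admin_notes):
        print(f"{target:30} {state:5} {note}")
```

Only up/down and a free-text note are rendered, matching the "no graph"
request; timing could be added the same way from
`probe_duration_seconds` if wanted.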

   + Is there a desire for just infra monitoring, or for
     application-level monitoring as well?  If we need application-level
     monitoring, the ELK stack is recommended:
     https://www.elastic.co/webinars/introduction-elk-stack
     - At least mail queue, database (MySQL, PostgreSQL, slapd)
       operations, HTTPd response code & timing
       . Brett: These can be handled by prometheus, though AFAIK
         apache/nginx need a module to get in-depth stats.
     - ELK: ElasticSearch + LogStash + Kibana
     - LogStash vs. graylog pros & cons?  (We already have an instance of
       the latter)
       . Brett: I forgot about graylog, sorry.  I see no reason to switch
         from it.
       . OK let's keep it then
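
For the application-level numbers listed above (mail queue and so on),
existing exporters would normally be the first choice.  Should a value
have no ready-made exporter, a small script could emit it in the
Prometheus text exposition format.  The sketch below only shows that
format; the metric name `mail_queue_length`, its label and the value are
hypothetical, and how the output would be collected was not discussed.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.  Label values are
    assumed not to contain quotes, backslashes or newlines (no escaping
    is done in this sketch).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gauge(
        "mail_queue_length",                  # hypothetical metric name
        "Messages waiting in the mail queue",
        [({"queue": "deferred"}, 3)],         # made-up sample value
    ), end="")
```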

   + Alerting
     - The alert system of our current Icinga-based monitoring system is
       (mostly) not working, and having working alerts is an incentive
       for refactoring the monitoring system; so we really want that one
       to work :-)
     - Threshold-based is good enough
     - As discussed a few calls ago, needs to be schedulable so
       volunteers aren't awoken during their vacation
       . https://github.com/prometheus/alertmanager/issues/876 is a
         feature request from last year with no priority :(
       . Possible workaround at
         https://github.com/prometheus/alertmanager/issues/517#issuecomment-250918957
         There are mentions that using the prometheus API/webhooks could
         work.
       . Wouldn't it be up to the volunteer to silence alarms when on
         vacation?
         → It's also about week-ends and nights: we don't want to give
           all infra volunteers the feeling that they are on duty
     - Need (at least) SMS *and* mail
       . Prometheus' alertmanager can be bridged to an SMS provider
         https://github.com/messagebird/sachet
       . Brett: I've had success with using email as provided by
         telecoms, e.g. I use T-Mobile (Deutsche Telekom) and can email
         [email protected] to get a text.
         https://support.t-mobile.com/docs/DOC-3309
       . How about other countries?  Not aware of a Swedish provider
         offering a similar service
       . also sipgate has an API (RPS & REST) to send SMS
         (https://teamhelp.sipgate.de/hc/de/articles/207867549-Die-sipgate-APIs-im-%C3%9Cberblick
         German entry page to various API docs (spec in English))
       . so does pingdom.com
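
The scheduling requirement above (no alerts on vacation, week-ends or at
night) could be handled outside alertmanager, for instance in a small
webhook receiver that decides who gets the SMS/mail.  The following is a
rough sketch of just that decision, not an agreed design: the
`Volunteer` fields, the default quiet hours and the names are all
assumptions, and times are naive datetimes taken to be in each
volunteer's local time.

```python
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass
class Volunteer:
    name: str
    weekdays: frozenset = frozenset(range(5))  # 0 = Monday; default Mon-Fri
    quiet_start: time = time(22, 0)            # no alerts from here...
    quiet_end: time = time(7, 0)               # ...until here
    vacations: tuple = ()                      # (first_day, last_day) pairs


def is_reachable(volunteer, when):
    """Return True if `volunteer` accepts alerts at datetime `when`."""
    if when.weekday() not in volunteer.weekdays:
        return False
    if any(first <= when.date() <= last for first, last in volunteer.vacations):
        return False
    now = when.time()
    if volunteer.quiet_start <= volunteer.quiet_end:
        quiet = volunteer.quiet_start <= now < volunteer.quiet_end
    else:
        # Quiet period wraps around midnight (e.g. 22:00 to 07:00).
        quiet = now >= volunteer.quiet_start or now < volunteer.quiet_end
    return not quiet


def recipients(volunteers, when):
    """Names of the volunteers who should be notified at `when`."""
    return [v.name for v in volunteers if is_reachable(v, when)]


TEAM = [  # hypothetical team
    Volunteer("alice"),
    Volunteer("bob", vacations=((date(2018, 2, 19), date(2018, 2, 23)),)),
    Volunteer("carol", weekdays=frozenset({5, 6})),
]

if __name__ == "__main__":
    print(recipients(TEAM, datetime(2018, 2, 20, 18, 30)))
```

An empty result means nobody is reachable, so a real setup would still
need a fallback (for example mail only) for that case.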

 * SSO adoption <https://user.documentfoundation.org>:
   + 572 accounts in total (72 since the last infra call)
   + Nextcloud is now using SAML (unauthenticated users are redirected
     to auth.tdf); accounts not in LDAP yet are *locked out*
   + All MC members now have an LDAP account; shared creds (HTTP digest
     auth) are now deprecated
   + governance: 1/10 board members missing; 40/190 (21%) TDF members
     missing
   + contributors: 84/175 (48%) recent (last 90 days) wiki editors
     missing
   + Need to resume the redmine migration to SAML

 * Next call: Tuesday March 20 2018 at 18:30 Berlin time (17:30 UTC)

--
Guilhem.

