Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-06-03 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  project  | Status:
 |  accepted
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-june |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-
Changes (by anarcat):

 * keywords:  tpa-roadmap-may => tpa-roadmap-june


Comment:

 i obviously did not have time to complete this in may, and i'm unlikely to
 do so in june either, but just in case, moving there.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-04-30 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  project  | Status:
 |  accepted
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-may  |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-
Changes (by anarcat):

 * keywords:  tpa-roadmap-april => tpa-roadmap-may


Comment:

 i fixed the timeout error, and did today's round of upgrades without too
 many problems. one issue that came up is that ganeti wasn't happy to
 chain-reboot machines: some instances had to have an `activate-disks` run
 so they would recognize their secondary. that has been added as a TODO in
 the code.

 i also made some experiments with feeding LDAP host lists as an argument
 to the reboot command, which also worked well. this, for example, rebooted
 the `rotation` hosts with a 10-minute delay:

 {{{
 ./reboot -H $(ssh alberti.torproject.org 'ldapsearch -h db.torproject.org
 -x -ZZ -b dc=torproject,dc=org -LLL
 "(&(hostname=*.torproject.org)(rebootPolicy=rotation))" hostname | awk
 "\$1 == \"hostname:\" {print \$2}" | sort') -v
 }}}
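
 For reference, the `hostname:` filtering that the awk expression above
 performs could be sketched in Python like this (a hypothetical helper,
 not part of the actual tooling):

 ```python
 def parse_hostnames(ldif):
     """Extract the values of 'hostname:' attributes from ldapsearch -LLL
     output, mirroring the awk filter in the command above."""
     hosts = []
     for line in ldif.splitlines():
         parts = line.split()
         # keep only lines of the form "hostname: <value>"
         if len(parts) == 2 and parts[0] == "hostname:":
             hosts.append(parts[1])
     return sorted(hosts)

 # example: two LDIF entries as returned by ldapsearch -LLL
 ldif = """dn: host=fallax,ou=hosts,dc=torproject,dc=org
 hostname: fallax.torproject.org

 dn: host=nutans,ou=hosts,dc=torproject,dc=org
 hostname: nutans.torproject.org
 """
 print(parse_hostnames(ldif))
 # prints ['fallax.torproject.org', 'nutans.torproject.org']
 ```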

 I added a modified recipe to the upgrades page, which covers all cases.

 I also set the reboot policy on a few hosts so they are classified
 properly; those didn't have a policy before, and now have:

 manual:

 * moly (KVM, requires special handling)
 * kvm4 (KVM)
 * kvm5 (KVM)
 * scw-arm-par1 (buggy buildbox, see #32920)
 * fsn-node-01 (ganeti, requires special handling)
 * fsn-node-02 (ganeti)
 * fsn-node-03 (ganeti)
 * weissii (windows buildbox, no ssh)
 * woronowii (windows buildbox, no ssh)
 * winklerianum (windows buildbox, no ssh)

 justdoit:

 * pauli (puppet)
 * rude (rt)
 * alberti (ldap)
 * eugeni (mail)
 * majus (translation)
 * rouyi (jenkins)
 * troodi (trac)
 * nevii (dns primary)
 * henryi (consensus-health)
 * vineale (gitweb)
 * gayi (svn)
 * polyanthum (bridges)
 * materculae (exonerator)
 * meronense (metrics.tpo)
 * colchicifolium (collector backend)
 * carinatum (DocTor)
 * build-x86-05 (buildbox)
 * build-x86-06 (buildbox)
 * build-x86-08 (buildbox)
 * build-x86-09 (buildbox)
 * perdulce (people.tpo)
 * staticiforme (static master)
 * forrestii (fpcentral)
 * subnotabile (survey)
 * crm-int-01 (CRM backend)
 * crm-ext-01 (CRM frontend)
 * submit-01 (mail)

 rotation:

 * fallax (DNS secondary)
 * omeiense (onionoo backend)
 * oo-hetzner-03 (onionoo backend)
 * neriniflorum (DNS secondary)
 * web-hetzner-01 (web frontend)
 * web-cymru-01 (web frontend)


 the following were already configured as...

 rotation:

 * orestis (onionoo backend)
 * nutans (DNS secondary)
 * cdn-backend-sunet-01 (web frontend)
 * hetzner-hel1-02 (DNS secondary)
 * hetzner-hel1-03 (web frontend)
 * onionoo-backend-01 (onionoo backend)
 * web-fsn-01 (web frontend)
 * web-fsn-02 (web frontend)
 * onionoo-frontend-01 (onionoo frontend)
 * cache01 (cache frontend)
 * cache-02 (cache frontend)
 * onionoo-backend-02 (onionoo backend)

 justdoit:

 * corsicum (collector)
 * hetzner-hel1-01 (nagios)
 * bungei (backup storage)
 * hetzner-nbg1-01 (prometheus)
 * hetzner-nbg1-02 (prometheus)
 * archive-01 (non-redundant web frontend)
 * loghost01 (syslog)
 * static-master-fsn (static master)
 * bacula-director-01 (backup director)
 * gettor-01 (gettor)
 * onionbalance-01 (onionbalance)
 * chives (IRC)
 * build-arm-10 (buildbox)
 * tbb-nightlies-master (static master)
 * gitlab-02 (gitlab)
 * check-01 (check.tpo)

 manual:

 * mandos-01 (mandos, requires crypto)
 * fsn-node-04
 * fsn-node-05

 In other words, I made the following diff in LDAP:

 {{{
 --- policy-before   2020-04-30 19:48:50.158412413 -0400
 +++ policy-after2020-04-30 19:54:15.209832522 -0400
 @@ -6,27 +6,35 @@

  dn: host=moly,ou=hosts,dc=torproject,dc=org
  host: moly
 +rebootPolicy: manual

  dn: host=pauli,ou=hosts,dc=torproject,dc=org
  host: pauli
 +rebootPolicy: justdoit

  dn: host=rude,ou=hosts,dc=torproject,dc=org
  host: rude
 +rebootPolicy: justdoit

  dn: host=alberti,ou=hosts,dc=torproject,dc=org
  host: alberti
 +rebootPolicy: justdoit

  dn: host=cupani,ou=hosts,dc=torproject,dc=org
  host: cupani
 +rebootPolicy: justdoit

  dn: host=fallax,ou=hosts,dc=torproject,dc=org
  host: fallax
 +rebootPolicy: rotation

  dn: host=eugeni,ou=hosts,dc=torproject,dc=org
  host: eugeni
 +rebootPolicy: justdoit

  dn: host=majus,ou=hosts,dc=torproject,dc=org
  host: majus
 +rebootPolicy: justdoit

  dn: host=listera,ou=hosts,dc=torproject,dc=org
  host: listera
 @@ 

Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-04-14 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  project  | Status:
 |  accepted
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-april|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-
Changes (by anarcat):

 * status:  new => accepted
 * keywords:  tpa-roadmap-march => tpa-roadmap-april
 * owner:  tpa => anarcat



Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-04-02 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 i did more work on the reboot procedures today, and rebooted the ganeti
 cluster using the reboot command. there were some issues with the initrd
 interfering with the `wait_for_boot` (now called `wait_for_ping`) checks,
 so I did some refactoring, but i'm still confused about the exception
 that's raised by Fabric in this case.

 the exception I got here is:

 {{{
 All instances migrated successfully.
 Shutdown scheduled for Thu 2020-04-02 18:30:55 UTC, use 'shutdown -c'
 to cancel.
 waiting 0 minutes for reboot to happen
 waiting up to 30 seconds for host to go down
 waiting 300 seconds for host to go up
 host fsn-node-01.torproject.org should be back online, checking uptime
 Traceback (most recent call last):
   File "./reboot", line 132, in 
 logging.getLogger(mod).setLevel('WARNING')
   File "./reboot", line 116, in main
 delay_up=args.delay_up,
   File "/usr/lib/python3/dist-packages/invoke/tasks.py", line 127, in
 __call__
 result = self.body(*args, **kwargs)
   File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/reboot.py", line
 197, in shutdown_and_wait
 res = con.run('uptime', watchers=[responder], pty=True, warn=True)
   File "", line 2, in run
   File "/usr/lib/python3/dist-packages/fabric/connection.py", line 29,
 in opens
 self.open()
   File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/__init__.py", line
 106, in safe_open
 Connection.open_orig(self)
   File "/usr/lib/python3/dist-packages/fabric/connection.py", line
 634, in open
 self.client.connect(**kwargs)
   File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349,
 in connect
 retry_on_signal(lambda: sock.connect(addr))
   File "/usr/lib/python3/dist-packages/paramiko/util.py", line 280, in
 retry_on_signal
 return function()
   File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349,
 in 
 retry_on_signal(lambda: sock.connect(addr))
 TimeoutError: [Errno 110] Connection timed out
 }}}

 maybe the exception gets generated *above* our code, in the fabric task
 handler itself, in which case it might mean we shouldn't use a @task for
 this at all, at least in our code.
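
 One way to keep that `TimeoutError` from escaping would be a small retry
 wrapper around the connection attempt; a minimal sketch (the name
 `retry_connect` is hypothetical, not part of fabric_tpa):

 ```python
 import time

 def retry_connect(connect, attempts=5, delay=2.0):
     """Call connect() until it succeeds or attempts run out,
     swallowing TimeoutError between tries (sketch only)."""
     last = None
     for _ in range(attempts):
         try:
             return connect()
         except TimeoutError as exc:
             # host may still be rebooting; wait and try again
             last = exc
             time.sleep(delay)
     # all attempts failed: re-raise the last timeout
     raise last
 ```

 Whether this belongs inside the task body or around the whole task call
 depends on where exactly Fabric raises, which is the open question above.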


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-03-25 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 Replying to [comment:2 arma]:
 > Replying to [comment:1 anarcat]:
 > > i also wonder, in general, if we should warn users about those
 reboots, as part of the reboot script.
 >
 > This idea might not at all be worth the hassle of implementing it, but
 your "rebooting x", "x is back" lines from #tor-project irc seem eminently
 automatable.

 This is getting closer to reality now. There's a KGB bot living on chives
 now (but just use the kgb-bot.torproject.org alias instead) that can be
 used for such notifications. It's not hooked into fabric just yet, but
 that's the next step. With the configuration from `/etc/kgb-client-
 tpa.conf`, one can do:

 {{{
 kgb-client --conf kgb-client-tpa.conf --relay-msg test
 }}}

 ... and that will say "test" in `#tor-project` and `#tor-bots`. This is
 obviously configurable, but the next step here is to find the best way to
 hook this into Fabric.

 I'm tempted to just shell out locally and do exactly the above to send
 notifications, as opposed to implementing a full KGB client in Python (!).
 But then again, "it's just JSON-RPC with some authentication mechanism".
 And we just use the "relay_message" bit:

 https://manpages.debian.org/unstable/kgb-bot/kgb-
 protocol.7p.en.html#relay_message_message

 ... so "how hard could it be"?
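
 The shell-out approach is only a few lines; a sketch of what the Fabric
 hook could call (helper names hypothetical, mirroring the manual command
 shown above):

 ```python
 import subprocess

 def kgb_command(message, conf="/etc/kgb-client-tpa.conf"):
     """Build the kgb-client invocation shown above."""
     return ["kgb-client", "--conf", conf, "--relay-msg", message]

 def kgb_notify(message, conf="/etc/kgb-client-tpa.conf"):
     """Relay a one-line message to IRC by shelling out to kgb-client."""
     subprocess.run(kgb_command(message, conf), check=True)
 ```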

 Fun times.


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-03-13 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Old description:

> in #31957 we have worked on automating upgrades, but that's only part of
> the problem. we also need to reboot in some situations.
>
> we have various mechanisms to do so right now:
>
>  * `tsa-misc/reboot-host` - reboot script for kvm boxes, kind of a mess,
> to be removed when we finish the kvm-ganeti migration
>  * `tsa-misc/reboot-guest` - reboot a single host. kind of a hack, but
> useful to reboot a single machine
>  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
> with `rebootPolicy=justdoit` in LDAP and reboot them with `torproject-
> reboot-many`
>  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
> with `rebootPolicy=rotation` in LDAP and reboot them with `torproject-
> reboot-many`, with a 30 minute delay between each host
>  * `ganeti-reboot-cluster` - a tool to reboot the ganeti cluster
>
> There are various problems with all this:
>
>  * the `torproject-reboot-*` scripts do not take care of
> `rebootPolicy=manual` hosts
>  * the `ganeti-reboot-cluster` script has been known to fail if a cluster
> is unbalanced
>  * the `ganeti-reboot-cluster` script currently fails when hosts talk to
> each other over IPv6 somehow (see #33412)
>  * we have 5 different ways of performing reboots, we should have just
> one script that does it all
>  * reboot-{host,guest} do not check if hosts need reboot before rebooting
> (but the multi-tool does)
>
> In short, this is kind of a mess, and we should refactor this. We should
> consider using needrestart, which knows how to reboot individual hosts.
>
> I also added a [https://github.com/xneelo/hetzner-needrestart/issues/23
> feature request to the needrestart puppet module] to expose its knowledge
> as a puppet fact, so we can use that information from PuppetDB instead of
> SSH'ing in each host and calling the dsa-* tools.

New description:

 in #31957 we have worked on automating upgrades, but that's only part of
 the problem. we also need to reboot in some situations.

 we have various mechanisms to do so right now:

  * `tsa-misc/reboot-host` - reboot script for kvm boxes, kind of a mess,
 to be removed when we finish the kvm-ganeti migration
  * `tsa-misc/reboot-guest` - reboot a single host. kind of a hack, but
 useful to reboot a single machine
  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
 with `rebootPolicy=justdoit` in LDAP and reboot them with `torproject-
 reboot-many`
  * `misc/multi-tool/torproject-reboot-rotation` - iterate over all hosts
 with `rebootPolicy=rotation` in LDAP and reboot them with `torproject-
 reboot-many`, with a 30 minute delay between each host
  * `ganeti-reboot-cluster` - a tool to reboot the ganeti cluster

 There are various problems with all this:

  * the `torproject-reboot-*` scripts do not take care of
 `rebootPolicy=manual` hosts
  * the `ganeti-reboot-cluster` script has been known to fail if a cluster
 is unbalanced
  * the `ganeti-reboot-cluster` script currently fails when hosts talk to
 each other over IPv6 somehow (see #33412)
  * we have 5 different ways of performing reboots, we should have just one
 script that does it all
  * reboot-{host,guest} do not check if hosts need reboot before rebooting
 (but the multi-tool does)

 In short, this is kind of a mess, and we should refactor this. We should
 consider using needrestart, which knows how to reboot individual hosts.

 I also added a [https://github.com/xneelo/hetzner-needrestart/issues/23
 feature request to the needrestart puppet module] to expose its knowledge
 as a puppet fact, so we can use that information from PuppetDB instead of
 SSH'ing in each host and calling the dsa-* tools.

--

Comment (by anarcat):

 that prototype is now a library, in https://gitweb.torproject.org/admin
 /tsa-misc.git/tree/fabric_tpa/reboot.py

 it can be called with a wrapper script in
 https://gitweb.torproject.org/admin/tsa-misc.git/tree/reboot

 with something like:

 {{{
 ./reboot -H fsn-node-03.torproject.org,...
 }}}

 it handles ganeti nodes, but not libvirt nodes. it therefore replaces the
 following:

  * `tsa-misc/reboot-guest`
  * `ganeti-reboot-cluster`

 it *could* also replace the following, provided that (a) a host list is
 somewhat generated out of band and (b) the 

Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-02-21 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 i wrote a simple reboot prototype that does just that, but can also be
 used as a `reboot-guest` replacement:

 https://gitweb.torproject.org/admin/tsa-misc.git/tree/ganeti-reboot-
 cluster-fabric-prototype

 it's mostly a test to see how Fabric works and is not intended to be a
 replacement for all tools just yet.

 but i find the results promising: it's much nicer to work in python with
 that stuff: errors are (mostly) well defined and it's easy to modularize
 things. for example, i originally wrote the thing to migrate fsn-node-01
 (and that worked) but then i could extend it to also reboot arbitrary
 nodes (and i rebooted gayi).


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-02-21 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 just for future reference, ganeti-reboot-cluster, as we have in our puppet
 repo, doesn't work in our cluster, because it relies on assumptions
 specific to the DSA clusters (namely that the last node is an empty
 spare). so it fails with:

 {{{
 fsn-node-03.torproject.org not empty.
 }}}

 apparently, the latest version of the script might fix that with the
 `crossmigratemany` function:

 https://salsa.debian.org/dsa-team/mirror/dsa-
 puppet/raw/master/modules/ganeti2/files/ganeti-reboot-cluster

 for now, i'll just do the reboot by hand.

 in theory, rebooting a ganeti node amounts to:

  1. migrate all the primaries off of the node: `ssh $master gnt-migrate
 $node`
  2. if it's a master, promote another master: `ssh $notmaster gnt-cluster
 master-failover` (optional, only if we can't afford having the master down
 during the reboot)
  3. reboot the node `ssh $node reboot`

 ... for each node.
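
 The per-node loop above could be sketched like this, with the commands
 copied verbatim from the steps (the helper name and its `dry_run` switch
 are hypothetical; the optional master failover step is left out):

 ```python
 import subprocess

 def reboot_ganeti_node(node, master, dry_run=True):
     """Sketch of the manual procedure above: migrate primaries off
     the node, then reboot it."""
     steps = [
         # 1. migrate all the primaries off of the node
         ["ssh", master, "gnt-migrate", node],
         # 3. reboot the node itself
         ["ssh", node, "reboot"],
     ]
     if dry_run:
         # just return the commands that would run
         return steps
     for cmd in steps:
         subprocess.run(cmd, check=True)
     return steps
 ```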

 i'm testing that procedure on fsn-node-03 now.


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-02-21 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Old description:

> in #31957 we have worked on automating upgrades, but that's only part of
> the problem. we also need to reboot in some situations.
>
> we have various mechanisms to do so right now:
>
>  * `tsa-misc/reboot-host` - reboot script for kvm boxes, kind of a mess,
> to be removed when we finish the kvm-ganeti migration
>  * `tsa-misc/reboot-guest` - reboot a single host. kind of a hack, but
> useful to reboot a single machine
>  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
> with `rebootPolicy=justdoit` in LDAP and reboot them with `torproject-
> reboot-many`
>  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
> with `rebootPolicy=rotation` in LDAP and reboot them with `torproject-
> reboot-many`, with a 30 minute delay between each host
>  * `ganeti-reboot-cluster` - a tool to reboot the ganeti cluster
>
> There are various problems with all this:
>
>  * the `torproject-reboot-*` scripts do not take care of
> `rebootPolicy=manual` hosts
>  * the `ganeti-reboot-cluster` script has been known to fail if a cluster
> is unbalanced
>  * the `ganeti-reboot-cluster` script currently fails when hosts talk to
> each other over IPv6 somehow
>  * we have 5 different ways of performing reboots, we should have just
> one script that does it all
>  * reboot-{host,guest} do not check if hosts need reboot before rebooting
> (but the multi-tool does)
>
> In short, this is kind of a mess, and we should refactor this. We should
> consider using needrestart, which knows how to reboot individual hosts.
>
> I also added a [https://github.com/xneelo/hetzner-needrestart/issues/23
> feature request to the needrestart puppet module] to expose its knowledge
> as a puppet fact, so we can use that information from PuppetDB instead of
> SSH'ing in each host and calling the dsa-* tools.

New description:

 in #31957 we have worked on automating upgrades, but that's only part of
 the problem. we also need to reboot in some situations.

 we have various mechanisms to do so right now:

  * `tsa-misc/reboot-host` - reboot script for kvm boxes, kind of a mess,
 to be removed when we finish the kvm-ganeti migration
  * `tsa-misc/reboot-guest` - reboot a single host. kind of a hack, but
 useful to reboot a single machine
  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
 with `rebootPolicy=justdoit` in LDAP and reboot them with `torproject-
 reboot-many`
  * `misc/multi-tool/torproject-reboot-rotation` - iterate over all hosts
 with `rebootPolicy=rotation` in LDAP and reboot them with `torproject-
 reboot-many`, with a 30 minute delay between each host
  * `ganeti-reboot-cluster` - a tool to reboot the ganeti cluster

 There are various problems with all this:

  * the `torproject-reboot-*` scripts do not take care of
 `rebootPolicy=manual` hosts
  * the `ganeti-reboot-cluster` script has been known to fail if a cluster
 is unbalanced
  * the `ganeti-reboot-cluster` script currently fails when hosts talk to
 each other over IPv6 somehow (see #33412)
  * we have 5 different ways of performing reboots, we should have just one
 script that does it all
  * reboot-{host,guest} do not check if hosts need reboot before rebooting
 (but the multi-tool does)

 In short, this is kind of a mess, and we should refactor this. We should
 consider using needrestart, which knows how to reboot individual hosts.

 I also added a [https://github.com/xneelo/hetzner-needrestart/issues/23
 feature request to the needrestart puppet module] to expose its knowledge
 as a puppet fact, so we can use that information from PuppetDB instead of
 SSH'ing in each host and calling the dsa-* tools.

--

Comment (by anarcat):

 filed bug #33412 about the ganeti-reboot-cluster bug


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-02-21 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 Replying to [comment:2 arma]:
 > Replying to [comment:1 anarcat]:
 > > i also wonder, in general, if we should warn users about those
 reboots, as part of the reboot script.
 >
 > This idea might not at all be worth the hassle of implementing it, but
 your "rebooting x", "x is back" lines from #tor-project irc seem eminently
 automatable.

 That's exactly what I had in mind. The trick is whether individual hosts
 should connect to IRC to issue those notifications (?!) or whether the
 calling script should. Either way, we'd need some sort of notification
 bot, which has been kind of a pain in the arse before in my experience.
 But maybe we could leverage KGB for this?

 It's one of the reasons I'm thinking of rebuilding this system in the
 first place as well...

 Thanks for the feedback!


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-02-21 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by arma):

 Replying to [comment:1 anarcat]:
 > i also wonder, in general, if we should warn users about those reboots,
 as part of the reboot script.

 This idea might not at all be worth the hassle of implementing it, but
 your "rebooting x", "x is back" lines from #tor-project irc seem eminently
 automatable.


Re: [tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

2020-02-20 Thread Tor Bug Tracker & Wiki
#33406: automate reboots
-+-
 Reporter:  anarcat  |  Owner:  tpa
 Type:  project  | Status:  new
 Priority:  Low  |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Major| Resolution:
 Keywords:  tpa-roadmap-march|  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 note that this may very well mean just removing tsa-misc/reboot-host and
 tsa-misc/reboot-guest, and documenting the multi-tool better. :)

 i just tried `./torproject-reboot-rotation` and `./torproject-reboot-
 simple` and the unattended operation isn't great... it fires up all those
 reboots, and doesn't show clearly what it did. for example, it seems to
 have queued reboots on a bunch of hosts, but it doesn't say which.

 after further inspection (with `cumin '*' 'screen -ls | grep reboot-
 job'`), i found that it has scheduled reboots on

 * static-master-fsn.torproject.org
 * cdn-backend-sunet-01.torproject.org
 * web-fsn-01.torproject.org
 * onionoo-frontend-01.torproject.org
 * orestis.torproject.org
 * nutans.torproject.org
 * chives.torproject.org
 * onionbalance-01.torproject.org
 * listera.torproject.org
 * peninsulare.torproject.org

 Most of those are okay and should return unattended. But in some cases,
 those could have been covered by a libvirt reboot (i had performed those
 before, in this case, so none were). Eventually though, that point is moot
 because we'll all be running under ganeti and will separate host and guest
 reboot procedures.

 one host is problematic in there (chives) as it needs a specific warning
 to users. maybe chives should be taken out of "justdoit" rotation...

 i also wonder, in general, if we should warn users about those reboots, as
 part of the reboot script.

 then i don't know which hosts are left to do manually, but i guess that,
 with time, nagios will let us know. it would be nice to have a scenario
 for those as well.
