Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-02-03 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  closed
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:  fixed
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-
Changes (by anarcat):

 * status:  merge_ready => closed
 * resolution:   => fixed


Comment:

 box rebooted fine last week, disk arrays are all sync'd and working (no
 smartd errors), so hopefully we're done with the disk problems here.
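
 for the record, a quick way to double-check that state (a minimal sketch
 using standard mdadm/smartctl commands and the device names from this
 ticket, not a transcript of what was actually run):

 {{{
 # all arrays should show [UU] and no recovery in progress
 cat /proc/mdstat
 # array state should be "clean", not "degraded"
 mdadm --detail /dev/md2
 # overall health and the ATA error log on one of the HDDs
 smartctl -H /dev/sda
 smartctl -l error /dev/sda
 }}}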


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-31 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  merge_ready
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-
Changes (by anarcat):

 * status:  assigned => merge_ready


Comment:

 resync seems to be going fine, i'll do one last reboot in about an hour
 and then we're clear here.
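
 (to watch that resync in the meantime, something like this would do; a
 sketch using standard commands, not taken from the ticket itself:)

 {{{
 # recovery percentage, speed and estimated finish time
 watch -n 60 cat /proc/mdstat
 # or a single snapshot of the array state
 mdadm --detail /dev/md2 | grep -i -e 'state' -e 'rebuild'
 }}}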


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-30 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 > still todo: restore vg_ganeti_hdd over LUKS.

 done. let's see how the resync goes now.


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-30 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 they rechecked the disks and everything is back online. re-added the NVMe
 disks to the array, which is already rebuilt, and rebuilt an md2 array on
 the HDDs, which will be done in ~13h.

 still todo: restore vg_ganeti_hdd over LUKS.
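
 that restore would look roughly like this (a sketch only: key handling and
 exact options aren't recorded in the ticket, and the crypt_dev_md2 name
 just mirrors the crypt_dev_md1 naming used in the rescue commands further
 down):

 {{{
 # LUKS on top of the rebuilt HDD array, then LVM on top of that
 cryptsetup luksFormat /dev/md2
 cryptsetup luksOpen /dev/md2 crypt_dev_md2
 pvcreate /dev/mapper/crypt_dev_md2
 vgcreate vg_ganeti_hdd /dev/mapper/crypt_dev_md2
 }}}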


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-30 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 aaand the problem was that an NVMe drive was missing! not sure why it
 wouldn't boot properly, but an update-initramfs later and it would boot
 correctly. ping'd hetzner about that again.
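
 (the exact invocation isn't in the ticket; presumably something along
 these lines, run from the booted system or the rescue chroot:)

 {{{
 # regenerate the initramfs so the md/crypto/NVMe bits are picked up again
 update-initramfs -u -k all
 # sanity check: the new initrd should contain the nvme and cryptsetup parts
 lsinitramfs /boot/initrd.img-$(uname -r) | grep -e nvme -e cryptsetup | head
 }}}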


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-30 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 new HDDs installed, but the box doesn't get past cryptsetup. for some
 reason the initrd is giving me this on the SSH prompt:

 {{{
 Error: Timeout reached while waiting for askpass.
 }}}

 and mandos doesn't seem to be able to bring the machine back online
 either.

 rebooting into rescue mode and starting the good old dance of:

 {{{
 cryptsetup luksOpen /dev/md1 crypt_dev_md1
 vgchange -a y
 mount /dev/vg_ganeti/root /mnt
 mount /dev/md0 /mnt/boot
 for dev in dev sys proc run ; do mount -o bind /$dev /mnt/$dev ; done
 hostname fsn-node-03
 chroot /mnt
 }}}


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-30 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 hetzner moved the NVMe drives into a new server and are setting up new
 HDDs in there as well.


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-30 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 and we got more SMART email messages about this server. now it's also sdb
 that's complaining.

 i also noticed that sdb has been complaining even before i opened that
 ticket with Hetzner. in fact, what triggered me to open that ticket was the
 second smartd email, which i mistakenly thought was caused by sda errors:
 the second email we got was about sdb! so changing sda wouldn't have
 solved that problem.

 i commented on hetzner's ticket with the following:

 > We're still having trouble with this server.
 >
 > After a full RAID-1 resync, I rebooted the box, but the new disk was
 > kicked out of the array, and not detected as having a RAID superblock:
 >
 > {{{
 > root@fsn-node-03:~# mdadm -E /dev/sda1
 > mdadm: No md superblock detected on /dev/sda1.
 > }}}
 >
 > When I started the array and readded the disk, it started a full resync
 > again:
 >
 > {{{
 > root@fsn-node-03:~# mdadm --run /dev/md2
 > mdadm: started array /dev/md/2
 > root@fsn-node-03:~# mdadm /dev/md2 -a /dev/sda1
 > mdadm: added /dev/sda1
 > root@fsn-node-03:~# cat /proc/mdstat
 > Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
 > md2 : active raid1 sda1[2] sdb1[1]
 >   9766302720 blocks super 1.2 [2/1] [_U]
 >   [>]  recovery =  0.0% (274048/9766302720) finish=593.9min speed=274048K/sec
 >   bitmap: 0/73 pages [0KB], 65536KB chunk
 >
 > md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
 >   937026560 blocks super 1.2 [2/2] [UU]
 >   bitmap: 1/7 pages [4KB], 65536KB chunk
 >
 > md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
 >   523264 blocks super 1.2 [2/2] [UU]
 >
 > unused devices: 
 > }}}
 >
 > Furthermore, I just noticed that we have received smartd notifications
 > about the *OTHER* hard drive (sdb):
 >
 > > Date: Wed, 29 Jan 2020 23:46:38 +
 > >
 > > [...]
 > >
 > > Device: /dev/sdb [SAT], ATA error count increased from 4 to 5
 > >
 > > Device info:
 > > TOSHIBA MG06ACA10TEY, S/N:[...], WWN:[...], FW:0103, 10.0 TB
 >
 > We have also seen errors from sdb, the second drive, *before* we opened
 > this ticket. That was my mistake: I thought the errors were both from
 > the same disk; I couldn't imagine both disks were giving out errors.
 >
 > At this point, I am wondering if it might not be better to just
 > commission a completely new machine rather than try to revive this one. I
 > get the strong sense something is wrong with the disk controller on that
 > one. We have two other PX62 servers with the identical setup
 > (fsn-node-01/PX62-NVMe #[...], fsn-node-02/PX62-NVMe #[...]).
 > Both are in production and neither shows the same disk problems.
 >
 > In any case, I can't use the box like this: its (software) RAID array
 > doesn't survive reboots, which tells me there's something very wrong with
 > this machine.
 >
 > Could you look into this again please?

 So I think that, worst case, they just swap the machine and we reinstall.


Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-29 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+-

Comment (by anarcat):

 they tested the drive (probably a SMART short self-test) and didn't find
 anything. they swapped the cable and gave us back the box.

 after boot, the raid array *again* did not come up. i restarted it and re-
 added the extra drive. during the sync, we got an ATA error again, but this
 time it seems the drives didn't notice because smartd didn't send us
 email. i only noticed it through a dmesg tail:

 {{{
 [Jan29 20:00] microcode: microcode updated early to revision 0xca, date = 2019-10-03
 [  +0.00] Linux version 4.19.0-6-amd64 (debian-ker...@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
 [...]
 [Jan29 20:18] md/raid1:md2: active with 1 out of 2 mirrors
 [  +0.050102] md2: detected capacity change from 0 to 1693985280
 [Jan29 20:19] md: recovery of RAID array md2
 [...]
 [Jan29 22:33] ata1.00: exception Emask 0x50 SAct 0xe000 SErr 0x480900 action 0x6 frozen
 [  +0.14] ata1.00: irq_stat 0x0800, interface fatal error
 [  +0.07] ata1: SError: { UnrecovData HostInt 10B8B Handshk }
 [  +0.09] ata1.00: failed command: WRITE FPDMA QUEUED
 [  +0.13] ata1.00: cmd 61/00:68:80:0c:c4/08:00:b8:00:00/40 tag 13 ncq dma 1048576 out
           res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
 [  +0.11] ata1.00: status: { DRDY }
 [  +0.05] ata1.00: failed command: WRITE FPDMA QUEUED
 [  +0.12] ata1.00: cmd 61/00:70:80:14:c4/03:00:b8:00:00/40 tag 14 ncq dma 393216 out
           res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
 [  +0.11] ata1.00: status: { DRDY }
 [  +0.06] ata1.00: failed command: WRITE FPDMA QUEUED
 [  +0.11] ata1.00: cmd 61/80:78:80:17:c4/04:00:b8:00:00/40 tag 15 ncq dma 589824 out
           res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
 [  +0.11] ata1.00: status: { DRDY }
 [  +0.08] ata1: hard resetting link
 [  +0.375942] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 [  +0.001613] ata1.00: configured for UDMA/133
 [  +0.26] ata1: EH complete
 }}}
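
 since smartd stayed quiet this time, here is a hedged sketch of how the
 drive-side view of those bus errors could be cross-checked (standard
 smartctl/dmesg queries, not commands quoted from the ticket; assuming ata1
 maps to /dev/sda here):

 {{{
 # ATA error log as the drive itself recorded it
 smartctl -l error /dev/sda
 # CRC / interface error attributes usually tick up on cabling problems
 smartctl -A /dev/sda | grep -i -e crc -e udma
 # kernel-side view with human-readable timestamps, as quoted above
 dmesg -T | grep -i ata1
 }}}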


[tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

2020-01-29 Thread Tor Bug Tracker & Wiki
#33098: fsn-node-03 disk problems
-+-
 Reporter:  anarcat  |  Owner:  anarcat
 Type:  defect   | Status:  assigned
 Priority:  High |  Milestone:
Component:  Internal Services/Tor Sysadmin Team  |Version:
 Severity:  Blocker  |   Keywords:
Actual Points:   |  Parent ID:
   Points:   |   Reviewer:
  Sponsor:   |
-+-
 for some reason, the HDD on fsn-node-03 is having SMART errors. I
 originally filed this ticket with Hetzner:

 > yesterday we got errors from the SMART daemon on this host, looking like
 > this:
 >
 > From: root 
 > Subject: SMART error (ErrorCount) detected on host: fsn-node-03
 > To: r...@fsn-node-03.torproject.org
 > Date: Tue, 28 Jan 2020 23:35:35 +
 >
 > This message was generated by the smartd daemon running on:
 >
 >host name:  fsn-node-03
 >DNS domain: torproject.org
 >
 > The following warning/error was logged by the smartd daemon:
 >
 > Device: /dev/sda [SAT], ATA error count increased from 0 to 1
 >
 > Device info:
 > TOSHIBA MG06ACA10TEY, S/N:..., WWN:, FW:0103, 10.0 TB
 >
 > For details see host's SYSLOG.
 >
 > You can also use the smartctl utility for further investigation.
 > Another message will be sent in 24 hours if the problem persists.
 >
 > Another such email was triggered an hour later as well.
 >
 > The RAID array the disk is on triggered a rebuild as well, somehow. The
 > following messages showed up in dmesg:
 >
 > [Jan28 20:44] md: resync of RAID array md2
 > [Jan28 22:20] ata2.00: exception Emask 0x50 SAct 0x4000 SErr 0x480900 action 0x6 frozen
 > [  +0.004419] ata2.00: irq_stat 0x0800, interface fatal error
 > [  +0.001489] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
 > [  +0.000781] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000785] ata2.00: cmd 61/00:70:80:52:f6/05:00:ec:00:00/40 tag 14 ncq dma 655360 out
 >          res 40/00:70:80:52:f6/00:00:ec:00:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001600] ata2.00: status: { DRDY }
 > [  +0.000801] ata2: hard resetting link
 > [  +0.310126] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 > [  +0.088155] ata2.00: configured for UDMA/133
 > [  +0.31] ata2: EH complete
 > [Jan28 23:27] ata1.00: exception Emask 0x50 SAct 0x1c00 SErr 0x280900 action 0x6 frozen
 > [  +0.004338] ata1.00: irq_stat 0x0800, interface fatal error
 > [  +0.001815] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
 > [  +0.000772] ata1.00: failed command: READ FPDMA QUEUED
 > [  +0.000738] ata1.00: cmd 60/00:50:00:3b:b1/05:00:47:01:00/40 tag 10 ncq dma 655360 in
 >          res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001512] ata1.00: status: { DRDY }
 > [  +0.000793] ata1.00: failed command: READ FPDMA QUEUED
 > [  +0.000727] ata1.00: cmd 60/00:58:00:40:b1/05:00:47:01:00/40 tag 11 ncq dma 655360 in
 >          res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001534] ata1.00: status: { DRDY }
 > [  +0.000769] ata1.00: failed command: READ FPDMA QUEUED
 > [  +0.000720] ata1.00: cmd 60/00:60:00:45:b1/01:00:47:01:00/40 tag 12 ncq dma 131072 in
 >          res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001453] ata1.00: status: { DRDY }
 > [  +0.000778] ata1: hard resetting link
 > [  +0.556198] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 > [  +0.001780] ata1.00: configured for UDMA/133
 > [  +0.37] ata1: EH complete
 > [Jan28 23:32] perf: interrupt took too long (2518 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
 > [Jan29 00:14] ata2.00: exception Emask 0x50 SAct 0x1c00 SErr 0x480900 action 0x6 frozen
 > [  +0.004173] ata2.00: irq_stat 0x0800, interface fatal error
 > [  +0.001996] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
 > [  +0.000737] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000729] ata2.00: cmd 61/00:d0:00:62:0e/05:00:86:01:00/40 tag 26 ncq dma 655360 out
 >          res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001486] ata2.00: status: { DRDY }
 > [  +0.000854] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000718] ata2.00: cmd 61/00:d8:00:67:0e/05:00:86:01:00/40 tag 27 ncq dma 655360 out
 >          res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001478] ata2.00: status: { DRDY }
 > [  +0.000884] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000736] ata2.00: cmd 61/00:e0:00:6c:0e/01:00:86:01:00/40 tag 28