> On 22.04.2010 16:36, JR Richardson wrote:
>> I had a terrible time migrating containers from one hardware node to
>> another. I ran into the ssh bug with the public key not being
>> accepted error:
>>
>> error: RSA_public_decrypt failed: error:0407006A:lib(4):func(112):reason(106)
>>
>> It was so random, migrating 50 containers, some would work and some
>> migrations would prompt for the ssh password, truly random. I had to
>> migrate some containers several times before the ssh session would
>> establish and complete.
>>
>> The strange thing is while migration between HN 1 & 2 was very
>> problematic, migration between HN 4 & 5 worked as expected. All 4
>> nodes were built on the same day, same hardware and all have matching
>> package versions and configuration.
>>
>
> That points to hardware problems, either in your HNs or your network. I
> once had to debug randomly failing SSH sessions, and it turned out to be
> a faulty ethernet switch which corrupted data, but auto-corrected the
> checksums of the ethernet packets, and this caused SSH to abort in
> various stages.
>
> Can you use ssh reliably between HN 1+2? If yes, try running scp of a
> large file in both directions 100 times or so. The best idea is to copy
> the same file back and forth, so errors will accumulate. Compute the
> checksum of the file after each copy. I had SSH corrupt data in a tunnel
> and the SSH connection stayed stable although the tunneled data was
> clearly corrupt. Admittedly this was due to a faulty ethernet switch,
> but still I had expected SSH to abort the connection or at least ensure
> the integrity of my data, especially because the corruption was hardware
> failure and not a determined attacker.
>
> And yes, sometimes one machine in a batch of identical machines corrupts
> data, or one ethernet port corrupts data and all others are OK.
>
> Regards,
> Carl-Daniel
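For reference, the back-and-forth copy test suggested above could be scripted roughly like this. It is only a sketch under a few assumptions: key-based ssh login already works between the two nodes (otherwise scp will stop and prompt), SHA-256 stands in for "the checksum", and the function names (sha256_of, scp_round_trip, ping_pong) are made up for illustration. The only scp option used is -q (quiet).

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scp_round_trip(local: Path, remote: str) -> None:
    """Push `local` to `remote` (host:path form), then pull it back over `local`."""
    subprocess.run(["scp", "-q", str(local), remote], check=True)
    subprocess.run(["scp", "-q", remote, str(local)], check=True)


def ping_pong(
    local: Path,
    remote: str,
    rounds: int = 100,
    round_trip: Callable[[Path, str], None] = scp_round_trip,
) -> List[int]:
    """Copy the same file back and forth and checksum it after every round.

    Returns the 1-based round numbers in which the checksum changed.
    Because the same file is reused, any corruption stays in the file;
    the reference digest is updated after a change so that each new
    corruption event is reported once.
    """
    expected = sha256_of(local)
    changed_in = []
    for i in range(1, rounds + 1):
        round_trip(local, remote)
        actual = sha256_of(local)
        if actual != expected:
            changed_in.append(i)
            expected = actual
    return changed_in
```

Called as, say, ping_pong(Path("/tmp/testfile.bin"), "root@hn2:/tmp/testfile.bin") with a large file (the host name and paths are placeholders), an empty list means the file survived every round trip. A failed scp raises subprocess.CalledProcessError, which would separate "the session aborted" from "the data was silently corrupted".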
While I was researching the ssh bug (100 posts or so), I came across one
that mentioned a hardware error on a Sun workstation. So I started
testing and isolated the issue to HN1. No hardware errors are reported
on the server's Ethernet interfaces or on the connected switch port.
During the next maintenance window I will migrate all the production
containers off, bring the server into the lab and hammer on it.

Thanks for reporting your experience, I'm leaning toward a hardware
error as well.

JR
--
JR Richardson
Engineering for the Masses
_______________________________________________
Users mailing list
https://openvz.org/mailman/listinfo/users
