Yes. I think a shared Google Doc is preferred, and you can open a JIRA ticket to track it.
Luca Toscano <toscano.l...@gmail.com> wrote on Mon, Jul 20, 2020 at 21:10:

> Hi Evans!
>
> What is the best medium to use for the documentation/comments? A
> shared gdoc or something similar?
>
> Luca
>
> On Thu, Jul 16, 2020 at 5:11 PM Evans Ye <evan...@apache.org> wrote:
> >
> > One thing I think would be great to have is a doc version of the steps for upgrade and rollback. The benefits:
> > 1. If anything unexpected happens during automation, you have folks who can quickly understand what's going on and get into the investigation.
> > 2. Sharing the doc with us helps other OSS users do the migration. Env-specific things are fine; we can leave comments on them. At least all the other users can get a high-level view of a proven solution, and then they can go and find out the rest of the pieces by themselves.
> >
> > For automation, I suggest splitting it up into several stages, and applying some validation steps (manual is OK) before kicking off the next stage.
> >
> > Best,
> > Evans
> >
> > Luca Toscano <toscano.l...@gmail.com> wrote on Wed, Jul 15, 2020 at 9:07 PM:
> >>
> >> Hi everybody,
> >>
> >> I didn't get the time to work on this until recently, but I finally
> >> managed to have a reliable procedure to upgrade from CDH to Bigtop 1.4
> >> and rollback if needed. The assumptions are:
> >>
> >> 1) It is ok to have (limited) cluster downtime.
> >> 2) Rolling upgrade is not needed.
> >> 3) QJM is used.
> >>
> >> The procedure is listed in these two scripts:
> >>
> >> https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hadoop/stop-cluster.py
> >> https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hadoop/change-distro-from-cdh.py
> >>
> >> The code is highly dependent on my working environment, but it should
> >> be clear to follow when writing a tutorial about how to migrate from
> >> CDH to Bigtop.
> >> All the suggestions given by this mailing list were
> >> really useful to reach a solution!
> >>
> >> My next steps will be:
> >>
> >> 1) Keep testing Bigtop 1.4 (finalize HDFS upgrade, run more hadoop
> >> jobs, test Hive 2, etc.).
> >> 2) Upgrade the production Hadoop cluster to Bigtop 1.4 on Debian 9
> >> (HDFS 2.6.0-cdh -> 2.8.5).
> >> 3) Upgrade to Bigtop 1.5 on Debian 9 (HDFS 2.8.5 -> 2.10).
> >> 4) Upgrade to Debian 10.
> >>
> >> With automation it shouldn't be very difficult; I'll report progress once made.
> >>
> >> Thanks a lot!
> >>
> >> Luca
> >>
> >> On Mon, Apr 13, 2020 at 9:25 AM Luca Toscano <toscano.l...@gmail.com> wrote:
> >> >
> >> > Hi Evans,
> >> >
> >> > thanks a lot for the feedback, it was exactly what I needed. "The
> >> > simpler the better" is definitely good advice in this use case; I'll
> >> > try another rollout/rollback this week and report back :)
> >> >
> >> > Luca
> >> >
> >> > On Thu, Apr 9, 2020 at 8:09 PM Evans Ye <evan...@apache.org> wrote:
> >> > >
> >> > > Hi Luca,
> >> > >
> >> > > Thanks for reporting back and letting us know how it goes.
> >> > > I don't have exactly the HDFS with QJM HA upgrade experience. The experience I had was a 0.20 non-HA upgrade to 2.0 non-HA, then enabling QJM HA, which was back in 2014.
> >> > >
> >> > > Regarding rollback, I think you're right:
> >> > >
> >> > > it is possible to rollback to HDFS’ state before the upgrade in case of unexpected problems.
> >> > >
> >> > > My previous experience is the same: the rollback is merely a snapshot from before the upgrade. The further you've gone, the more data a rollback loses... Our runbook is that if our sanity check fails during the upgrade downtime, we perform the rollback immediately.
> >> > >
> >> > > Regarding that FSImage hole issue, I've experienced it as well.
> >> > > I managed to fix it by manually editing the FSImage with the offline image viewer[1] and deleting that missing editLog in the FSImage.
> >> > > That actually brought my cluster back with a small number of missing blocks.
> >> > >
> >> > > Our experience says that the more steps there are, the higher the chance that the upgrade fails. We did fine in a dozen rounds of testing on DEV and STAGING clusters, but still got missing blocks when upgrading Production...
> >> > >
> >> > > The suggestion is to get your production in good shape first (the fewer decommissioned or offline DNs and disk failures, the better).
> >> > > Also, maybe you can switch to non-HA mode and do the upgrade to simplify things?
> >> > >
> >> > > Not much help, but please let us know about any progress.
> >> > > One last thing: have you reached out to the Hadoop community? The authors should know the most :)
> >> > >
> >> > > - Evans
> >> > >
> >> > > [1] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html
> >> > >
> >> > > Luca Toscano <toscano.l...@gmail.com> wrote on Wed, Apr 8, 2020 at 21:03:
> >> > >>
> >> > >> Hi everybody,
> >> > >>
> >> > >> most of the bugs/issues/etc. that I found while upgrading from CDH 5
> >> > >> to Bigtop 1.4 are fixed; I am now testing (as also suggested here)
> >> > >> upgrade/rollback procedures for HDFS (all written in
> >> > >> https://phabricator.wikimedia.org/T244499, will add documentation
> >> > >> about this at the end I promise).
> >> > >>
> >> > >> I initially followed [1][2] in my Test cluster, choosing the Rolling
> >> > >> upgrade, but when I tried to rollback (days after the initial
> >> > >> upgrade) I ended up in an inconsistent state and I wasn't able to
> >> > >> recover the previous HDFS state.
> >> > >> I didn't save the exact error
> >> > >> messages but the situation was more or less the following:
> >> > >>
> >> > >> FS-Image-rollback (created at the time of the upgrade) - up to transaction X
> >> > >> FS-Image-current - up to transaction Y, with Y = X + 10000 (number
> >> > >> totally made up for the example)
> >> > >> QJM cluster: first available transaction Z = X + 10000 + 1
> >> > >>
> >> > >> When I tried a rolling rollback, the Namenode complained about a hole
> >> > >> in the transaction log, namely at X + 1, so it refused to start. I
> >> > >> tried to force a regular rollback, but the Namenode refused again,
> >> > >> saying that there was no available FS Image to roll back to. I checked
> >> > >> the Hadoop code and indeed the Namenode saves the FS Image with a
> >> > >> different naming/path in case of a rolling upgrade or a regular
> >> > >> upgrade. Both cases make sense, especially the first one, since there
> >> > >> was indeed a hole between the last transaction of the
> >> > >> FS-Image-rollback and the first available transaction to replay on the
> >> > >> QJM cluster. I chose the rolling upgrade initially since it was
> >> > >> appealing: it promises to bring the Namenodes back to their previous
> >> > >> versions while keeping the data modified between upgrade and rollback.
> >> > >>
> >> > >> I then found [3], in which it is said that with QJM everything is more
> >> > >> complicated, and a regular rollback is the only option available. What
> >> > >> I think this means is that, due to the Edit log being spread among
> >> > >> multiple nodes, a rollback that keeps data between upgrade and rollback
> >> > >> is not available, so worst case scenario the data modified during that
> >> > >> timeframe is lost. Not a big deal in my case, but I want to triple
> >> > >> check with you if this is the correct interpretation or if there is
> >> > >> another tutorial/guide/etc.
> >> > >> that I haven't read with a different
> >> > >> procedure :)
> >> > >>
> >> > >> Is my interpretation correct? If not, is there anybody with experience
> >> > >> in HDFS upgrades that could shed some light on the subject?
> >> > >>
> >> > >> Thanks in advance!
> >> > >>
> >> > >> Luca
> >> > >>
> >> > >> [1] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Upgrade_and_Rollback
> >> > >> [2] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html
> >> > >> [3] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#HDFS_UpgradeFinalizationRollback_with_HA_Enabled
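
The transaction-log hole described in the earliest message of this thread (rollback image up to X, first edit available from the QJM at X + 10000 + 1) can be illustrated with a small Python sketch. It only models the arithmetic of the gap. It is not how the Namenode implements the check, and the function name `find_edit_log_gap` and the concrete numbers are made up for illustration.

```python
from typing import Optional


def find_edit_log_gap(image_last_txid: int,
                      first_available_edit_txid: int) -> Optional[range]:
    """Return the missing transaction ids between an FS image and the
    first edit that can still be replayed, or None if there is no gap.

    Hypothetical helper: an image that ends at transaction N can only be
    brought forward if the edit log still contains transaction N + 1.
    """
    expected_next = image_last_txid + 1
    if first_available_edit_txid <= expected_next:
        return None
    return range(expected_next, first_available_edit_txid)


# Made-up numbers mirroring the example in the thread.
X = 5000          # last transaction in FS-Image-rollback
Y = X + 10000     # last transaction in FS-Image-current
Z = Y + 1         # first transaction still available on the QJM cluster

# Rolling rollback: the old image needs X + 1, which is no longer available.
gap = find_edit_log_gap(X, Z)

# The current image lines up with the edit log, so there is no hole.
no_gap = find_edit_log_gap(Y, Z)

if gap is not None:
    print(f"hole in the edit log: transactions {gap.start}..{gap.stop - 1} "
          f"({len(gap)} missing)")
```

With these numbers the rollback image is missing every transaction from X + 1 through Y, which matches the Namenode complaining about a hole at X + 1.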