[jira] [Updated] (HDDS-14502) [Website v2] [Docs] [Administrator Guide] Replacing Ozone Manager Disks
[ https://issues.apache.org/jira/browse/HDDS-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDDS-14502:
----------------------------------
    Labels: pull-request-available  (was: )

> [Website v2] [Docs] [Administrator Guide] Replacing Ozone Manager Disks
> ------------------------------------------------------------------------
>
>                 Key: HDDS-14502
>                 URL: https://issues.apache.org/jira/browse/HDDS-14502
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Wei-Chiu Chuang
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
[jira] [Updated] (HDDS-14502) [Website v2] [Docs] [Administrator Guide] Replacing Ozone Manager Disks
[ https://issues.apache.org/jira/browse/HDDS-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDDS-14502:
-----------------------------------
    Description:

[https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/ozone-manager]

If the disk containing the OM metadata directory (ozone.om.db.dirs) needs to be replaced for whatever reason, the OM metadata directory will need to be reconstructed by running ozone om --bootstrap (assuming OM HA is configured).

----

Gemini CLI suggests the following content, which looks quite reasonable to me:

----

Title: Replacing Ozone Manager Disks
Audience: Cluster Administrators
Prerequisites: Familiarity with Ozone cluster administration and Linux system administration.

----

1. Overview

Start with a brief introduction explaining the purpose of the document.
* Purpose: This guide provides the steps required to safely replace a failed disk on an Ozone Manager (OM) node.
* Impact of OM Disk Failure: The OM disk is critical because it stores the RocksDB database containing the entire object store namespace (volumes, buckets, keys) and block locations. A failure of this disk can lead to metadata loss if not handled correctly.
* Crucial Distinction: HA vs. Non-HA: The recovery procedure depends entirely on whether your OM is a single, standalone instance or part of a High-Availability (HA) Ratis-based quorum. The HA procedure is significantly safer and results in no cluster downtime. Running a standalone OM is not recommended for production environments.

----

2. Pre-flight Checks

Before starting, the administrator should:
1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to confirm which disk has failed and its mount point.
2. Identify OM Directories: Check your ozone-site.xml to confirm which Ozone directories are on the failed disk. The most important one is:
   * ozone.om.db.dirs: The primary OM metadata database.
   * Also check ozone.om.ratis.storage.dir if you have configured it to be on a separate disk.
3. Prepare the Replacement Disk: Have a new, healthy disk physically installed, formatted, and mounted on the system at the same path as the failed disk. Ensure it has the correct ownership and permissions for the user that runs the OM process.

----

3. Procedure for a Standalone (Non-HA) Ozone Manager

This is a high-risk, manual disaster recovery process that will require cluster downtime.
1. STOP THE ENTIRE CLUSTER: Shut down all clients, DataNodes, SCM, and the Ozone Manager to prevent any further state changes.
2. Attempt Data Recovery: If the failed disk is still partially readable, make a best-effort attempt to copy the contents of the ozone.om.db.dirs directory to a safe, temporary location.
3. If Recovery Fails, Restore from Backup: If the OM database files are unrecoverable, you must restore from your most recent backup. This document does not cover the backup process itself, but it is the only path to recovery in this scenario.
4. Replace and Configure Disk: Physically replace the hardware and ensure the new, empty disk is mounted at the correct path defined in ozone.om.db.dirs.
5. Restore Metadata: Copy the recovered data (from Step 2) or the restored backup data (from Step 3) to the ozone.om.db.dirs path on the new disk.
6. Restart and Verify (see the sketch after this list):
   * Start the SCM and Ozone Manager services.
   * Once the OM is running, start the DataNodes.
   * Run ozone sh volume list and other basic commands to verify that the namespace is intact and the cluster is operational.
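A minimal shell sketch of Steps 5 and 6. The DB path, the "ozone" service user, and the staging directory for the salvaged data are placeholder assumptions, and the daemon commands assume the standard ozone launcher scripts; adapt all of this to your deployment.

{code:bash}
# Assumptions (for illustration only): ozone.om.db.dirs=/data/om/db, the OM
# process runs as user "ozone", and the data salvaged in Step 2 (or restored
# from backup in Step 3) sits under /mnt/recovered/om-db.

OM_DB_DIR=/data/om/db

# Step 5: copy the recovered metadata onto the new disk and fix ownership.
cp -a /mnt/recovered/om-db/. "${OM_DB_DIR}/"
chown -R ozone:ozone "${OM_DB_DIR}"

# Step 6: restart in order: SCM first, then the OM.
ozone --daemon start scm
ozone --daemon start om

# Once the OM is up, start the DataNodes (run on each DataNode host):
ozone --daemon start datanode

# Basic verification that the namespace survived.
ozone sh volume list
{code}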
----

4. Procedure for an HA (Ratis-based) Ozone Manager

This procedure is much safer, leverages the built-in redundancy of the OM HA cluster, and does not require full cluster downtime.
1. STOP THE FAILED OM INSTANCE: On the node with the failed disk, stop only the Ozone Manager process. The other two OMs will continue operating, and one of them will remain the leader, serving client requests.
2. Replace and Configure Disk: Physically replace the hardware. Mount the new, empty disk at the path defined in ozone.om.db.dirs and ensure it has the correct ownership and permissions.
3. RE-INITIALIZE THE OM: This is the key step. Since the local database is gone, the OM needs to be "reborn" by getting a complete copy of the latest state from the current OM leader.
   * Simply starting the OM process on the repaired node with an empty DB directory will trigger this process automatically. The OM process is designed to detect that it belongs to an existing Ratis ring but has no local state.
4. START THE OM AND MONITOR (see the sketch after this list):
   * Start the Ozone Manager service on the repaired node.
   * Tail the OM's log files (.log and .out). You should see messages indicating that it is connecting to the OM HA ring and downloading a copy of the latest state from the current leader.
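A minimal shell sketch of the HA flow above. The DB path, the "ozone" service user, and the log location are placeholder assumptions; exact log file names and messages vary by Ozone version and deployment.

{code:bash}
# Assumptions (for illustration only): ozone.om.db.dirs=/data/om/db, the OM
# process runs as user "ozone", and OM logs are written under /var/log/ozone.

OM_DB_DIR=/data/om/db

# Step 1: stop only the OM on the node with the failed disk; the rest of
# the quorum keeps serving requests.
ozone --daemon stop om

# Step 2: after the replacement disk is mounted at the configured path,
# recreate the (empty) DB directory and hand it to the OM user.
mkdir -p "${OM_DB_DIR}"
chown -R ozone:ozone "${OM_DB_DIR}"

# Steps 3-4: start the OM; finding an empty DB directory, it should pull the
# latest state from the current leader. Watch the logs while it catches up.
ozone --daemon start om
tail -f /var/log/ozone/ozone-om-*.log
{code}

Once the node has caught up, a namespace smoke test (ozone sh volume list) should succeed; depending on the version, ozone admin om getserviceroles can also confirm the node has rejoined the ring as a follower.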
