Hi Trey,

Kudu currently requires you to remove all of the Kudu data folders on a machine when one of its disks fails. This is because Kudu effectively stripes data across all the data disks. Assuming you're not running with replication=1, your data should already have been re-replicated to your other nodes.
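In case it is useful, here is a rough Python sketch of that cleanup, not an official Kudu tool. The helper names (`parse_dirs`, `non_empty_dirs`, `clear_dirs`) are made up for illustration. The paths are the ones from your message, and I am assuming the stray "2" in front of /data/2 is a typo. It defaults to a dry run that only lists what would be removed, and it assumes the tablet server is stopped before anything is actually deleted.

```python
import shutil
from pathlib import Path


def parse_dirs(flag_value):
    """Split a comma-separated directory list (fs_data_dirs style) into paths."""
    return [Path(p.strip()) for p in flag_value.split(",") if p.strip()]


def non_empty_dirs(dirs):
    """Return the directories that exist and still contain entries."""
    return [d for d in dirs if d.is_dir() and any(d.iterdir())]


def clear_dirs(dirs, dry_run=True):
    """Remove the contents of each directory, keeping the directory itself.

    With dry_run=True nothing is deleted; the entries that would be
    removed are only collected and returned.
    """
    removed = []
    for d in non_empty_dirs(dirs):
        # Snapshot the listing first so deletion does not disturb iteration.
        for entry in sorted(d.iterdir()):
            removed.append(entry)
            if dry_run:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return removed


if __name__ == "__main__":
    # Assumption: paths as reported in this thread; adjust to your own flags.
    data_dirs = parse_dirs(
        "/data/0/kudu/tserver, /data/1/kudu/tserver, "
        "/data/2/kudu/tserver, /data/3/kudu/tserver"
    )
    for path in clear_dirs(data_dirs, dry_run=True):
        print("would remove:", path)
```

Once the dry-run output looks right, calling `clear_dirs(data_dirs, dry_run=False)` empties the directories so the tablet server can create a fresh FS layout on its next start.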

Hope this helps,

J-D

On Wed, Nov 2, 2016 at 1:47 PM, Cahill, Trey <[email protected]> wrote:

> Hi All,
>
> While running Kudu 1.0.0 with 9 tablet servers and a single master in a
> CDH 5.4.10 cluster, a drive failed for one of the tablet servers. The
> drive has since been replaced, but the tablet server will not restart.
>
> Below is the error from kudu-tserver.FATAL:
>
> “Log file created at: 2016/11/02 19:27:17
> Running on machine: i-d6d75566.intra.omneo.com
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F1102 19:27:17.451611 21593 tablet_server_main.cc:55] Check failed: _s.ok() Bad status: Already present: Could not create new FS layout: FSManager root is not empty: /data/0/kudu/tserver”
>
> The WARN and ERROR logs contain the same message.
>
> The INFO log has the following output:
>
> “Log file created at: 2016/11/02 19:27:17
> Running on machine: i-d6d75566.intra.omneo.com
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> I1102 19:27:17.448385 21593 mem_tracker.cc:140] MemTracker: hard memory limit is 4.000000 GB
> I1102 19:27:17.448578 21593 mem_tracker.cc:142] MemTracker: soft memory limit is 2.400000 GB
> I1102 19:27:17.449854 21593 tablet_server_main.cc:54] Initializing tablet server...
> I1102 19:27:17.450325 21593 hybrid_clock.cc:177] HybridClock initialized. Resolution in nanos?: 1 Wait times tolerance adjustment: 1.0005 Current error: 143827
> I1102 19:27:17.451561 21593 server_base.cc:168] Could not load existing FS layout: Not found: /data/0/kudu/tserver-wal/instance: No such file or directory (error 2)
> I1102 19:27:17.451573 21593 server_base.cc:169] Creating new FS layout
> F1102 19:27:17.451611 21593 tablet_server_main.cc:55] Check failed: _s.ok() Bad status: Already present: Could not create new FS layout: FSManager root is not empty: /data/0/kudu/tserver”
>
> Fs_wal_dir is set to “/data/0/kudu/tserver” and fs_data_dirs is set to
> “/data/0/kudu/tserver, /data/1/kudu/tserver, 2/data/2/kudu/tserver,
> /data/3/kudu/tserver” for every tablet server.
>
> I searched, but could not seem to find a way to recover/start the tablet
> server.
>
> Any thoughts?
>
> Let me know if you need more information or such.
>
> Thanks,
>
> Trey
