[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429019#comment-16429019 ]

Andrew Wong commented on KUDU-2359:
---

Clarifying this a bit based on some offline discussion with Adar, it's less that the file vanished; rather, our current code couldn't read anything from the instance file, and thus returned a "file not found" error. Looking at the file system with strace, we found that the EIO was triggered in getdents(). A snippet of the strace here:

{quote}
{{ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, \{B38400 opost isig icanon echo ...}) = 0}}
{{ioctl(1, TIOCGWINSZ, \{ws_row=88, ws_col=357, ws_xpixel=0, ws_ypixel=0}) = 0}}
{{stat("/data/6", \{st_mode=S_IFDIR|S_ISVTX|0777, st_size=2048, ...}) = 0}}
{{open("/data/6", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3}}
{{fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)}}
{{getdents(3, 0x8ecc90, 32768) = -1 EIO (Input/output error)}}
{{open("/usr/share/locale/locale.alias", O_RDONLY) = 4}}
{{fstat(4, \{st_mode=S_IFREG|0644, st_size=2512, ...}) = 0}}
{quote}

We discussed a few options, considering potentially having more stringent checking around a mount point for failures (snooping around the file system for more info on failures), but settled on the point that, at least for start up, treating missing instance files as failed instance files would have the desired behavior.

The case for update_dirs is trickier, for the reasons mentioned above. One implementation we considered was to perhaps treat _all_ instances that returned errors upon loading as missing when running `kudu fs update_dirs`. As long as we don't do anything silly like prematurely overwrite files before knowing that the entire operation has completed, we _should_ be able to get away with this, since presumably the update will eventually fail at some point throughout the run of the tool.

What we lose out on is, rather than short-circuiting if we see a disk failure, the update tool will attempt to do stuff (read, rewrite on other drives, etc.) because we're not sure whether we're "failed" or "missing" or whatever. We could have some heuristics like, "If we notice a failed instance, definitely do not try to update, but if we see a missing disk, try to update and if we can't because the disk has actually failed, revert everything" to make the semantics better, but for now I'll see how well this works.

> tserver should allow starting with a small number of missing data dirs
> --
>
> Key: KUDU-2359
> URL: https://issues.apache.org/jira/browse/KUDU-2359
> Project: Kudu
> Issue Type: Improvement
> Components: fs, tserver
> Reporter: Todd Lipcon
> Assignee: Andrew Wong
> Priority: Major
>
> Often when a disk fails, its mount point will not come back up when the
> server is restarted. Currently, Kudu will respond to this by failing to
> restart with an error like:
> F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok()
> Bad status: Already present: FS layout already exists; not overwriting
> existing layout. See
> https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html:
> unable to create file system roots: FSManager roots already exist:
> /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal
> However, this defeats some of the advantages of the "allow single disk
> failure" work. One could use the update_data_dirs tool to remove the missing
> disk, but you'd also need to persistently change the configuration of the
> daemon, which is hard to do with a consistent configuration management.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
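The "don't overwrite anything until the whole operation is known to succeed" idea from the comment above could be sketched roughly as follows. This is only an illustration in Python, not Kudu's implementation: the function name `update_instance_files`, the `.tmp` suffix, and the opaque `payload` bytes are all assumptions made for the example; only the file name `instance` comes from the thread.

```python
import os


def update_instance_files(dirs, payload, filename="instance"):
    """Stage `payload` in every directory, then rename it into place.

    Hypothetical helper: existing files are only replaced once staging has
    succeeded in every directory, so a disk that is really failed (and not
    just missing) aborts the update before anything is overwritten.
    """
    staged = []
    try:
        for d in dirs:
            tmp = os.path.join(d, filename + ".tmp")
            # Record the path first so a half-written file is cleaned up too.
            staged.append((tmp, os.path.join(d, filename)))
            with open(tmp, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())
    except OSError:
        # "Revert everything": drop the staged copies, keep the old files.
        for tmp, _ in staged:
            try:
                os.unlink(tmp)
            except OSError:
                pass
        raise
    # Commit phase: each rename is atomic on its own, though the set of
    # renames as a whole is not.
    for tmp, final in staged:
        os.replace(tmp, final)
```

With this shape, an I/O error on any one directory surfaces during staging and the untouched directories keep their previous instance files.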
[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428892#comment-16428892 ]

Andrew Wong commented on KUDU-2359:
---

Based on this, it probably makes sense to treat missing directories as "failed" directories (i.e. each one should be marked "failed" in memory, and all tablets configured to use it should be failed and re-replicated automatically).

What does this mean for the `kudu fs update_dirs` tool, which mends missing directories? Its use would fall more on the side of fixing provisioning errors, rather than disk errors, and so it will be useful to keep around. That said, it'll take some thought on how to accommodate both missing directories as a "failed" state and missing directories as an expected state when running the tool.
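To make the "mark the directory failed, then fail every tablet that uses it" rule concrete, here is a minimal sketch. The data model is an assumption for illustration only (a plain dict from tablet ID to the data directories it is configured to use); it is not how Kudu tracks this internally.

```python
def tablets_to_fail(tablet_dirs, failed_dirs):
    """Return the IDs of tablets that use at least one failed directory.

    tablet_dirs: hypothetical mapping of tablet ID -> iterable of data dirs.
    failed_dirs: directories marked "failed" in memory, which under the
                 proposal above includes directories that are simply missing.
    """
    failed = set(failed_dirs)
    return sorted(t for t, dirs in tablet_dirs.items() if failed & set(dirs))


# Example: /data/6/kudu is missing at startup and is therefore marked failed.
layout = {
    "tablet-a": ["/data/1/kudu", "/data/6/kudu"],
    "tablet-b": ["/data/2/kudu", "/data/3/kudu"],
    "tablet-c": ["/data/6/kudu"],
}
affected = tablets_to_fail(layout, ["/data/6/kudu"])
```

Tablets that do not touch the failed directory are left alone, which is what lets the server come back up and serve the rest of its data.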
[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428879#comment-16428879 ]

Andrew Wong commented on KUDU-2359:
---

I spent some time looking at a test cluster that had a few bad disks with the following behavior in their logs. On one of the servers, which had failed in Kudu 1.5 (pre-disk-failure handling), for some time following the failures, the server would attempt to start up and fail immediately with:

{{Fatal I/O error, context: /data/6/kudu/instance}}

After a few months of this (the server remaining down), the error changed:

{{Check failed: _s.ok() Bad status: Already present: Could not create new FS layout: FSManager root is not empty: /data/1/kudu}}

This message indicates that Kudu couldn't find an instance file for a data directory. Upon examining the FS a bit more, I noticed that /data/6/kudu/instance was indeed missing, but seemingly not because the disk was removed and replaced. Rather, it seemed that the instance file, after some time on the failed disk, vanished, and this is something that we need to consider.

{{cat: /data/6/kudu/instance: No such file or directory}}
{{ls: cannot access /data/6/kudu: No such file or directory}}
{{ls: reading directory /data/6: Input/output error}}
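The three shell errors above show why "No such file or directory" on the instance file is not enough to tell a missing directory from a failed disk: the ENOENT on /data/6/kudu sits on top of an EIO on /data/6. A rough Python sketch of a probe that walks up the path to look for that underlying EIO is below. The function name, the return labels, and the injectable `listdir` parameter are assumptions for illustration; this is not Kudu code.

```python
import errno
import os


def classify_data_dir(root, instance_name="instance", listdir=os.listdir):
    """Return 'healthy', 'missing', or 'failed' for one data directory root."""
    try:
        with open(os.path.join(root, instance_name), "rb") as f:
            f.read(1)
        return "healthy"
    except FileNotFoundError:
        pass  # may be a genuinely missing file, or a failed disk in disguise
    except OSError as e:
        if e.errno == errno.EIO:
            return "failed"
        raise
    # Walk up until some ancestor can be listed; an EIO on the way means
    # the "missing" file is really a symptom of a failed disk.
    path = root
    while True:
        try:
            listdir(path)
            return "missing"
        except FileNotFoundError:
            parent = os.path.dirname(path)
            if parent == path:
                return "missing"
            path = parent
        except OSError as e:
            if e.errno == errno.EIO:
                return "failed"
            raise
```

Whether a tool then treats "missing" and "failed" the same way is a separate policy decision; the probe only reports what the file system says.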
[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418154#comment-16418154 ]

Todd Lipcon commented on KUDU-2359:
---

I think the point is that, oftentimes, after a server crash, things are configured to automatically reboot, and upon a reboot the Kudu daemon will automatically restart. So, there is no operator involvement to restart a crashed service. Or, a non-Kudu-expert operator knows enough to see that a tserver has crashed and restart the service, but isn't familiar enough to start modifying flags, etc. Additionally, maintaining a separate set of flags on different daemons in a cluster gets complex.

bq. It also begs the question, would operators even care about those failed tablets? If our re-replication story is robust enough to handle everything on its own, it could be seen as a pointless configuration. I suppose exposing it as a flag initially would give us that sort of info.

Right, I think in the common case, you want the server to come back, and then it'll notice the failed 25% of tablets, and re-replicate them elsewhere. As it is currently, it's likely the server will be down for a day or two before the operator figures out the right way to run the 'update-dirs' tool, etc., and by the time they get the server back up, everything has been re-replicated elsewhere already.
[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414249#comment-16414249 ]

Andrew Wong commented on KUDU-2359:
---

This should be doable by extending the architecture in place for the `kudu fs update_dirs` tool. The caveat here, and with the update tool, is that any tablets that are/were on the missing data directory are/should be started up in a failed state so they can be evicted and re-replicated elsewhere. For the update tool, we have operators confront this tradeoff by requiring them to specify the `--force` flag. Ideally a similar flag could be used here, so at least the mean time to recovery is gated by the time it takes to update a flag, rather than the time it takes to run `kudu fs update_dirs`.

It also begs the question, would operators even care about those failed tablets? If our re-replication story is robust enough to handle everything on its own, it could be seen as a pointless configuration. I suppose exposing it as a flag initially would give us that sort of info.
[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405844#comment-16405844 ]

Todd Lipcon commented on KUDU-2359:
---

cc [~anjuwong] for thoughts