[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup
[ https://issues.apache.org/jira/browse/KUDU-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke updated KUDU-2050:
------------------------------
    Labels: stability supportability  (was: )

> Avoid peer eviction during block manager startup
> ------------------------------------------------
>
>                 Key: KUDU-2050
>                 URL: https://issues.apache.org/jira/browse/KUDU-2050
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs, tserver
>    Affects Versions: 1.4.0
>            Reporter: Adar Dembo
>            Priority: Critical
>              Labels: stability, supportability
>
> In larger deployments we've observed that opening the block manager can take
> a really long time, like tens of minutes or sometimes even hours. This is
> especially true as of 1.4 where the log block manager tries to optimize
> on-disk data structures during startup.
>
> The default time to Raft peer eviction is 5 minutes. If one node is restarted
> and LBM startup takes over 5 minutes, or if all nodes are restarted and
> there's over 5 minutes of LBM startup time variance across them, the "slow"
> node could have all of its replicas evicted. Besides generating a lot of
> unnecessary work in rereplication, this effectively "defeats" the LBM
> optimizations in that it would have been equally slow (but more efficient) to
> reformat the node instead.
>
> So, let's reorder startup such that LBM startup counts towards replica
> bootstrapping. One idea: adjust FsManager startup so that tablet-meta/cmeta
> files can be accessed early to construct bootstrapping replicas, but to defer
> opening of the block manager until after that time.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
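The eviction condition described in the issue above can be illustrated with a small model. This is only a sketch and not Kudu code: the function name, the node names, and the startup durations are hypothetical. The 5-minute timeout is the default quoted in the issue. For the "all nodes restarted" case, the sketch assumes the clock starts when the fastest node finishes opening its block manager, which simplifies how Raft peers actually detect a failed follower.

```python
from typing import Dict, List

# "The default time to Raft peer eviction is 5 minutes" (from the issue text).
EVICTION_TIMEOUT_S = 5 * 60


def at_risk_nodes(lbm_startup_s: Dict[str, float],
                  all_restarted: bool,
                  timeout_s: float = EVICTION_TIMEOUT_S) -> List[str]:
    """Return the restarted nodes whose replicas could be evicted.

    lbm_startup_s maps each restarted node (hypothetical names) to the
    number of seconds its log block manager takes to open.
    """
    if not lbm_startup_s:
        return []
    # Partial restart: healthy peers start counting immediately (baseline 0).
    # Full restart: assume counting starts once the fastest node is up.
    baseline = min(lbm_startup_s.values()) if all_restarted else 0.0
    return sorted(node for node, secs in lbm_startup_s.items()
                  if secs - baseline > timeout_s)


# One node restarted, LBM startup takes 20 minutes: the node is at risk.
single = at_risk_nodes({"ts1": 20 * 60}, all_restarted=False)

# All nodes restarted: only the node more than 5 minutes behind the fastest.
full = at_risk_nodes({"ts1": 600, "ts2": 650, "ts3": 1000}, all_restarted=True)
```

In this model the proposed fix corresponds to counting LBM startup as part of replica bootstrapping, so that a slow block manager open is no longer treated as peer unavailability.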
[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup
Grant Henke updated KUDU-2050:
------------------------------
    Target Version/s:   (was: 1.8.0)
[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup
Grant Henke updated KUDU-2050:
------------------------------
    Target Version/s: 1.8.0
[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup
Grant Henke updated KUDU-2050:
------------------------------
    Target Version/s: 1.7.0  (was: 1.6.0)
[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup
Jean-Daniel Cryans updated KUDU-2050:
-------------------------------------
    Target Version/s: 1.6.0