[ https://issues.apache.org/jira/browse/KUDU-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Wong resolved KUDU-2681. ------------------------------- Fix Version/s: 1.12.0 Resolution: Duplicate This seems like a duplicate of KUDU-2992 > Account for already-running tasks before sending new ones upon processing > tablet reports > ---------------------------------------------------------------------------------------- > > Key: KUDU-2681 > URL: https://issues.apache.org/jira/browse/KUDU-2681 > Project: Kudu > Issue Type: Bug > Components: master > Affects Versions: 1.7.1 > Reporter: Andrew Wong > Priority: Major > Labels: scalability, supportability > Fix For: 1.12.0 > > > I've seen a case where the master will reschedule the same delete tablet task > for a given tablet multiple times, e.g. because it received a new tablet > report that the tablet still exists on a given tserver. This results in > significant log-spam, and ends up sending excessive RPCs to the tablet > servers. Here are some master logs demonstrating this (note the repeated > attempt numbers): > > {{I0129 05:09:43.918886 22190 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 > (server:7050) (Replica with old config index 3048677 (current committed > config index is 3054594))}} > {{W0129 05:09:43.919509 22190 catalog_manager.cc:2892] TS > 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already > present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already > in progress: opening tablet}} > {{I0129 05:09:43.919517 22190 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for > TS=90369522338b4763ae25dd0161d6e548 with a delay of 8226 ms (attempt = 10)}} > {{I0129 05:09:43.960479 22190 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 > (server:7050) (Replica with old config index 3048677 (current committed > config index is 3054594))}} > {{W0129 05:09:43.961150 22190 catalog_manager.cc:2892] TS > 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already > present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already > in progress: opening tablet}} > {{I0129 05:09:43.961158 22190 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for > TS=90369522338b4763ae25dd0161d6e548 with a delay of 8235 ms (attempt = 10)}} > {{I0129 05:09:44.016152 22190 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 > (server:7050) (Replica with old config index 3048677 (current committed > config index is 3054594))}} > {{W0129 05:09:44.016383 22190 catalog_manager.cc:2892] TS > 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already > present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already > in progress: opening tablet}} > {{I0129 05:09:44.016391 22190 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for > TS=90369522338b4763ae25dd0161d6e548 with a delay of 8206 ms (attempt = 10)}} > {{I0129 05:09:44.226428 22190 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 > (server:7050) (Replica with old config index 3048677 (current committed > config index is 3054594))}} > {{W0129 05:09:44.226753 22190 catalog_manager.cc:2892] TS > 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already > present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already > in progress: opening tablet}} > {{I0129 05:09:44.226773 22190 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for > TS=90369522338b4763ae25dd0161d6e548 with a delay of 8207 ms (attempt = 10)}} > {{I0129 05:09:44.234709 22190 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 > (server:7050) (Replica with old config index 3048677 (current committed > config index is 3054594))}} > {{W0129 05:09:44.234923 22190 catalog_manager.cc:2892] TS > 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already > present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already > in progress: opening tablet}} > {{I0129 05:09:44.234936 22190 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for > TS=90369522338b4763ae25dd0161d6e548 with a delay of 8199 ms (attempt = 10)}} > {{I0129 05:09:44.322351 22190 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 > (server:7050) (Replica with old config index 3048677 (current committed > config index is 3054594))}} > {{W0129 05:09:44.322854 22190 catalog_manager.cc:2892] TS > 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet > 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already > present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already > in progress: opening tablet}} > {{I0129 05:09:44.322860 22190 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for > TS=90369522338b4763ae25dd0161d6e548 with a delay of 8207 ms (attempt = 10)}} > > Here is a similar log for alter schema requests: > > {{I0129 05:09:50.798943 22188 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60047 > ms (attempt = 27)}} > {{W0129 05:09:50.798954 22188 catalog_manager.cc:2719] Async tablet task > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: > failed to reset TS proxy: Could not find TS for UUID > 94b8851c35914acca3ebea8c0ba6e06e}} > {{I0129 05:09:50.889256 22191 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60044 > ms (attempt = 27)}} > {{W0129 05:09:50.889267 22191 catalog_manager.cc:2719] Async tablet task > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: > failed to reset TS proxy: Could not find TS for UUID > 94b8851c35914acca3ebea8c0ba6e06e}} > {{I0129 05:09:50.928457 22188 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60001 > ms (attempt = 27)}} > {{W0129 05:09:50.928469 22188 catalog_manager.cc:2719] Async tablet task > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: > failed to reset TS proxy: Could not find TS for UUID > 94b8851c35914acca3ebea8c0ba6e06e}} > {{I0129 05:09:51.024364 22188 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60036 > ms (attempt = 27)}} > {{W0129 05:09:51.024372 22188 catalog_manager.cc:2719] Async tablet task > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: > failed to reset TS proxy: Could not find TS for UUID > 94b8851c35914acca3ebea8c0ba6e06e}} > {{I0129 05:09:51.061228 22191 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60001 > ms (attempt = 27)}} > {{W0129 05:09:51.061239 22191 catalog_manager.cc:2719] Async tablet task > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: > failed to reset TS proxy: Could not find TS for UUID > 94b8851c35914acca3ebea8c0ba6e06e}} > {{I0129 05:09:51.080605 22191 catalog_manager.cc:2700] Scheduling retry of > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60038 > ms (attempt = 27)}} > {{W0129 05:09:51.080643 22191 catalog_manager.cc:2719] Async tablet task > 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> > [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: > failed to reset TS proxy: Could not find TS for UUID > 94b8851c35914acca3ebea8c0ba6e06e}} > > Kudu should take into account existing, retrying tasks running on the master > to make sure we don't stack up in this way. > I've seen this confirmed with Kudu 1.7.1, but looking through the code at > CatalogManager::ProcessTabletReport, this seems possible on master too. -- This message was sent by Atlassian Jira (v8.20.1#820001)