[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307137#comment-16307137 ]

Christopher Tubbs commented on ACCUMULO-4751:
---------------------------------------------

Just a quick note in case the git commit history is confusing:

* this patch was applied to the 1.8 branch as commit 0a7c465d9eace6b8c9dccdfd37a92a0e1bcfbd52
* a nearly identical commit was applied to the master branch as 3161b98f05615844caf6980c8b5922375e92bd32 (with slight differences due to unrelated changes with table IDs being strongly typed in that branch)
* branch 1.8 was merged into the master branch as 984dbf54b23634a4a461a54de07ca016f9430af0 to ensure that 1.8 could be cleanly merged into master in the future (this merge commit did not change any code in the master branch, but it did ensure that 0a7c465d9eace6b8c9dccdfd37a92a0e1bcfbd52 is included in the master branch's history)

> Some WALs don't replicate due to lacking a createdTime entry
> ------------------------------------------------------------
>
>                 Key: ACCUMULO-4751
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4751
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.7.3, 1.8.1
>            Reporter: Adam J Shook
>            Assignee: Adam J Shook
>              Labels: pull-request-available
>             Fix For: 1.8.2, 2.0.0
>
>         Attachments: repl_logs.txt
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> From what I can tell, the below error is thrown when no data for a particular
> table is written to a WAL, but the file is closed. This would be because the
> {{Status}} entry from the {{StatusUtil}} for {{fileClosed}} is pre-built and
> therefore does not have a {{createdTime}}. This prevents a WAL from being
> replicated until a {{createdTime}} entry is added manually.
> From the Accumulo master:
> {code}
> Status record ([begin: 0 end: 0 infiniteEnd: true closed: true]) for
> hdfs://namenode:9000/accumulo/wal/tserver.example.com+31732/f922df9c-3ffc-49ee-8d0c-261c7a05fea2
> in table 7l was written to metadata table which lacked createdTime
> {code}
> There are two solutions I have in mind:
> 1. Update the {{StatusUtil}} such that every returned {{Status}} object sets
> the {{createdTime}} to {{System.currentTimeMillis}} if not explicitly given.
> 2. Update the Accumulo Master to set the {{createdTime}} to the WAL's
> modification time in HDFS if the WAL is closed but there is no
> {{createdTime}}.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
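The two solutions proposed in the issue description can be sketched roughly as follows. This is only an illustration, not the patch that was committed: {{StatusRecord}}, {{fileClosed(LongSupplier)}} and {{withFallback}} are hypothetical stand-ins written for this example, and the real Accumulo {{Status}} and {{StatusUtil}} classes are not used. The WAL modification time is passed in as a plain {{long}} on the assumption that the caller has already looked it up in HDFS.

```java
import java.util.function.LongSupplier;

public class CreatedTimeSketch {

    /** Hypothetical stand-in for a replication Status record; not Accumulo's class. */
    static final class StatusRecord {
        final boolean closed;
        final Long createdTime; // null models a record written without a createdTime

        StatusRecord(boolean closed, Long createdTime) {
            this.closed = closed;
            this.createdTime = createdTime;
        }

        boolean hasCreatedTime() {
            return createdTime != null;
        }
    }

    /** Solution 1 (sketch): build the closed record per call so it always carries a createdTime. */
    static StatusRecord fileClosed(LongSupplier clock) {
        return new StatusRecord(true, clock.getAsLong());
    }

    /**
     * Solution 2 (sketch): when reading a closed record that lacks a createdTime,
     * fall back to the WAL's modification time instead of refusing to replicate it.
     */
    static StatusRecord withFallback(StatusRecord status, long walModificationTime) {
        if (status.closed && !status.hasCreatedTime()) {
            return new StatusRecord(true, walModificationTime);
        }
        return status;
    }

    public static void main(String[] args) {
        // A pre-built closed record, as described in the issue: no createdTime.
        StatusRecord prebuilt = new StatusRecord(true, null);
        System.out.println("pre-built has createdTime: " + prebuilt.hasCreatedTime());

        // Solution 1: stamp the record when it is created.
        StatusRecord stamped = fileClosed(System::currentTimeMillis);
        System.out.println("stamped has createdTime: " + stamped.hasCreatedTime());

        // Solution 2: repair the record on the reading side (value here is arbitrary).
        StatusRecord repaired = withFallback(prebuilt, 1512000000000L);
        System.out.println("repaired has createdTime: " + repaired.hasCreatedTime());
    }
}
```

The first approach fixes the record at the source, while the second acts as a catch-all on the reading side; the two are not mutually exclusive.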
[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281030#comment-16281030 ]

Adam J Shook commented on ACCUMULO-4751:
----------------------------------------

[~elserj] I am not certain we even need the block of code in {{DatafileManager}} that updates metadata for unused WALs. The solutions I see now would be to:

1. Use {{StatusUtil.openWithUnknownLength(System.currentTimeMillis())}} here to add a {{createdTime}} to the metadata
2. Add handling logic in the {{StatusMaker}} to set the {{createdTime}} of the WAL to its HDFS timestamp (#2 above)
3. Delete this block entirely (it might be needed at other times? I don't know the pipeline at this level well enough to know if we need to flag unused WALs)
[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280957#comment-16280957 ]

Adam J Shook commented on ACCUMULO-4751:
----------------------------------------

I can confirm that this is the case. Looking at the timestamps from the TServer and the Master, you can see the Master remove the replication entries after replication has completed, but then [1] gets called via {{Tablet#minorCompact}} minutes later and inserts an entry into the metadata table without a {{createdTime}}, which is then the latest entry in the table.

[1] https://github.com/apache/accumulo/blob/rel/1.8.1/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/DatafileManager.java#L422
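The sequence described in the comment above (entries removed once replication completes, followed minutes later by a write from minor compaction that carries no {{createdTime}}) can be modeled with a toy example. Everything here is hypothetical and heavily simplified: {{WalEntries}}, {{Entry}} and {{hasUsableCreatedTime}} are names invented for this sketch, and the model assumes that only the most recent entry for a WAL matters, which is a simplification of how Accumulo actually stores and combines these records.

```java
import java.util.ArrayList;
import java.util.List;

public class MissingCreatedTimeTimeline {

    /** Hypothetical stand-in for one replication status write; not Accumulo's class. */
    static final class Entry {
        final long writeTimestamp;
        final Long createdTime; // null models a write without a createdTime

        Entry(long writeTimestamp, Long createdTime) {
            this.writeTimestamp = writeTimestamp;
            this.createdTime = createdTime;
        }
    }

    /** Toy model of the entries kept for a single WAL, newest write last. */
    static final class WalEntries {
        private final List<Entry> entries = new ArrayList<>();

        void write(long writeTimestamp, Long createdTime) {
            entries.add(new Entry(writeTimestamp, createdTime));
        }

        /** Models the Master removing the entries once replication has completed. */
        void removeAll() {
            entries.clear();
        }

        Entry latest() {
            return entries.isEmpty() ? null : entries.get(entries.size() - 1);
        }
    }

    /** In this model a WAL can be handled only if its latest entry has a createdTime. */
    static boolean hasUsableCreatedTime(WalEntries wal) {
        Entry latest = wal.latest();
        return latest != null && latest.createdTime != null;
    }

    public static void main(String[] args) {
        WalEntries wal = new WalEntries();

        // 1. The WAL is first recorded with a createdTime.
        wal.write(1L, 100L);
        System.out.println("after first write: " + hasUsableCreatedTime(wal));

        // 2. Replication completes and the entries are removed.
        wal.removeAll();

        // 3. A later write (minor compaction in the comment above) has no createdTime.
        wal.write(2L, null);
        System.out.println("after late write: " + hasUsableCreatedTime(wal));
    }
}
```

In this model the late write leaves the WAL with an entry that can never be processed, which matches the "lacked createdTime" message reported by the Master.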
[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280853#comment-16280853 ]

Adam J Shook commented on ACCUMULO-4751:
----------------------------------------

I have attached some logs tracking a particular WAL file. You can see that it has a {{createdTime}}, but at some point a deleting entry must be written (note that the timestamp changes and the {{createdTime}} is gone) and then other entries are added.
[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277195#comment-16277195 ]

Josh Elser commented on ACCUMULO-4751:
--------------------------------------

bq. My initial finding is that some entries coming out of the StatusUtil don't have a createdTime. Whereas something like [1] writes a metadata entry to a WAL and the replication status is updated with a createdTime. I am guessing that, in this case, there is no ingest for a particular table during the lifetime of the WAL, so the work to replicate the WAL to a particular table (via the key extent) is never created. However, something else is writing some information that doesn't have a createdTime.

Yeah, I do recall this distinction. I remember having a separation between when a WAL is "created" in memory (more like the TServer decides it's going to use that file) and when there are actually records in that file. I could also see replication missing some part of the lifecycle of the WAL itself, as this is pretty obscure (not explicitly documented).

LMK if you need to knock heads together.
[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277154#comment-16277154 ]

Adam J Shook commented on ACCUMULO-4751:
----------------------------------------

My initial finding is that some entries coming out of the {{StatusUtil}} don't have a {{createdTime}}, whereas something like [1] writes a metadata entry to a WAL and the replication status is updated with a {{createdTime}}. I am guessing that, in this case, there is no ingest for a particular table during the lifetime of the WAL, so the work to replicate the WAL to a particular table (via the key extent) is never created. However, something else is writing some information that doesn't have a {{createdTime}}. I'll do some more digging to see if this can be nailed down.

[1] https://github.com/apache/accumulo/blob/rel/1.8.1/server/tserver/src/main/java/org/apache/accumulo/tserver/log/TabletServerLogger.java#L390
[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277096#comment-16277096 ]

Josh Elser commented on ACCUMULO-4751:
--------------------------------------

I would say #2 at first glance, but I am more worried about how we missed the createdTime situation. Ideally, even in the face of TServer crashes, the TServer would set the correct "metadata" on each Status record.

Do you have any hunch as to how this record exists without the createdTime attribute set? It would be nice to confirm that we don't have some other kind of bug lingering in which we're just not writing the record correctly.

I wouldn't be surprised if we would actually need some kind of solution (like #2) to guard against some kind of unlikely situation (e.g. tserver failure) in addition to fixing another bug. In other words, the catch-all to prevent the system from "wedging" on these WALs would be appreciated.
[jira] [Commented] (ACCUMULO-4751) Some WALs don't replicate due to lacking a createdTime entry
[ https://issues.apache.org/jira/browse/ACCUMULO-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277044#comment-16277044 ]

Adam J Shook commented on ACCUMULO-4751:
----------------------------------------

[~elserj] when you have the time, could you let me know your thoughts on 1 and 2 above? Happy to contribute a fix.