I tried building a timeline, but the logs just aren't there. We weren't sending the debug logs to Splunk because of their verbosity, but we may tweak the log4j settings a bit to make sure the debug data is retained in the event this happens again. This could very well be attributable to the recovery failure; hard to say. I'll be upgrading to 1.9.1 soon.
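
For the log4j change, the idea is roughly the following (a minimal sketch against a stock log4j 1.2 properties setup; the logger name, file path, and size limits are placeholders rather than our actual config): keep DEBUG on a local rolling file and leave whatever appender feeds Splunk at INFO or higher.

    # Keep org.apache.accumulo DEBUG output on local disk, rolled, without
    # pushing it to Splunk. Names, paths, and sizes are placeholders.
    log4j.logger.org.apache.accumulo=DEBUG, debuglog
    log4j.appender.debuglog=org.apache.log4j.RollingFileAppender
    log4j.appender.debuglog.File=/var/log/accumulo/tserver.debug.log
    log4j.appender.debuglog.MaxFileSize=512MB
    log4j.appender.debuglog.MaxBackupIndex=10
    log4j.appender.debuglog.layout=org.apache.log4j.PatternLayout
    log4j.appender.debuglog.layout.ConversionPattern=%d{ISO8601} [%c{2}] %-5p: %m%n
    # On the appender that is forwarded to Splunk, set Threshold=INFO so the
    # DEBUG output stays local.
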
On Mon, May 14, 2018 at 8:53 AM, Michael Wall <[email protected]> wrote:

> Can you pick some of the files that are missing and search through your logs to put together a timeline? See if you can find that file for a specific tablet. Then grab all the logs for when a file was created as a result of a compaction, and when a file was included in a compaction for that table. Follow compactions for that tablet until you started getting errors. Then see what logs you have for WAL replay during that time for that tablet and the metadata, and try to correlate.
>
> It's a shame you don't have the GC logs. If you saw a file was GC'd and then showed up in the metadata table again, that would help explain what happened. Like Christopher mentioned, this could be related to a recovery failure.
>
> Mike
>
> On Sat, May 12, 2018 at 5:26 PM Adam J. Shook <[email protected]> wrote:
>
>> WALs are turned on. Durability is set to flush for all tables except for root and metadata, which are sync. The current rfile names on HDFS and in the metadata table sort after the files that are missing. Searched through all of our current and historical logs in Splunk (which are only INFO level or higher). Issues from the logs:
>>
>> * Problem reports saying the files are not found
>> * IllegalStateException saying the rfile is closed when it tried to load the Bloom filter (likely the flappy DataNode)
>> * IOException when reading the file saying the stream is closed (likely the flappy DataNode)
>>
>> Nothing in the GC logs -- all the above errors are in the tablet server logs. The logs may have rolled over, though, and our debug logs don't make it into Splunk.
>>
>> --Adam
>>
>> On Fri, May 11, 2018 at 6:16 PM, Christopher <[email protected]> wrote:
>>
>>> Oh, it occurs to me that this may be related to the WAL bugs that Keith fixed for 1.9.1... which could affect the metadata table recovery after a failure.
>>>
>>> On Fri, May 11, 2018 at 6:11 PM Michael Wall <[email protected]> wrote:
>>>
>>>> Adam,
>>>>
>>>> Do you have GC logs? Can you see if those missing RFiles were removed by the GC process? That could indicate you somehow got old metadata info replayed. Also, the rfiles increment, so compare the current rfile names in the srv.dir directory vs what is in the metadata table. Are the existing files after the files in the metadata? Finally, pick a few of the missing files and grep all your master and tserver logs to see if you can learn anything. This sounds ungood.
>>>>
>>>> Mike
>>>>
>>>> On Fri, May 11, 2018 at 6:06 PM Christopher <[email protected]> wrote:
>>>>
>>>>> This is strange. I've only ever seen this when HDFS has reported problems, such as missing blocks, or another obvious failure. What are your durability settings (were WALs turned on)?
>>>>>
>>>>> On Fri, May 11, 2018 at 12:45 PM Adam J. Shook <[email protected]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> On one of our clusters, there are a good number of missing RFiles from HDFS; however, HDFS has not reported any missing blocks. We were experiencing issues with HDFS; some flapping DataNode processes that needed more heap.
>>>>>>
>>>>>> I don't anticipate I can do much besides create a bunch of empty RFiles (open to suggestions). My question is: is it possible that Accumulo could have written the metadata for these RFiles but failed to write the files themselves to HDFS? In that case, would the write have been retried later and the data persisted to a different RFile? Or is it an 'RFile is in Accumulo metadata if and only if it is in HDFS' situation?
>>>>>>
>>>>>> Accumulo 1.8.1 on HDFS 2.6.0.
>>>>>>
>>>>>> Thank you,
>>>>>> --Adam
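
In case it helps anyone else, below is a minimal sketch of the metadata-vs-HDFS cross-check Mike described: enumerate the 'file' column family of accumulo.metadata and test whether each referenced rfile actually exists. The instance name, ZooKeepers, and credentials are placeholders, and it only handles fully qualified paths (relative entries would need to be resolved against the instance volume).

    import java.util.Map;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;

    public class MissingRFileCheck {
      public static void main(String[] args) throws Exception {
        // Placeholders: swap in your instance name, ZooKeepers, and credentials.
        Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Scan the metadata table for every tablet's file entries.
        Scanner scan = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
        scan.fetchColumnFamily(new Text("file"));

        FileSystem fs = FileSystem.get(new Configuration());
        for (Map.Entry<Key,Value> e : scan) {
          // The column qualifier holds the rfile path. Fully qualified paths
          // (hdfs://...) resolve directly; relative entries would need to be
          // prefixed with the instance volume, which this sketch skips.
          String file = e.getKey().getColumnQualifier().toString();
          if (file.contains("://") && !fs.exists(new Path(file))) {
            System.out.println("referenced but missing: " + e.getKey().getRow() + " -> " + file);
          }
        }
      }
    }

The same information is reachable by scanning accumulo.metadata for the file column family in the Accumulo shell; the sketch just automates the existence check against HDFS.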
