Thanks for all of your help. We have a peer cluster that we'll be using to
do some data reconciliation.
On Wed, May 16, 2018 at 11:29 AM, Michael Wall wrote:
Since the rfiles on disk are "later" than the ones referenced, I tend to
think old metadata got rewritten. Since you can't get a timeline to better
understand what happened, the only thing I can think of is to reingest all
data since a known good point, and then do things to make the future better.
I tried building a timeline but the logs are just not there. We weren't
sending the debug logs to Splunk due to the verbosity, but we may be
tweaking the log4j settings a bit to make sure we get the log data stored
in the event this happens again. This very well could be attributed to the
Can you pick some of the files that are missing and search through your
logs to put together a timeline? See if you can find that file for a
specific tablet. Then grab all the logs for when a file was created as a
result of a compaction, and when a file was included in a compaction for
that table.
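For anyone reconstructing this later, a rough sketch of that per-file timeline search over exported logs (the log lines, format, and helper names here are hypothetical; adapt the regex to whatever your log4j pattern emits):

```python
import re
from datetime import datetime

# Hypothetical log4j-style prefix: "2018-05-10 14:02:11,123 DEBUG ..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def timeline(log_lines, rfile_name):
    """Return (timestamp, line) pairs mentioning rfile_name, oldest first."""
    events = []
    for line in log_lines:
        if rfile_name in line:
            m = TS_RE.match(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                events.append((ts, line.rstrip()))
    return sorted(events)

# Placeholder lines standing in for a Splunk export:
logs = [
    "2018-05-10 14:05:00,001 DEBUG Compaction wrote F0000a2b.rf",
    "2018-05-10 14:02:11,123 DEBUG MajC input F0000a2b.rf",
    "2018-05-10 14:03:02,456 DEBUG unrelated line",
]
for ts, line in timeline(logs, "F0000a2b.rf"):
    print(ts, line)
```

Sorting by parsed timestamp rather than file order matters once you are merging logs from multiple tservers.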
WALs are turned on. Durability is set to flush for all tables except for
root and metadata which are sync. The current rfile names on HDFS and in
the metadata table are greater than the files that are missing. I searched
through all of our current and historical logs in Splunk (which are only
Oh, it occurs to me that this may be related to the WAL bugs that Keith
fixed for 1.9.1... which could affect the metadata table recovery after a
failure.
On Fri, May 11, 2018 at 6:11 PM Michael Wall wrote:
Adam,
Do you have GC logs? Can you see if those missing RFiles were removed by
the GC process? That could indicate you somehow got old metadata info
replayed. Also, the rfiles increment so compare the current rfile names in
the srv.dir directory vs what is in the metadata table. Are the
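A sketch of that "the rfiles increment" comparison, assuming the usual Accumulo naming of a type letter followed by a zero-padded base-36 counter (e.g. F00003gk.rf); the two lists below are placeholders for what you would pull from `hdfs dfs -ls` and a scan of the metadata table:

```python
def counter(rfile_name):
    """Parse the base-36 counter out of a name like 'F00003gk.rf'."""
    return int(rfile_name[1:].split(".")[0], 36)

# Placeholder file lists; substitute real listings from HDFS and metadata.
hdfs_files = ["F00003gk.rf", "A00003h1.rf"]
metadata_files = ["F00002zz.rf", "A00003a0.rf"]

print(max(counter(f) for f in hdfs_files))
print(max(counter(f) for f in metadata_files))
# If HDFS's max counter is well ahead of what the metadata table references,
# that is consistent with old metadata info having been replayed.
```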
This is strange. I've only ever seen this when HDFS has reported problems,
such as missing blocks, or another obvious failure. What are your durability
settings (were WALs turned on)?
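For reference, one way to check the effective per-table setting from the command line ("mytable" and the credentials are placeholders; adjust for your site):

```shell
# Inspect table.durability for the system tables and a user table.
accumulo shell -u root -e 'config -t accumulo.root -f table.durability'
accumulo shell -u root -e 'config -t accumulo.metadata -f table.durability'
accumulo shell -u root -e 'config -t mytable -f table.durability'
```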
On Fri, May 11, 2018 at 12:45 PM Adam J. Shook wrote:
Hello all,
On one of our clusters, there are a good number of missing RFiles from
HDFS; however, HDFS has not reported any missing blocks. We were
experiencing issues with HDFS; some flapping DataNode processes that needed
more heap.
I don't anticipate I can do much besides create a bunch
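To rule out silent block loss before reingesting, a quick cross-check of fsck output against the directories that should hold the missing files (the default /accumulo layout is assumed; <tableId> is a placeholder for the affected table's id):

```shell
# Look for any missing/corrupt blocks HDFS knows about under the tables root.
hdfs fsck /accumulo/tables -files -blocks | grep -iE 'missing|corrupt'

# List the rfiles actually present for the affected table.
hdfs dfs -ls -R /accumulo/tables/<tableId> | grep '\.rf$'
```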