Re: Question on missing RFiles
Thanks for all of your help. We have a peer cluster that we'll be using to do some data reconciliation.
Re: Question on missing RFiles
Since the rfiles on disk are "later" than the ones referenced, I tend to think old metadata got rewritten. Since you can't get a timeline to better understand what happened, the only thing I can think of is to reingest all data since a known good point, and then do things to make the future better, like tweaking which logs you save and upgrading to 1.9.1. Sorry, I wish I had better answers for you.
Re: Question on missing RFiles
I tried building a timeline but the logs are just not there. We weren't sending the debug logs to Splunk due to the verbosity, but we may be tweaking the log4j settings a bit to make sure we get the log data stored in the event this happens again. This very well could be attributed to the recovery failure; hard to say. I'll be upgrading to 1.9.1 soon.
Re: Question on missing RFiles
Can you pick some of the files that are missing and search through your logs to put together a timeline? See if you can find that file for a specific tablet. Then grab all the logs for when a file was created as a result of a compaction, and when a file was included in a compaction for that tablet. Follow compactions for that tablet until you started getting errors. Then see what logs you have for WAL replay during that time for that tablet and the metadata, and try to correlate.

It's a shame you don't have the GC logs. If you saw it was GC'd and then showed up in the metadata table again, that would help explain what happened. Like Christopher mentioned, this could be related to a recovery failure.

Mike
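For anyone following along, a rough sketch of the kind of per-file timeline grep Mike describes, assuming plain-text tserver/master logs with the default log4j timestamp layout (the file names and log lines in the example are illustrative, not taken from this cluster):

```python
import re

def file_timeline(log_lines, rfile_name):
    """Collect log lines mentioning a specific rfile, ordered by timestamp.

    Assumes each line starts with a log4j-style timestamp like
    '2018-05-10 14:02:31,114'; lines without one are kept with an
    empty timestamp so they still appear (sorted first).
    """
    ts_pat = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[,.]\d+)')
    events = []
    for line in log_lines:
        if rfile_name in line:
            m = ts_pat.match(line)
            events.append((m.group(1) if m else '', line.rstrip()))
    return sorted(events)
```

Run this across the master and tserver logs for one missing file and you get the compaction/GC events for that file in order, which is exactly the timeline to correlate against WAL replay.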
Re: Question on missing RFiles
WALs are turned on. Durability is set to flush for all tables except for root and metadata, which are sync. The current rfile names on HDFS and in the metadata table are greater than the files that are missing. Searched through all of our current and historical logs in Splunk (which are only INFO level or higher). Issues from the logs:

* Problem reports saying the files are not found
* IllegalStateException saying the rfile is closed when it tried to load the Bloom filter (likely the flappy DataNode)
* IOException when reading the file saying Stream is closed (likely the flappy DataNode)

Nothing in the GC logs -- all the above errors are in the tablet server logs. The logs may have rolled over, though, and our debug logs don't make it into Splunk.

--Adam
Re: Question on missing RFiles
Oh, it occurs to me that this may be related to the WAL bugs that Keith fixed for 1.9.1... which could affect the metadata table recovery after a failure.
Re: Question on missing RFiles
Adam,

Do you have GC logs? Can you see if those missing RFiles were removed by the GC process? That could indicate you somehow got old metadata info replayed. Also, the rfiles increment, so compare the current rfile names in the srv.dir directory vs what is in the metadata table. Are the existing files after the files in the metadata? Finally, pick a few of the missing files and grep all your master and tserver logs to see if you can learn anything. This sounds ungood.

Mike
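One way to script the comparison Mike suggests, assuming you have already dumped the file entries from the metadata table and an HDFS directory listing into lists of base file names (the function name and example file names are illustrative):

```python
def compare_file_sets(metadata_files, hdfs_files):
    """Compare rfile names referenced in the metadata table against
    the rfiles actually present in HDFS.

    Both arguments are iterables of base file names (e.g. 'F0000abc.rf').
    Returns (missing_from_hdfs, unreferenced_in_hdfs), each sorted.
    Because rfile names increment, the sorted output also shows whether
    the surviving files come 'after' the missing ones.
    """
    meta = set(metadata_files)
    hdfs = set(hdfs_files)
    return sorted(meta - hdfs), sorted(hdfs - meta)
```

Files in the first list are referenced by metadata but gone from disk (the problem here); files in the second list exist on disk but are unreferenced, which would be candidates for GC.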
Re: Question on missing RFiles
This is strange. I've only ever seen this when HDFS has reported problems, such as missing blocks, or another obvious failure. What are your durability settings (were WALs turned on)?

On Fri, May 11, 2018 at 12:45 PM Adam J. Shook wrote:

> Hello all,
>
> On one of our clusters, there are a good number of missing RFiles from
> HDFS, however HDFS is not/has not reported any missing blocks. We were
> experiencing issues with HDFS; some flapping DataNode processes that
> needed more heap.
>
> I don't anticipate I can do much besides create a bunch of empty RFiles
> (open to suggestions). My question is: is it possible that Accumulo could
> have written the metadata for these RFiles but failed to write it to
> HDFS? In which case it would have been re-tried later and the data was
> persisted to a different RFile? Or is it an 'RFile is in Accumulo
> metadata if and only if it is in HDFS' situation?
>
> Accumulo 1.8.1 on HDFS 2.6.0.
>
> Thank you,
> --Adam