Re: Bloom filter thread failure errors

Keith Turner Thu, 05 Dec 2013 13:22:43 -0800

On Thu, Dec 5, 2013 at 3:37 PM, Terry P. <[email protected]> wrote:

> Hi Keith,
> Here is the stack trace in the tserver DEBUG log for this most recent
> exception. The exception section is the same as what's in the main tserver
> log, but of course the MajC bits don't appear in the main log. This is
> hand-typed, but I'm pretty sure it's right.
>
>
> 2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
> 0.00 secs, wait 0.00 secs
> 2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
> (NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
> [/t-0000aa9/C0000zn4.rf_tmp
> 2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
> "bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
> closed
> java.lang.IllegalStateException: File
> /accumulo/tables/2/t-0000aa9/C0000zmf.rf is closed
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:244)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$000(CachableBlockFile.java:142)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:211)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:307)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:357)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:142)
>   at
> org.apache.accumulo.core.file.rfile.RFile$Reader.getMetaStore(Rfile.java:927)
>   at
> org.apache.accumulo.core.file.BloomFilterLayer$BloomFilterLoader$1.run(BloomFilterLayer.java:210)
>   at
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>   at java.lang.Thread.run(Unknown Source)
> 2013-12-03 11:48:14,900 [tabletserver.Compactor] DEBUG: Compaction 2;f;d
> 280 read | 280 written |  2,500 entries/sec | 0.112 secs
> 2013-12-03 11:48:14,924 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00
> secs
>
> The biggest bummer here is just that it appears on the Monitor GUI as an
> error, and we all know how Operators don't like "errors" on their screens
> ;-)  But if this is one that can be safely ignored, we'll just have to
> write that up in a procedure somewhere.
>


The code in BloomFilterLayer$BloomFilterLoader$1.run() logs IOExceptions at
debug when the file is closed.  Because this is an IllegalStateException
its not being ignored.   Would you like to open a bug for this?


>
>
>
> On Thu, Dec 5, 2013 at 9:50 AM, Keith Turner <[email protected]> wrote:
>
>>
>>
>>
>> On Wed, Dec 4, 2013 at 7:29 PM, Terry P. <[email protected]> wrote:
>>
>>> Hi Eric,
>>> Thanks for your reply, I'm just now getting back to this as I had more
>>> of these the past two days. No tserver failures or master halts. With
>>> previous errors we were still experiencing network issues that were indeed
>>> taking tabletservers down, but now that they fixed a bad line card in a
>>> switch that had been rebooting itself (but not failing over), those issues
>>> are all gone (finally, knock on wood).
>>>
>>> Now that I see them again in isolation with no other errors, in the main
>>> tserver log these bloom-loader thread failures appear to happen out of the
>>> blue with no other issues surrounding them.
>>>
>>> However, I just checked the debug log and see they are occurring right
>>> at the time of a Major Compaction.  E.g. from one of the tservers debug log:
>>>
>>> 2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
>>> 0.00 secs, wait 0.00 secs
>>> 2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
>>> (NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
>>> [/t-0000aa9/C0000zn4.rf_tmp
>>> 2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
>>> "bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
>>> closed
>>>
>>> The rest of the stack looks like what I posted earlier. The very next
>>> debug log message after the bloom loader exception is shows that the
>>> Compaction completed successfully in 0.112 seconds.
>>>
>>> So it looks like the bloom loader is trying to open an rfile 41ms after
>>> a compaction started, and the file was likely just compacted during that
>>> gap between the calls. If that's the case, can this error be safely ignored?
>>>
>>
>> Its probably safe to ignore.   Bloomfilters are loaded lazily by a
>> background thread and its possible the file will be closed by the time the
>> background thread gets around to loading it.  However it should log a debug
>> in this case, so I am curious why an ERROR is logged.  Is there a stack
>> trace associated with the message 'Thread "bloom-loader-41" ...' ?
>>
>>
>>>
>>> Thanks,
>>> Terry
>>>
>>>
>>>
>>> On Mon, Nov 18, 2013 at 8:56 PM, Eric Newton <[email protected]>wrote:
>>>
>>>> This is an educated guess...
>>>>
>>>> When a process dies "gracefully" there's a shutdown hook that closes
>>>> the FileSystem.  That can result in messages like this.  It's likely
>>>> there's an error before this about a zookeeper session being lost, or a
>>>> halt issued by the master.  See if this tserver died shortly after this
>>>> message. If so, ignore the message.
>>>>
>>>> -Eric
>>>>
>>>>
>>>>
>>>> On Fri, Nov 15, 2013 at 4:31 PM, Terry P. <[email protected]> wrote:
>>>>
>>>>> Greetings folks,
>>>>> In my Accumulo 1.4.2 cluster I am seeing ERRORS about bloom loader
>>>>> threads dying due to an rfile being closed.  I can't copy/paste the error
>>>>> as it's on an air-gapped system, but it starts with:
>>>>>
>>>>> ERROR Thread "bloom-loader-2147" died File
>>>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>>>   java.lang.IllegalStateException: File
>>>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>>>     at
>>>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.getBCFile(CacheableBlockFile.java:244)
>>>>>     at
>>>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.access$000(CacheableBlockFile.java:142)
>>>>> (10 more java files ... ends with java.lang.Thread.run(UnknownSource) )
>>>>>
>>>>> No real rhyme or reason as to when they occur; we are predominantly
>>>>> ingest heavy with light reads by rowkey with ~10 entries per rowkey.  I
>>>>> don't really know if client programs are getting errors when these occur 
>>>>> or
>>>>> not.
>>>>>
>>>>> I didn't find any JIRAs related to these.  Should I be concerned about
>>>>> these?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Bloom filter thread failure errors

Reply via email to