When a tablet server (let's call it A) bulk imports a file, it makes a few
bookkeeping entries in the !METADATA table. The tablet server that is
serving the !METADATA table (let's call it B) checks a constraint: tablet
server A must still hold its ZooKeeper lock. That constraint is being
violated because A has lost its lock.
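The check described above can be sketched roughly as follows. This is an
illustrative model only, with hypothetical class and method names
(LockRegistry, acceptMutation) -- it is not Accumulo's actual
MetadataConstraints code:

```java
import java.util.HashMap;
import java.util.Map;

// Rough model of the rule: the server hosting !METADATA accepts a
// bookkeeping mutation only if the writer still holds the ZooKeeper
// lock it claims in the srv:lock column.
public class LockRegistry {
    // tserver address -> lock path it currently holds
    private final Map<String, String> heldLocks = new HashMap<>();

    public void grant(String server, String lockPath) {
        heldLocks.put(server, lockPath);
    }

    public void revoke(String server) {
        heldLocks.remove(server);
    }

    // A mutation is accepted only while the claimed lock is still held;
    // once the lock is lost, the same mutation becomes a violation.
    public boolean acceptMutation(String server, String claimedLock) {
        return claimedLock.equals(heldLocks.get(server));
    }
}
```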

Tablet server A should have died.

The native map is used for live data ingest and exists outside of the Java
heap.  The caches live in the heap.
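For reference, that split is controlled by separate settings. In an
Accumulo 1.5-era configuration the relevant properties look roughly like
this (the sizes below are illustrative examples, not recommendations):

```
tserver.memory.maps.max=1G            # in-memory map for ingest; off-heap when native maps are enabled
tserver.memory.maps.native.enabled=true
tserver.cache.data.size=256M          # data block cache, lives in the JVM heap
tserver.cache.index.size=512M         # index cache, also on-heap
```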

-Eric


On Wed, Jan 15, 2014 at 8:19 AM, Anthony F <[email protected]> wrote:

> Just checked on the native mem maps . . . looks like it is set to 1GB.  Do
> the index and data caches reside in native mem maps if available or is
> native mem used for something else?
>
> I just repeated an ingest . . . this time I did not lose any tablet
> servers but my logs are filling up with the following messages:
>
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG: violating
> metadata mutation : b;74~thf
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update:
> file:/b-00012bq/I00012cj.rf value 20272720,0,1389757684543
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update:
> loaded:/b-00012bq/I00012cj.rf value 2675766456963732003
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update:
> srv:time value M1389757684543
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update:
> srv:lock value tservers/
> 192.168.2.231:9997/zlock-0000000002$2438da698db13b4
>
>
>
> On Mon, Jan 13, 2014 at 2:44 PM, Sean Busbey <[email protected]>wrote:
>
>>
>> On Mon, Jan 13, 2014 at 12:02 PM, Anthony F <[email protected]> wrote:
>>
>>> Yes, system swappiness is set to 0.  I'll run again and gather more logs.
>>>
>>> Is there a zookeeper timeout setting that I can adjust to avoid this
>>> issue and is that advisable?  Basically, the tservers are colocated with
>>> HDFS datanodes and Hadoop nodemanagers.  The machines are overallocated in
>>> terms of RAM.  So, I have a feeling that when a map-reduce job is kicked
>>> off, it causes the tserver to page out to swap space.  Once the map-reduce
>>> job finishes and the bulk ingest is kicked off, the tserver is paged back
>>> in and the ZK timeout causes a shutdown.
>>>
>>>
>>>
>> You should not overallocate the amount of memory on the machines.
>> Generally, you should provide memory limits under the assumption that
>> everything will be on at once.
>>
>> Many parts of Hadoop (not just Accumulo) will degrade or malfunction in
>> the presence of memory swapping.
>>
>> How much of the 12GB for Accumulo is for native memmaps?
>>
>
>