RE: Failing to BulkIngest [SEC=UNOFFICIAL]

Dickson, Matt MR Thu, 20 Feb 2014 19:56:29 -0800

UNOFFICIAL

Thanks for that.

We recreated the nodes and restarted Accumulo, but it went through and added 
the locks back during start-up, so it appears Accumulo has knowledge of the 
locks, maybe in the metadata table(?), and has updated the fate locks in 
ZooKeeper.  The issue of bulk ingest failing is still occurring.

How can we investigate within Accumulo how it tracks these locks so that we can 
flush this information also or identify the issue?

Matt
________________________________
From: Eric Newton [mailto:[email protected]]
Sent: Friday, 21 February 2014 14:27
To: [email protected]
Subject: Re: Failing to BulkIngest [SEC=UNOFFICIAL]

Sorry... I should have been more clear.

"-e" is for ephemeral, these are not ephemeral nodes. I think "-s" is the 
default, so you don't need to specify it.

You can put anything in for the data.. it is unimportant:

cli>  create /accumulo/xx.../fate foo
cli>  create /accumulo/xx.../table_locks bar

I think that you can give the zkCli.sh shell quotes for an empty string:

cli> create /accumulo/xx.../fate ""

But, I can't remember if that works.  Accumulo never reads the contents of 
those nodes, so anything you put in there will be ignored.

The master may even re-create these nodes on start-up, but I did not test it.
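
As a sketch of the commands above: the helper below only prints the zkCli.sh 
"create" lines for a given instance id, so they can be reviewed before being 
pasted into the zkCli.sh prompt. The function name print_create_cmds and the 
placeholder data "foo" are illustrative. In zkCli.sh, "-s" creates a 
sequential node and "-e" an ephemeral one; with neither flag the node is a 
plain persistent node, which is what is wanted here.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the zkCli.sh commands that recreate
# the two nodes. The argument is the instance id (the UUID on the monitor page).
print_create_cmds() {
    instance_id="$1"
    for node in fate table_locks; do
        # "foo" is placeholder data; per the note above it is never read.
        printf 'create /accumulo/%s/%s foo\n' "$instance_id" "$node"
    done
}
```

Calling print_create_cmds with the instance id prints one "create" line per 
node; no flags are passed, so nothing ephemeral or sequential is created.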

-Eric

On Thu, Feb 20, 2014 at 6:18 PM, Dickson, Matt MR 
<[email protected]> wrote:

UNOFFICIAL

After running the zkCli.sh rmr on the directories, we are having difficulties 
recreating the nodes.

The ZooKeeper create command has two options, -s and -e, but it's not clear 
what each of these does or which one to use to recreate the accumulo node.  
Also, the create command requires a 'data' value to be specified; however, 
when we look at our QA system the accumulo node has no data within it.

What is the ZooKeeper command to run to recreate the /accumulo/xx.../fate and 
/accumulo/xx.../table_locks nodes?

________________________________
From: Eric Newton [mailto:[email protected]]
Sent: Friday, 21 February 2014 07:31
To: [email protected]
Subject: Re: Failing to BulkIngest [SEC=UNOFFICIAL]

No, xxx... is your instance id.  You can find it at the top of the monitor 
page. It's the ugly UUID there.

-Eric

On Thu, Feb 20, 2014 at 3:26 PM, Dickson, Matt MR 
<[email protected]> wrote:

UNOFFICIAL

Is the xxx... the transaction id returned by the 'fate.Admin print'?

What's involved with recreating a node?

Matt

________________________________
From: Eric Newton [mailto:[email protected]]
Sent: Friday, 21 February 2014 01:35
To: [email protected]
Subject: Re: Failing to BulkIngest [SEC=UNOFFICIAL]

You can use the zkCli.sh utility to "rmr" /accumulo/xx.../fate and 
/accumulo/xx.../table_locks, and then recreate those nodes.
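
A sketch of that removal step, with the instance id supplied by hand: the 
hypothetical helper print_rmr_cmds only prints the "rmr" lines so they can be 
checked before being typed into zkCli.sh. Because "rmr" deletes a node and 
everything beneath it, the helper refuses ids that do not look like a UUID; 
that shape check is an assumption about the instance id, not something 
Accumulo requires.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the zkCli.sh 'rmr' commands for the
# two nodes. Rejects ids that do not match the assumed UUID shape, so a typo
# cannot produce a path such as /accumulo//fate.
print_rmr_cmds() {
    instance_id="$1"
    uuid_re='^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
    if ! printf '%s\n' "$instance_id" | grep -Eq "$uuid_re"; then
        echo "not a UUID: $instance_id" >&2
        return 1
    fi
    for node in fate table_locks; do
        printf 'rmr /accumulo/%s/%s\n' "$instance_id" "$node"
    done
}
```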

-Eric

On Wed, Feb 19, 2014 at 5:58 PM, Dickson, Matt MR 
<[email protected]> wrote:

UNOFFICIAL

Thanks for your help on this Eric.

I've started deleting the transactions by running ./accumulo ...fate.Admin 
delete <txid>, and notice this takes about 20 seconds per transaction.  With 
7500 to delete this is going to take a long time (almost 2 days), so I tried 
running several threads, each with a separate range of ids to delete.  
Unfortunately this seemed to cause some contention and I kept receiving an 
InvocationTargetException .... Caused by zookeeper.KeeperException: 
KeeperErrorCode = noNode for 
/accumulo/xxxxx-xxxx-xxxx-xxxx/table_locks/3n/lock-xxxxxx

When I go back to one thread this error disappears.

Is there a better way to run this?
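
One way to keep the deletes strictly serial is a small loop over a file of 
transaction ids, sketched below. The function name, the id file and the 
DRY_RUN switch are hypothetical; the class name is the one used for "print" 
in the original message of this thread, assumed to accept "delete <txid>" as 
in the command above. At roughly 20 seconds per transaction, 7500 deletes 
come to about 150,000 seconds, a little under 42 hours.

```shell
#!/bin/sh
# Assumed class name, taken from the 'print' command quoted in this thread.
FATE_ADMIN="org.apache.accumulo.server.fate.Admin"

# Delete FATE transactions one at a time. Ids are read from a file, one per
# line. With DRY_RUN=1 (the default here) the commands are only printed.
delete_txids_serially() {
    txid_file="$1"
    while IFS= read -r txid; do
        [ -n "$txid" ] || continue    # skip blank lines
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "./accumulo $FATE_ADMIN delete $txid"
        else
            # </dev/null keeps the child process from consuming the id file.
            ./accumulo "$FATE_ADMIN" delete "$txid" </dev/null \
                || echo "failed: $txid" >&2
        fi
    done < "$txid_file"
}
```

Running it first with DRY_RUN left at its default shows exactly which 
commands would be issued; setting DRY_RUN=0 performs the deletes in order.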

Thanks in advance,
Matt

________________________________
From: Eric Newton [mailto:[email protected]]
Sent: Wednesday, 19 February 2014 01:21
To: [email protected]
Subject: Re: Failing to BulkIngest [SEC=UNOFFICIAL]

The "LeaseExpiredException" is part of the recovery process.  The master 
determines that a tablet server has lost its lock, or it is unresponsive and 
has been halted, possibly indirectly by removing the lock.

The master then steals the write lease on the WAL file, which causes future 
writes to the WALog to fail.  The message you have seen is part of that 
failure.  You should have seen a tablet server failure associated with this 
message on the machine with <ip>.

Having 50K FATE IN_PROGRESS lines is bad.  That is preventing your bulk imports 
from getting run.

Are there any lines that show locked: [W:3n] ?  The other FATE transactions are 
waiting to get a READ lock on table id 3n.
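
To check for such a line without reading through all of the output, the 
"print" output can be filtered. A minimal sketch, assuming each transaction 
is printed on a single line in the format quoted in the original message 
(txid, status, op, locked, locking, top); the function name is hypothetical.

```shell
#!/bin/sh
# Read 'fate.Admin print' output on stdin and keep only the transactions
# whose 'locked:' list contains a WRITE lock on the given table id.
holding_write_lock() {
    table_id="$1"
    # Match "locked: [ ... W:<id> ... ]" without crossing the closing bracket,
    # so an entry in the separate 'locking:' list is not counted.
    grep -E "locked: \[[^]]*W:${table_id}[],]"
}
```

For example, the output of the "print" command can be piped into 
"holding_write_lock 3n"; any line that comes back is a transaction holding 
the WRITE lock that the queued READ requests are waiting on.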

-Eric

On Sun, Feb 16, 2014 at 7:59 PM, Dickson, Matt MR 
<[email protected]> wrote:

UNOFFICIAL

Josh,

ZooKeeper - 3.4.5-cdh4.3.0
Accumulo - 1.5.0
Hadoop - cdh 4.3.0

In the Accumulo console we are getting:

ERROR RemoteException(...LeaseExpiredException): Lease mismatch on 
/accumulo/wal/<ip>+9997/<uid> owned by DFSClient_NONMAPREDUCE_699577321_12 but 
is accessed by DFSClient_NONMAPREDUCE_903051502_12

We can scan the table without issues and can load rows directly, i.e. not 
using bulk import.

A bit more information - we recently extended how we manage old tablets in the 
system. We load data by date, creating splits for each day and then ageoff 
using the ageoff filters.  This leaves empty tablets so we now merge these old 
tablets together to effectively remove them.  I mention it because I'm not sure 
if this might have introduced another issue.

Matt

-----Original Message-----
From: Josh Elser [mailto:[email protected]]
Sent: Monday, 17 February 2014 11:32
To: [email protected]
Subject: Re: Failing to BulkIngest [SEC=UNOFFICIAL]

Matt,

Can you provide Hadoop, ZK and Accumulo versions? Does the cluster appear to be 
functional otherwise (can you scan that table you're bulk importing to? any 
other errors on the monitor? etc)

On 2/16/14, 7:07 PM, Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> I have a situation where bulk ingests are failing with a 'Thread "shell"
> stuck on IO to xxx:9999:99999 ...' error.
> From the management console the table we are loading to has no
> compactions running, yet we ran "./accumulo
> org.apache.accumulo.server.fate.Admin print" and can see 50,000 lines
> stating
> txid: xxxx     status:IN_PROGRESS op: CompactRange     locked: []
> locking: [R:3n]     top: Compact:Range
> Does this mean there are actually compactions running, or are old
> compaction locks still hanging around that will prevent the bulk
> ingest from running?
> Thanks in advance,
> Matt
