Looks like the only things we have in the gc logs are:
DEBUG: deleted [hdfs://../accumulo/wal/<uuid> ...]
DEBUG: Removing sorted WAL hdfs://...<uuid>
I can't tell whether they occurred before or after I deleted the
file
hdfs://accumulo/wal/<uuid>/failed
Here's the other issue we were looking at:
https://issues.apache.org/jira/browse/ACCUMULO-3727
FYI, I originally increased the number of WALs to 8 to help batch-write
ingest. I've since applied that setting only to the tables that needed
the ingest rather than to the entire cluster, reset the number of WALs
for the cluster back to 3, and haven't had any errors since (3 days).
Not sure why that would be a problem except for the few times that the
metadata table was involved.
Andrew
On 03/18/2016 09:43 AM, Andrew Hulbert wrote:
I'll tar them up and see what I can find! Thanks.
On 03/17/2016 08:18 PM, Michael Wall wrote:
Andrew,
Sounds a lot like
https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see
if what you describe could also happen with this bug. If you still
have the gc logs, can you look for a message like "Removing WAL for
offline server" with the uuid?
Mike
On Tue, Mar 8, 2016 at 11:28 AM, Andrew Hulbert <[email protected]> wrote:
Hi folks,
We experienced a problem this morning with a recovery on 1.6.1
that went something like this:
FileNotFoundException: File does not exist:
hdfs:///accumulo/recovery/<uuid>/failed/data
at Tablet.java:1410
at Tablet.java:1233
etc.
at TabletServer:2923
Interestingly enough, hdfs:///accumulo/recovery/<uuid>/failed
was a 0-byte file, not a directory, and it was preventing
tablets from getting assigned. (I am not sure what caused the
original failure, but I believe a tserver node was going down;
the master indicated it was trying to shut down a tserver that
was in such bad shape that someone just rekicked the node.)
I looked through the fixes for 1.6.2 through 1.6.5 but didn't see
anything related on the release notes pages; I haven't gone
through all the tickets yet, though. I haven't been able to get
anyone to upgrade to 1.6.5 yet, so perhaps it's already fixed.
Just wondering if that's something that has been seen before?
To fix it, I just deleted the failed file and recovery proceeded
(sketch below).
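In case it helps anyone hitting the same thing, here's a minimal
sketch of the check-and-delete using the Hadoop FileSystem API. The
recovery path is passed in as an argument since the real uuid varies;
nothing here is Accumulo-specific, it just confirms the "failed"
marker is an empty file rather than a directory before removing it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClearFailedMarker {
      public static void main(String[] args) throws Exception {
        // args[0]: e.g. hdfs:///accumulo/recovery/<uuid>/failed
        Path failed = new Path(args[0]);
        FileSystem fs = FileSystem.get(failed.toUri(), new Configuration());
        FileStatus status = fs.getFileStatus(failed);
        // Expected state is a directory; a 0-byte *file* is the bad
        // state we saw, so only delete in that case.
        if (status.isFile() && status.getLen() == 0) {
          fs.delete(failed, false); // non-recursive delete of the empty file
        }
      }
    }

Equivalent to what I did by hand, just with the sanity check built in.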
Thanks!
Andrew