+1 to that. Great suggestion, Mike, and great find, Matt!

I think this would be a great thing to capture in the Accumulo User Manual, if you're interested.

http://accumulo.apache.org/1.8/accumulo_user_manual.html#_troubleshooting

Michael Wall wrote:
Hi Matt,

Glad you got the metadata table to come up.  So some more questions for you.

How many nodes do you have?
How many tservers?
How many tablets are hosted per tserver across all tables?

If you deleted a table, those entries in the metadata table should be
gone.  Are you still seeing stuff from the deleted table in the metadata
table?  If all metadata entries are in one tablet, then there are no
splits for the metadata table and running merge will not help.  After we
see the answers to the questions above, I will try to recommend
something else.

Mike
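
As background for the split question: further down this thread, Christopher notes that root table entries correspond to split points in the metadata table. A minimal Python sketch of reading such a row id follows. It assumes the metadata table id is "!0", that root table rows look like "!0;<metadata end row>" (or "!0<" for the last metadata tablet), and that metadata rows look like "<table id>;<end row>" or "<table id><". The function names are hypothetical, and this is not code from Accumulo itself.

```python
from typing import Optional, Tuple

# Assumption: "!0" is the metadata table id, as seen in the rows quoted in this thread.
METADATA_TABLE_ID = "!0"


def parse_root_row(row: str) -> Tuple[str, Optional[str]]:
    """Split a root-table row id into (metadata table id, metadata end row).

    The end row is None for the last metadata tablet, assumed to be "!0<".
    """
    if row == METADATA_TABLE_ID + "<":
        return METADATA_TABLE_ID, None
    prefix = METADATA_TABLE_ID + ";"
    if not row.startswith(prefix):
        raise ValueError("not a metadata tablet row: %r" % row)
    return METADATA_TABLE_ID, row[len(prefix):]


def split_point_table_id(metadata_end_row: Optional[str]) -> Optional[str]:
    """Return the table id embedded in a metadata split point, if any."""
    if metadata_end_row is None:
        return None
    for i, ch in enumerate(metadata_end_row):
        if ch in ";<":
            return metadata_end_row[:i]
    return metadata_end_row


if __name__ == "__main__":
    _, end_row = parse_root_row("!0;1vm;125.323.233.23::2016103")
    print("metadata split point:", end_row)
    print("table id in split point:", split_point_table_id(end_row))
```

Under those assumptions, a root row starting with "!0;1vm" only says that the metadata table once split at a row belonging to table 1vm. It does not require table 1vm to still exist.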

On Tue, Feb 21, 2017 at 6:22 PM Dickson, Matt MR
<matt.dick...@defence.gov.au <mailto:matt.dick...@defence.gov.au>> wrote:


    *UNOFFICIAL*

    Firstly, thank you for your advice; it's been very helpful.
    Increasing the tablet server memory has allowed the metadata table
    to come online.  From using the rfile-info and looking at the splits
    for the metadata table it appears that all the metadata table
    entries are in one tablet.  All tablet servers then query the one
    node hosting that tablet.
    I suspect the cause of this was a poorly designed table for which the
    Accumulo GUI at one point reported 1.02T tablets.  We've subsequently
    deleted that table, but it might be that there were so many entries in
    the metadata table that all splits on it were due to this massive
    table, which had the table id 1vm.
    To rectify this, is it safe to run a merge on the metadata table to
    force it to redistribute?

    ------------------------------------------------------------------------
    *From:* Michael Wall [mailto:mjw...@gmail.com
    <mailto:mjw...@gmail.com>]
    *Sent:* Wednesday, 22 February 2017 02:44

    *To:* user@accumulo.apache.org <mailto:user@accumulo.apache.org>
    *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
    Matt,

    If I am reading this correctly, you have a tablet that is being
    loaded onto a tserver.  That tserver dies, so the tablet is then
    assigned to another tserver.  While the tablet is being loaded, that
    tserver dies, and so on.  Is that correct?

    Can you identify the tablet that is bouncing around?  If so, try
    using rfile-info -d to inspect the rfiles associated with that
    tablet.  Also look at the rfiles that compose that tablet to see if
    anything sticks out.

    Any logs that would help explain why the tablet server is dying?
    Can you increase the memory of the tserver?

    Mike

    On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.el...@gmail.com
    <mailto:josh.el...@gmail.com>> wrote:

        ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
        communicating with ZooKeeper, will retry
        SessionExpiredException: KeeperErrorCode = Session expired for
        /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory

        There can be a number of causes for this, but here are the most
        likely ones.

        * JVM gc pauses
        * ZooKeeper max client connections
        * Operating System/Hardware-level pauses

        The first should be noticeable in the Accumulo log. There is a daemon
        running which watches for pauses that happen and then reports them. If
        this is happening, you might have to give the process some more Java
        heap, tweak your CMS/G1 parameters, etc.

        For maxClientConnections, see
        https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html

        For the last, swappiness is the most likely candidate (assuming this
        is hopping across different physical nodes), as are "transparent huge
        pages". If it is limited to a single host, things like bad NICs, hard
        drives, and other hardware issues might be a source of slowness.
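
The OS-level settings mentioned above can be checked on each node with a small script. The following is a minimal Python sketch, assuming a Linux host that exposes the usual /proc/sys/vm/swappiness and /sys/kernel/mm/transparent_hugepage/enabled files. The function names are hypothetical, and the script only reports values; it does not judge whether they are appropriate for a given cluster.

```python
from pathlib import Path
from typing import Optional

# Assumption: standard Linux locations. Some older distributions expose
# transparent huge pages under a different directory name.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_swappiness(text: str) -> int:
    """Parse the contents of the swappiness file, e.g. '60\\n'."""
    return int(text.strip())


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the active THP mode; the kernel marks it with brackets,
    e.g. 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_text_or_none(path: Path) -> Optional[str]:
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    swappiness = read_text_or_none(SWAPPINESS_PATH)
    thp = read_text_or_none(THP_PATH)
    if swappiness is None:
        print("vm.swappiness: unavailable")
    else:
        print("vm.swappiness:", parse_swappiness(swappiness))
    if thp is None:
        print("transparent huge pages: unavailable")
    else:
        print("transparent huge pages:", parse_thp_mode(thp))
```

Running this on every tserver host and comparing the output is one way to spot a node that is configured differently from the rest.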

        On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
        <matt.dick...@defence.gov.au
        <mailto:matt.dick...@defence.gov.au>> wrote:
         > UNOFFICIAL
         >
         > It looks like an issue with one of the metadata table tablets. On startup
         > the server that hosts a particular metadata tablet gets scanned by all other
         > tablet servers in the cluster.  This then crashes that tablet server with an
         > error in the tserver log:
         >
         > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
         > communicating with ZooKeeper, will retry
         > SessionExpiredException: KeeperErrorCode = Session expired for
         > /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
         >
         > That metadata table tablet is then transferred to another host which then
         > fails also, and so on.
         >
         > While the server is hosting this metadata tablet, we see the following log
         > statement from all tserver.logs in the cluster:
         >
         > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
         > org.apache.thrift.transport.TTransportException  null
         > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
         >
         > Hope that helps complete the picture.
         >
         >
         > ________________________________
         > From: Christopher [mailto:ctubb...@apache.org <mailto:ctubb...@apache.org>]
         > Sent: Tuesday, 21 February 2017 13:17
         >
         > To: user@accumulo.apache.org <mailto:user@accumulo.apache.org>
         > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
         >
         > Removing them is probably a bad idea. The root table entries correspond to
         > split points in the metadata table. There is no need for the tables which
         > existed when the metadata table split to still exist for this to continue to
         > act as a valid split point.
         >
         > Would need to see the exception stack trace, or at least an error message,
         > to troubleshoot the shell scanning error you saw.
         >
         >
         > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
         > <matt.dick...@defence.gov.au <mailto:matt.dick...@defence.gov.au>> wrote:
         >>
         >> UNOFFICIAL
         >>
         >> In case it is ok to remove these from the root table, how can I scan the
         >> root table for rows with a rowid starting with !0;1vm?
         >>
         >> Running "scan -b !0;1vm" throws an exception and exits the shell.
         >>
         >>
         >> -----Original Message-----
         >> From: Dickson, Matt MR [mailto:matt.dick...@defence.gov.au <mailto:matt.dick...@defence.gov.au>]
         >> Sent: Tuesday, 21 February 2017 09:30
         >> To: 'user@accumulo.apache.org <mailto:user@accumulo.apache.org>'
         >> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
         >>
         >> UNOFFICIAL
         >>
         >>
         >> Does that mean I should have entries for 1vm in the metadata table
         >> corresponding to the root table?
         >>
         >> We are running 1.6.5
         >>
         >>
         >> -----Original Message-----
         >> From: Josh Elser [mailto:josh.el...@gmail.com <mailto:josh.el...@gmail.com>]
         >> Sent: Tuesday, 21 February 2017 09:22
         >> To: user@accumulo.apache.org <mailto:user@accumulo.apache.org>
         >> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
         >>
         >> The root table should only reference the tablets in the metadata table.
         >> It's a hierarchy: like metadata is for the user tables, root is for the
         >> metadata table.
         >>
         >> What version are ya running, Matt?
         >>
         >> Dickson, Matt MR wrote:
         >> > *UNOFFICIAL*
         >> >
         >> > I have a situation where all tablet servers are progressively being
         >> > declared dead. From the logs, the tservers report errors like:
         >> >
         >> > 2017-02-.... DEBUG: Scan failed thrift error
         >> > org.apache.thrift.transport.TTransportException null
         >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
         >> >
         >> > 1vm was a table id that was deleted several months ago, so it appears
         >> > there is some invalid reference somewhere.
         >> >
         >> > Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.
         >> >
         >> > A scan of the accumulo.root table returns approximately 15 rows that
         >> > start with !0;1vm;<i/p addr>::2016103 blah
         >> >
         >> > How are the root table entries used, and would it be safe to remove
         >> > these entries since they reference a deleted table?
         >> >
         >> > Thanks in advance,
         >> > Matt
         >
         > --
         > Christopher
