Denis,

After our tests, it's highly unlikely that this problem is in the server. If you can provide a test client that has replicated the problem even once, please attach it to the ticket. Otherwise, we'll close the ticket unless someone else can reproduce the problem.

-Eric

On Fri, Feb 20, 2015 at 1:46 PM, Keith Turner <[email protected]> wrote:

> I updated ACCUMULO-3603 w/ details about an experiment I ran.
>
> On Wed, Feb 18, 2015 at 9:44 PM, Eric Newton <[email protected]>
> wrote:
>
>> https://issues.apache.org/jira/browse/ACCUMULO-3603
>>
>> -Eric
>>
>>
>> On Wed, Feb 18, 2015 at 7:12 PM, Denis <[email protected]> wrote:
>>
>>> On 2/18/15, Christopher <[email protected]> wrote:
>>>
>>> > To rule out some scenarios, is it possible that your clients are
>>> > writing to the wrong tables?
>>> That was the first idea, so I added assert()'s to the code of the
>>> writers few days ago. No assert was triggered, but some invalid values
>>> appear after new tserver failure.
>>>
>>> > Have you ever seen a failure affecting a table which does
>>> > not exist (like what might happen if there's an off-by-one error in
>>> > the WAL code)? Or affecting the metadata tables?
>>> No.
>>> Also, no tables were created or deleted during last two months.
>>>
>>> > Can you reproduce this error reliably, or can you share the relevant
>>> > ingest code which can reproduce this failure?
>>>
>>> I will think how to reproduce it.
>>> What could be special about the code: inserts are performed to few
>>> (5..8) tables at once (one data table + few index tables) but no
>>> MultiTableBatchWriter is used. Few BatchWriter`s (one per table) are
>>> created and flushed consequentially, in the same thread. For Accumulo
>>> 1.4 it was a performance optimization, if worked faster than
>>> MultiTableBatchWriter. Not sure if it is so for 1.6.1, this code was
>>> not changed after migration to 1.6.1.
>>> In all cases with invalid values the index tables were affected (one
>>> of the index table had values typical for another of the index
>>> tables).
>>>
>>> > Also, what kind of tablet server failures are you experiencing when
>>> > this happens?
>>> Spontaneous power-offs. There is something wrong with the power units
>>> so every 2-3 days one of the servers suddenly turns off and reboots.
