> If you can provide a test client that has ever replicated the problem,
> please attach it to the ticket.

I have seen it 3 times within a month, so I do not know how to
reproduce it reliably. Perhaps I should back up the walogs next time
and look into them.

> Is this the exact same cluster or is it just the same code you were using?

Same code, another cluster.

> Did you have walogs laying around when you upgraded?

In the 1.4 cluster (and at first in the 1.6 cluster) I had walogs
enabled for data tables and disabled for index tables. There was a bug
in 1.4: if a tablet had an empty walog, there were some startup issues
(the tablet remained offline or something like this), and it happened
often with index tables (hmm, the same tables I have this problem
with). So, in the 1.4 cluster I disabled walogs and ran a full reindex
periodically. After running the 1.6 cluster for some time I enabled
walogs for all tables, as the new cluster has less reliable hardware,
which reboots from time to time.

> Did you upgrade through 1.5 or straight from 1.4 to 1.6?

From 1.4 to 1.6. But it was not an upgrade; it was a copy of the .rf
files to a new cluster, followed by importdirectory.

On 2/20/15, John Vines <[email protected]> wrote:
> You said that you were operating this on 1.4. Is this the exact same
> cluster or is it just the same code you were using? Did you have walogs
> laying around when you upgraded? Did you upgrade through 1.5 or straight
> from 1.4 to 1.6?
>
> On Fri, Feb 20, 2015 at 1:46 PM, Keith Turner <[email protected]> wrote:
>
>> I updated ACCUMULO-3603 w/ details about an experiment I ran.
>>
>> On Wed, Feb 18, 2015 at 9:44 PM, Eric Newton <[email protected]>
>> wrote:
>>
>>> https://issues.apache.org/jira/browse/ACCUMULO-3603
>>>
>>> -Eric
>>>
>>>
>>> On Wed, Feb 18, 2015 at 7:12 PM, Denis <[email protected]> wrote:
>>>
>>>> On 2/18/15, Christopher <[email protected]> wrote:
>>>>
>>>> > To rule out some scenarios, is it possible that your clients are writing to
>>>> > the wrong tables?
>>>> That was the first idea, so I added assert()'s to the code of the
>>>> writers few days ago.
>>>> No assert was triggered, but some invalid values
>>>> appear after a new tserver failure.
>>>>
>>>> > Have you ever seen a failure affecting a table which does
>>>> > not exist (like what might happen if there's an off-by-one error in the WAL
>>>> > code)? Or affecting the metadata tables?
>>>> No.
>>>> Also, no tables were created or deleted during the last two months.
>>>>
>>>> > Can you reproduce this error reliably, or can you share the relevant ingest
>>>> > code which can reproduce this failure?
>>>>
>>>> I will think how to reproduce it.
>>>> What could be special about the code: inserts are performed to a few
>>>> (5..8) tables at once (one data table + a few index tables) but no
>>>> MultiTableBatchWriter is used. A few BatchWriters (one per table) are
>>>> created and flushed consecutively, in the same thread. For Accumulo
>>>> 1.4 it was a performance optimization; it worked faster than
>>>> MultiTableBatchWriter. Not sure if it is so for 1.6.1; this code was
>>>> not changed after migration to 1.6.1.
>>>> In all cases with invalid values the index tables were affected (one
>>>> of the index tables had values typical for another of the index
>>>> tables).
>>>>
>>>> > Also, what kind of tablet server failures are you experiencing when
>>>> > this happens?
>>>> Spontaneous power-offs. There is something wrong with the power units
>>>> so every 2-3 days one of the servers suddenly turns off and reboots.
>>>>
>>>
>>>
>>
>
