https://issues.apache.org/jira/browse/ACCUMULO-2261
I'll make my comments over on the ticket. Thanks for reporting!

-Eric

On Mon, Jan 27, 2014 at 9:09 AM, Anthony F <[email protected]> wrote:

> Eric, I'm currently digging through the logs and will report back. Keep
> in mind, I'm using Accumulo 1.5.0 on a Hadoop 2.2.0 stack. To determine
> data loss, I have a 'ConsistencyCheckingIterator' that verifies each row
> has the expected data (it takes a long time to scan the whole table).
> Below is a quick summary of what happened. The tablet in question is
> "d;72~gcm~201304". Notice that it is assigned to
> 192.168.2.233:9997[343bc1fa155242c] at 2014-01-25 09:49:36,233. At
> 2014-01-25 09:49:54,141, the tserver goes away. Then, the tablet gets
> assigned to 192.168.2.223:9997[143bc1f14412432] and shortly after that,
> I see the BadLocationStateException. The master never recovers from the
> BLSE - I have to manually delete one of the offending locations.
>
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
>   tablet d;72~gcm~201304;72=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
>   tablet p;18~thm~2012101;18=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers
>   [192.168.2.233:9997[343bc1fa155242c]]
> 2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers:
>   [d;03~u36~201302;03~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;06~u36~2013;06~thm~2012083@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;25;24~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;25~u36~201303;25~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;27~gcm~2013041;27@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;30~u36~2013031;30~thm~2012082@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;34~thm;34~gcm~2013022@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;39~thm~20121;39~gcm~20130418@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;41~thm;41~gcm~2013041@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;42~u36~201304;42~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;45~thm~201208;45~gcm~201303@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;48~gcm~2013052;48@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;60~u36~2013021;60~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;68~gcm~2013041;68@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;72;71~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;72~gcm~201304;72@(192.168.2.233:9997[343bc1fa155242c],null,null),
>   d;75~thm~2012101;75~gcm~2013032@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;78;77~u36~201305@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;90~u36~2013032;90~thm~2012092@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;91~thm;91~gcm~201304@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   d;93~u36~2013012;93~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   m;20;19@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   m;38;37@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   m;51;50@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   m;60;59@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   m;92;91@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   o;01<@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   o;04;03@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   o;50;49@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   o;63;62@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   o;74;73@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   o;97;96@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;08~thm~20121;08@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;09~thm~20121;09@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;10;09~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;18~thm~2012101;18@(192.168.2.233:9997[343bc1fa155242c],null,null),
>   p;21;20~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;22~thm~2012091;22@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;23;22~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;41~thm~2012111;41@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;42;41~thm~2012111@(null,192.168.2.233:9997[343bc1fa155242c],null),
>   p;58~thm~201208;58@(null,192.168.2.233:9997[343bc1fa155242c],null)]...
> 2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning
>   tablet d;72~gcm~201304;72=192.168.2.223:9997[143bc1f14412432]
> 2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet
>   d;72~gcm~201304;72 was loaded on 192.168.2.223:9997
> 2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR:
>   java.lang.RuntimeException:
>   org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
>   found two locations for the same extent d;72~gcm~201304:
>   192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> java.lang.RuntimeException:
>   org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
>   found two locations for the same extent d;72~gcm~201304:
>   192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
>
>
> On Mon, Jan 27, 2014 at 8:53 AM, Eric Newton <[email protected]> wrote:
>
>> Having two "last" locations... is annoying, and useless. Having two
>> "loc" locations is disastrous. We do a *lot* of testing that verifies
>> that data is not lost, with live ingest and with bulk ingest, and just
>> about every other condition you can imagine. Presently, this testing is
>> being done by me for 1.6.0 on Hadoop 2.2.0 and ZK 3.4.5.
>>
>> If you can provide any of the following, it would be helpful:
>>
>> * an automated test case that demonstrates the problem
>> * logs that document what happened
>> * a description of the *exact* things you did to detect data loss
>>
>> Please don't use the approximate counts displayed on the monitor pages
>> to confirm ingest. These are known to be incorrect with both bulk
>> ingested data and right after splits. The data is there, but the counts
>> are just estimates.
>>
>> If you find you have verified data loss, please open a ticket, and
>> provide as many details as you can, even if it does not happen
>> consistently.
>>
>> Thanks!
>>
>> -Eric
>>
>>
>> On Mon, Jan 27, 2014 at 7:57 AM, Anthony F <[email protected]> wrote:
>>
>>> I took a look in the code . . . the stack trace is not quite the same.
>>> In 1.6.0, the fixed issue related to METADATA_LAST_LOCATION_COLUMN_FAMILY.
>>> The issue I am seeing (in 1.5.0) is related to
>>> METADATA_CURRENT_LOCATION_COLUMN_FAMILY (line 144).
>>>
>>>
>>> On Sun, Jan 26, 2014 at 7:00 PM, Anthony F <[email protected]> wrote:
>>>
>>>> The stack trace is pretty close and the steps to reproduce match the
>>>> scenario in which I observed the issue. But there's no fix (in Jira)
>>>> against 1.5.0, just 1.6.0.
>>>>
>>>>
>>>> On Sun, Jan 26, 2014 at 5:56 PM, Josh Elser <[email protected]> wrote:
>>>>
>>>>> Just because the error message is the same doesn't mean that the root
>>>>> cause is also the same.
>>>>>
>>>>> Without looking more into Eric's changes, I'm not sure if
>>>>> ACCUMULO-2057 would also affect 1.5.0. We're usually pretty good about
>>>>> checking backwards when bugs are found in newer versions, but things
>>>>> slip through the cracks, too.
>>>>>
>>>>>
>>>>> On 1/26/2014 5:09 PM, Anthony F wrote:
>>>>>
>>>>>> This is pretty much the issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-2057
>>>>>>
>>>>>> Slightly different error message, but it's a different version. Looks
>>>>>> like it's fixed in 1.6.0. I'll probably need to upgrade.
>>>>>>
>>>>>>
>>>>>> On Sun, Jan 26, 2014 at 4:47 PM, Anthony F <[email protected]> wrote:
>>>>>>
>>>>>> Thanks, I'll check Jira. As for versions: Hadoop 2.2.0, ZooKeeper
>>>>>> 3.4.5, CentOS 64-bit (kernel 2.6.32-431.el6.x86_64). Has much testing
>>>>>> been done using Hadoop 2.2.0? I tried Hadoop 2.0.0 (CDH 4.5.0) but ran
>>>>>> into HDFS-5225/5031, which basically makes it a non-starter.
>>>>>>
>>>>>>
>>>>>> On Sun, Jan 26, 2014 at 4:29 PM, Josh Elser <[email protected]> wrote:
>>>>>>
>>>>>> I meant to reply to your original email, but I didn't yet, sorry.
>>>>>>
>>>>>> First off, if Accumulo is reporting that it found multiple locations
>>>>>> for the same extent, this is a (very bad) bug in Accumulo. It might
>>>>>> be worth looking at tickets that are marked as "affects 1.5.0" and
>>>>>> "fixed in 1.5.1" on Jira. It's likely that we've already encountered
>>>>>> and fixed the issue, but, if you can't find a fix that was already
>>>>>> made, we don't want to overlook the potential need for one.
>>>>>>
>>>>>> For both "live" and "bulk" ingest, *neither* should lose any data.
>>>>>> This is one thing that Accumulo should never be doing. If you have
>>>>>> multiple locations for an extent, it seems plausible to me that you
>>>>>> would run into data loss. However, you should focus on trying to
>>>>>> determine why you keep running into multiple locations for a tablet.
>>>>>>
>>>>>> After you take a look at Jira, I would likely go ahead and file a
>>>>>> jira to track this since it's easier to follow than an email thread.
>>>>>> Be sure to note if there is anything notable about your installation
>>>>>> (did you download it directly from the accumulo.apache.org site)? You
>>>>>> should also include what OS and version and what Hadoop and ZooKeeper
>>>>>> versions you are running.
>>>>>>
>>>>>>
>>>>>> On 1/26/2014 4:10 PM, Anthony F wrote:
>>>>>>
>>>>>> I have observed a loss of data when tservers fail during bulk ingest.
>>>>>> The keys that are missing are right around the table's splits,
>>>>>> indicating that data was lost when a tserver died during a split. I
>>>>>> am using Accumulo 1.5.0. At around the same time, I observe the
>>>>>> master logging a message about "Found two locations for the same
>>>>>> extent". Can anyone shed light on this behavior? Are tserver failures
>>>>>> during bulk ingest supposed to be fault tolerant?
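
A side note on the manual cleanup Anthony describes ("I have to manually
delete one of the offending locations"): the sketch below is one way one
might locate tablets in that state with the 1.5.x Java client, not a tool
from this thread. It scans the "!METADATA" table's current-location ("loc")
column family and flags any tablet row that carries more than one entry.
The instance name, ZooKeeper quorum, credentials, and class name are
placeholders.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FindDuplicateTabletLocations {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- substitute your own.
        Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181")
                .getConnector("root", new PasswordToken("secret"));

        // "!METADATA" is the metadata table name in 1.5.x; "loc" is the
        // current-location column family the BadLocationStateException is about.
        Scanner scanner = conn.createScanner("!METADATA", new Authorizations());
        scanner.fetchColumnFamily(new Text("loc"));

        Text currentRow = null;
        int locCount = 0;
        for (Entry<Key, Value> entry : scanner) {
            Text row = entry.getKey().getRow();
            if (currentRow == null || !row.equals(currentRow)) {
                currentRow = new Text(row);
                locCount = 0;
            }
            locCount++;
            if (locCount > 1) {
                // column qualifier = tserver session id, value = host:port
                System.out.println("extra loc entry for tablet " + row + ": "
                        + entry.getKey().getColumnQualifier() + " = " + entry.getValue());
            }
        }
    }
}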

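Anthony's 'ConsistencyCheckingIterator' itself is not shown anywhere in the
thread. As a rough, hypothetical stand-in for that kind of verification, the
sketch below does a client-side pass that walks a table row by row and
reports rows missing an expected column qualifier. The table name, expected
qualifier, and class name are placeholders, and a real check would compare
against the full set of keys that were ingested (this pass cannot notice
rows that are missing entirely).

import java.util.Iterator;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.RowIterator;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class RowCompletenessCheck {

    // Returns the number of rows that exist but lack the expected qualifier.
    public static long check(Connector conn, String table, Text expectedQualifier)
            throws Exception {
        Scanner scanner = conn.createScanner(table, new Authorizations());
        RowIterator rows = new RowIterator(scanner);

        long incompleteRows = 0;
        while (rows.hasNext()) {
            Iterator<Entry<Key, Value>> row = rows.next();
            Text rowId = null;
            boolean found = false;
            while (row.hasNext()) {
                Entry<Key, Value> e = row.next();
                if (rowId == null) {
                    rowId = new Text(e.getKey().getRow());
                }
                if (expectedQualifier.equals(e.getKey().getColumnQualifier())) {
                    found = true;
                }
            }
            if (!found) {
                incompleteRows++;
                System.out.println("row " + rowId + " is missing " + expectedQualifier);
            }
        }
        return incompleteRows;
    }
}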