Great! I see the JIRA you just opened. I'll enable DEBUG logging for FlushSnapshotSubprocedure and HRegion on the region servers in the cluster to see if I can capture log messages as better evidence. Since it's a timing issue I'm not sure when we'll see it again, but I'll keep an eye out.
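For reference, these are the log4j.properties lines I'm planning to drop onto the region servers (assuming I have the package names right, i.e. the snapshot subprocedure under org.apache.hadoop.hbase.regionserver.snapshot and HRegion under org.apache.hadoop.hbase.regionserver; please correct me if either lives elsewhere):

    # Turn on DEBUG for the snapshot flush subprocedure and for HRegion
    log4j.logger.org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure=DEBUG
    log4j.logger.org.apache.hadoop.hbase.regionserver.HRegion=DEBUG

I've also put a rough sketch of the workaround we have in mind at the very bottom of this mail, below the quoted thread, in case it's useful context for the JIRA.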
Thanks so much for your help,

--Jacob

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Wednesday, May 24, 2017 11:29 AM
To: [email protected]
Subject: Re: Missing data in snapshot - possible flush timing issue?

In FlushSnapshotSubprocedure (running on the region server), there is a debug log:

    LOG.debug("... Flush Snapshotting region " + region.toString() + " completed.");

If you enable debug logging, we would know whether the underlying region is considered to have completed the flush.

Higher up in the call() method there is this:

    region.flush(true);

The return value is not checked. In HRegion#flushcache(), Result.CANNOT_FLUSH may be returned due to:

    String msg = "Not flushing since " + (writestate.flushing ? "already flushing" : "writes not enabled");

The above seems to correlate with your description.

Let me log a JIRA referring to this thread.

On Wed, May 24, 2017 at 8:08 AM, LeBlanc, Jacob <[email protected]> wrote:

> Thanks for looking, Ted!
>
> My understanding of the log messages is that the last line of the
> pastebin is the end of the flush of the memstore for the region where
> we missed data, but that line is tagged with "[MemStoreFlusher.1]",
> whereas the other regions that were getting flushed as part of the
> snapshot are tagged with "[rs(
> a1-qa-hbr31416d.lab.lynx-connected.com,16020,1494432106955)-snapshot-pool81-thread-1]".
> With only a superficial understanding, it seems like the flush of the
> region whose messages were tagged with "[MemStoreFlusher.1]", while
> happening at the same time, wasn't really part of the snapshot
> process. For example, line 3 in the pastebin shows the flush of one
> region starting, tagged with snapshot-pool81-thread-1; line 4 shows
> the flush starting for the region where we missed data, tagged with
> MemStoreFlusher.1; and line 5 continues with the flush of the region
> as part of the snapshot. So it definitely looks like multiple flushes
> were occurring at the same time, whereas elsewhere in the logs the
> flushes are always done sequentially as part of the snapshot. So I
> came to the theory that perhaps there is a timing issue where the
> flushed data for a region is missed as part of a snapshot because the
> flush is occurring on another thread as part of the normal, periodic
> flushing of the memstore.
>
> The last line I see in the full region server log that has anything to
> do with the snapshot is line 11 in the pastebin at 2017-05-12
> 02:06:05,577, where it's processing events from ZooKeeper. Again, with
> only a superficial understanding, I was assuming this had something to
> do with the master signaling that the snapshot was complete. We'll be
> sure to capture the master log next time.
>
> And thanks for also checking JIRA for me. If there is a bug here, it
> seems as though we don't have an option to upgrade to fix it, and we'll
> have to plan on coding around it for now.
>
> Thanks,
>
> --Jacob
>
> -----Original Message-----
> From: Ted Yu [mailto:[email protected]]
> Sent: Wednesday, May 24, 2017 8:47 AM
> To: [email protected]
> Subject: Re: Missing data in snapshot - possible flush timing issue?
>
> bq. the snapshot finishes before the flush of that last region
> finishes
>
> According to the last line in the pastebin, the flush finished at
> 2017-05-12 02:06:06,063.
> Did you find something in the master log which indicated that the
> snapshot finished before the above time?
>
> I went thru snapshot bug fixes in branch-1.1 backward until seeing
> HBASE-15430, which was in 1.1.4. Among more recent snapshot fixes,
> there was none matching your description.
>
> Next time this happens to your cluster, can you capture the master log
> (hopefully with DEBUG logging)? We would get more clues from the
> master log.
>
> Thanks
>
> On Wed, May 24, 2017 at 1:21 AM, LeBlanc, Jacob <[email protected]> wrote:
>
> > Hi all,
> >
> > We've recently been testing a Spark job that takes a snapshot of an
> > HBase table and then creates an RDD using the TableSnapshotInputFormat
> > class, and we've run into an interesting issue. It appears that the
> > snapshot, and therefore the resulting RDD, was missing some data that
> > was recently written to the table. After some digging, it looks like
> > the missing data was limited to a particular region, as if the
> > memstore for the region failed to flush before the snapshot was taken.
> >
> > For our testing the job was scheduled in Oozie to run hourly, and we
> > are missing data from the region for exactly the last hour,
> > indicating that when the job ran at 1:05 and took the snapshot
> > everything was flushed and worked fine, but at 2:05 when the job ran
> > we missed everything written to the region since the flush at 1:05.
> >
> > Here is an excerpt from the region server at that time:
> >
> > https://pastebin.com/1ECXjhRp
> >
> > Of the three regions for the table owned by this region server, two
> > of them have messages from "snapshot-pool81-thread-1" saying it's
> > flushing the memstore. But at "2017-05-12 02:06:03,229", before the
> > flush of the region with rowkey starting with ";uP\xE1" finishes,
> > there is a message from "MemStoreFlusher.1" saying it's starting the
> > flush of the memstore for the region with rowkey starting with
> > "\xC0\x8F". Then it appears that the snapshot finishes before the
> > flush of that last region finishes, and that is the exact region from
> > which we are missing data.
> >
> > So it kind of looks like some timing issue where a normal scheduled
> > flush of the region's memstore kicked in on one thread while the
> > regions were being flushed for the snapshot by another thread,
> > causing us to miss the flushed data for that one region in the
> > snapshot. Out of something like 50-100 times this job was run, we saw
> > this occur twice.
> >
> > We are running HBase 1.1.4. Is this a known problem? I've tried
> > looking around JIRA but didn't find anything obvious. Can anyone
> > familiar with the code see how this might happen?
> >
> > We can code defensively against this, but we wanted to understand
> > what we are dealing with.
> >
> > BTW, I don't have the logs from the master at that time as they've
> > been rolled since then.
> >
> > Thanks!
> >
> > --Jacob LeBlanc
> >
>
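P.S. Since I mentioned "coding around it for now": the plan is roughly to force a flush of the table ourselves from the job and only take the snapshot once that has had time to complete, rather than relying on the per-region flushes inside the snapshot procedure. A rough sketch against the 1.1 client API follows; the table and snapshot names are made up, and the fixed sleep is just a stand-in for a real wait/retry, since Admin.flush() is asynchronous:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class SnapshotWithExplicitFlush {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          TableName table = TableName.valueOf("my_table");   // made-up table name
          // Ask every region of the table to flush its memstore before snapshotting.
          admin.flush(table);
          // flush() can return before the flushes finish, so give them time;
          // the real job would poll or retry instead of sleeping blindly.
          Thread.sleep(30000L);
          // Take the snapshot that the Spark job will read via TableSnapshotInputFormat.
          admin.snapshot("my_table-hourly", table);          // made-up snapshot name
        }
      }
    }

This obviously doesn't remove the race you described, it just shrinks the window, so we'd still love to see the flush result checked in FlushSnapshotSubprocedure as part of the JIRA.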
