Brennon: Can you try hbck to see if the problem is repaired? Thanks
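A minimal sketch of the hbck run being suggested, assuming HBase 0.92's bin/hbase script is on the PATH; the -fix option in hbck of that era only attempts to repair region assignment problems, so treat this as a diagnostic starting point rather than a guaranteed repair:

    # (assumes bin/hbase from the 0.92 install is on the PATH)
    hbase hbck              # report table/region inconsistencies, changes nothing
    hbase hbck -details     # same report, with per-region detail
    hbase hbck -fix         # attempt to fix the assignment inconsistencies it finds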
On Fri, Apr 12, 2013 at 9:27 AM, ramkrishna vasudevan <[email protected]> wrote:

> Oh, sorry to hear that. But I think the data should still be there in the
> system, just not accessible. We should be able to bring it back.
>
> One set of logs that would be of interest is the RS and master logs from
> when the split happened.
>
> And the main thing would be what happened when you restarted your cluster
> and the Master came back up. That is where the system does some self
> rectification after it checks whether there were some partial splits.
>
> Regards
> Ram
>
>
> On Fri, Apr 12, 2013 at 9:34 PM, Brennon Church <[email protected]> wrote:
>
> > Hello,
> >
> > We lost the data when the parent regions got reopened. My guess, and it's
> > only that, is that the regions were essentially empty when they started up
> > again in these cases. We definitely lost data from the tables.
> >
> > I've looked through the hdfs and hbase logs and can't find any obvious
> > difference between a successful split and these failed ones. All steps
> > show up the same in all cases. After the handled split message that listed
> > the parent and daughter regions, the next reference is to the parent
> > regions once again as hbase is started back up after the failure. No
> > further reference to the daughters is made.
> >
> > I couldn't cleanly shut several of the regionservers down, so they were
> > abruptly killed, yes.
> >
> > HBase version is 0.92.0, and hadoop is 1.0.1.
> >
> > Thanks.
> >
> > --Brennon
> >
> >
> > On 4/11/13 10:58 PM, ramkrishna vasudevan wrote:
> >
> >> When you say that the parent regions got reopened, does that mean that
> >> you did not lose any data (or that the data could not be read)? The
> >> reason I am asking is that if, after the parent got split into daughters
> >> and the data was written to the daughters, the daughter-related files
> >> could not be opened, you could have ended up unable to read the data.
> >>
> >> Some logs could tell us what made the parent get reopened rather than
> >> the daughters. Another thing I would like to ask: was the cluster brought
> >> down abruptly by killing the RS?
> >>
> >> Which version of HBase?
> >>
> >> Regards
> >> Ram
> >>
> >>
> >> On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <[email protected]> wrote:
> >>
> >>> Hello,
> >>>
> >>> I had an interesting problem come up recently. We have a few thousand
> >>> regions across 8 datanode/regionservers. I made a change, increasing the
> >>> heap size for hadoop from 128M to 2048M, which ended up bringing the
> >>> cluster to a complete halt after about 1 hour. I reverted back to 128M
> >>> and turned things back on again, but didn't realize at the time that I
> >>> came up with 9 fewer regions than I started with. Upon further
> >>> investigation, I found that all 9 missing regions were from splits that
> >>> occurred while the cluster was running, after making the heap change and
> >>> before it came to a halt. There was a 10th region (5 splits were involved
> >>> in total) that managed to get recovered. The really odd thing is that in
> >>> the case of the other 9 regions, the original parent regions, which as
> >>> far as I can tell in the logs were deleted, were re-opened upon
> >>> restarting things once again. The daughter regions were gone.
> >>> Interestingly, I found the orphaned data blocks still intact, and in at
> >>> least some cases have been able to extract the data from them and will
> >>> hopefully re-add it to the tables.
> >>>
> >>> My question is this. Does anyone know, based on the rather muddled
> >>> description I've given above, what could have possibly happened here? My
> >>> best guess is that the bad state that hdfs was in caused some critical
> >>> component of the split process to be missed, which resulted in a
> >>> reference to the parent regions sticking around and losing the references
> >>> to the daughter regions.
> >>>
> >>> Thanks for any insight you can provide.
> >>>
> >>> --Brennon
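For the orphaned files Brennon mentions above, one possible way to get the extracted data back into the tables is HBase's own HFile tool and bulk loader. This is only a sketch under assumptions the thread does not confirm: that the recovered files are intact HFiles, that they have been copied into an HDFS directory laid out as /recovered/<column-family>/<hfile>, and that the paths, family name, and table name used here are made up for illustration.

    # paths, family and table names below are illustrative only
    # print an HFile's metadata (-m) and key/values (-p) to sanity-check its contents
    hbase org.apache.hadoop.hbase.io.hfile.HFile -m -p -f hdfs:///recovered/cf1/somehfile

    # bulk-load the recovered files into the live table (completebulkload)
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs:///recovered mytable

If the files turn out not to be clean HFiles, the fallback would be reading whatever rows are salvageable and re-inserting them through the normal client Put API.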
