I would approach this problem by trying to find the common characteristics of the rows that are missing. A common pattern I've seen is rows missing at the end of a batch (meaning some issue with flushing the buffers). If the missing rows aren't in sequences, meaning one missing every few rows, and you're using a buffer, then that would mean that something strange (and possibly user-induced) is happening.
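As a rough sketch of that first check (the batch size, the tail window, and the helper below are assumptions for illustration, not anything from the thread), one way to see whether the missing row keys cluster at the tail of each client-side batch, which would point at an unflushed write buffer, versus being scattered:

```python
# Sketch: given sequence numbers of missing rows and an assumed client-side
# batch size, count how many misses fall in the last `tail` positions of
# their batch (suggesting a buffer-flush problem) versus elsewhere.

def classify_missing(missing_seqs, batch_size, tail=10):
    """Return (at_tail, scattered) counts for the missing sequence numbers."""
    at_tail = scattered = 0
    for seq in missing_seqs:
        if seq % batch_size >= batch_size - tail:
            at_tail += 1
        else:
            scattered += 1
    return at_tail, scattered

# Misses at positions 998, 999, 1998 of 1000-row batches look like a flush
# problem; a miss at position 512 does not.
print(classify_missing([998, 999, 1998, 512], batch_size=1000))  # -> (3, 1)
```

If most misses land at batch tails, the write buffer (e.g. a disabled auto-flush without a final flush) is the first suspect; a scattered pattern points elsewhere.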
You could also try to find what happened to a single row. Track when it
was inserted, which region got it, and then what happened to it.

J-D

On Mon, Apr 4, 2011 at 10:27 AM, Chris Tarnas <[email protected]> wrote:
> Hi JD,
>
> Sorry for taking a while - I was traveling. Thank you very much for
> looking through these.
>
> See answers below:
>
> On Apr 1, 2011, at 11:19 AM, Jean-Daniel Cryans wrote:
>
>> Thanks for taking the time to upload all those logs, I really appreciate it.
>>
>> So from the looks of it, only 1 region wasn't able to split during the
>> time of those logs and it successfully rolled back. At first I thought
>> the data could have been deleted in the parent region, but we don't do
>> that in the region server (it's the master that's responsible for that
>> deletion), meaning that you couldn't lose data.
>>
>> Which makes me think, those rows that are missing... are they part of
>> that region or are they also in other regions? If it's the latter, then
>> maybe this is just a red herring.
>>
>
> I think you are correct that that was a red herring.
>
>> You say that you insert into two different families at different row
>> keys. IIUC that means you would insert row A in family f1 and row B in
>> family f2, and so on. And you say only one of the rows is there... I
>> guess you don't really mean that you were inserting into 2 rows for 11
>> hours and one of them was missing, right? More like, all the data in
>> one family was missing for those 11B rows? Is that right?
>>
>
> It's a little more complicated than that. I have multiple families; one of the
> families is an index where the rowkey is an index to the rest of the data
> in the other column families. Over the process of loading some test data I
> have noticed that 0.05% of the indexes point to missing rows.
> I'm going back
> to ruling out application errors now just to be sure, but so far I have only
> noticed this with very large loads with more than 100M rows of data and
> another ~800M rows of indexes.
>
> I've grepped all of the logs (thrift, datanode, regionserver) during the time
> of the most recent load, and the only ERRORs were found in the datanode logs
> and were either the attempts to delete already-deleted blocks in the
> datanodes that I mentioned in my first email, or ones like this:
>
> 2011-04-04 05:46:43,805 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.56.24.20:50010,
> storageID=DS-122374912-10.56.24.20-50010-1297226452541, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-2233766441053341849_1392526 is not valid.
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:981)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:944)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:954)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:206)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)
>
> There were quite a few WARNs, mostly related to flushing and taking a long
> time to write to the edit logs (> 3000ms).
>
> I'm going to see if there are some edge cases in our indexing and loading
> modules that slipped through earlier testing for now, but if you have any
> other pointers, that would be great.
>
> thanks,
> -chris
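The index-to-data check described above can be sketched as follows. This is a hedged illustration only: plain dicts stand in for the HBase tables, and the `find_dangling` helper and all key names are hypothetical. Against a real cluster this would be a scan over the index family plus a Get per referenced row key, which is how one would measure the 0.05% dangling-index rate mentioned in the thread.

```python
# Sketch: for each index entry, verify that the data row it points to
# actually exists. Dicts stand in for the index and data column families.

def find_dangling(index_rows, data_rows):
    """Return the index keys whose referenced data row key is missing."""
    return [idx_key for idx_key, data_key in index_rows.items()
            if data_key not in data_rows]

# Hypothetical example: idx2 points at rowB, which was never persisted.
index = {"idx1": "rowA", "idx2": "rowB", "idx3": "rowC"}
data = {"rowA": "payload", "rowC": "payload"}
print(find_dangling(index, data))  # -> ['idx2']
```

Running this over a full load gives the concrete list of dangling keys, which can then be cross-referenced against insert timestamps and region assignments to track what happened to each missing row, as suggested at the top of the thread.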
