I would approach this problem by trying to find the common
characteristics of the rows that are missing. A common pattern I've
seen is rows missing at the end of a batch (meaning some issue with
flushing the buffers). If the missing rows aren't in sequences,
meaning one is missing every few rows, and you're using a write
buffer, then that would mean that something strange (and possibly
user induced) is happening.
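
For example, if you're filling the client-side write buffer yourself,
anything still sitting in the buffer when the loader dies never
reaches a region server. A minimal sketch against the 0.90-era client
API (the table name and row keys here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical table

    // Puts accumulate client-side until the write buffer fills;
    // nothing below reaches a region server until a flush happens.
    table.setAutoFlush(false);

    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f1"), Bytes.toBytes("q"),
              Bytes.toBytes("v" + i));
      table.put(put);
    }

    // If the loader exits or crashes before this line, the tail of
    // the batch (everything still buffered) is silently lost.
    table.flushCommits();
    table.close();
  }
}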

You could also try to find what happened to a single row. Track when
it was inserted, which region got it, and then what happened to it.
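
A rough sketch of that against the 0.90 client API (table name and
row key are placeholders): look up which region, and therefore which
region server's logs, the key maps to, then check whether anything is
there at all.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class FindRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");    // placeholder
    byte[] row = Bytes.toBytes("missing-row-key"); // placeholder

    // Which region hosts this key right now?
    HRegionLocation loc = table.getRegionLocation(row);
    System.out.println("region: "
        + loc.getRegionInfo().getRegionNameAsString());
    System.out.println("server: " + loc.getServerAddress());

    // Is there anything at all under that key?
    System.out.println("exists: " + table.exists(new Get(row)));
  }
}

From there you can grep that region server's logs around the time the
row was inserted.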

J-D

On Mon, Apr 4, 2011 at 10:27 AM, Chris Tarnas <[email protected]> wrote:
> Hi JD,
>
> Sorry for taking a while - I was traveling. Thank you very much for
> looking through these.
>
> See answers below:
>
> On Apr 1, 2011, at 11:19 AM, Jean-Daniel Cryans wrote:
>
>> Thanks for taking the time to upload all those logs, I really appreciate it.
>>
>> So from the looks of it, only 1 region wasn't able to split during the
>> time of those logs, and it successfully rolled back. At first I thought
>> the data could have been deleted in the parent region, but we don't do
>> that in the region server (it's the master that's responsible for that
>> deletion), meaning that you couldn't lose data that way.
>>
>> Which makes me think, those rows that are missing... are they part of
>> that region, or are they also in other regions? If it's the latter, then
>> maybe this is just a red herring.
>>
>
> I think you are correct that that was a red herring.
>
>> You say that you insert into two different families at different row
>> keys. IIUC that means you would insert row A in family f1 and row B in
>> family f2, and so on. And you say only one of the rows is there... I
>> guess you don't really mean that you were inserting into 2 rows for 11
>> hours and one of them was missing, right? More like, all the data in
>> one family was missing for those 11B rows? Is that right?
>>
>
> It's a little more complicated than that. I have multiple families; one of
> the families is an index, where the rowkey is an index to the rest of the
> data in the other column families. Over the process of loading some test
> data I have noticed that 0.05% of the indexes point to missing rows. I'm
> going back to ruling out application errors now just to be sure, but so far
> I have only noticed this with very large loads of more than 100M rows of
> data and another ~800M rows of indexes.
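>
> The check that turns up that 0.05% is roughly this (a simplified
> sketch; the table and family names here are placeholders, not our
> real schema): scan the index family and probe each rowkey it points
> at.
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.KeyValue;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class CheckIndexes {
>   public static void main(String[] args) throws IOException {
>     Configuration conf = HBaseConfiguration.create();
>     HTable table = new HTable(conf, "mytable"); // placeholder
>
>     // Scan only the index family; each cell value is the rowkey of
>     // a data row in the other families.
>     Scan scan = new Scan();
>     scan.addFamily(Bytes.toBytes("index")); // placeholder family
>
>     long total = 0, missing = 0;
>     ResultScanner scanner = table.getScanner(scan);
>     for (Result r : scanner) {
>       for (KeyValue kv : r.raw()) {
>         byte[] dataRow = kv.getValue(); // rowkey the index points at
>         total++;
>         if (!table.exists(new Get(dataRow))) {
>           missing++;
>           System.out.println("dangling: "
>               + Bytes.toStringBinary(dataRow));
>         }
>       }
>     }
>     scanner.close();
>     System.out.println(missing + " / " + total
>         + " index entries point to missing rows");
>   }
> }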
>
> I've grepped all of the logs (thrift, datanode, regionserver) during the
> time of the most recent load, and the only ERRORs were found in the
> datanode logs. They were either the attempts to delete already-deleted
> blocks in the datanodes that I mentioned in my first email, or ones like
> this:
>
> 2011-04-04 05:46:43,805 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.24.20:50010, storageID=DS-122374912-10.56.24.20-50010-1297226452541, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-2233766441053341849_1392526 is not valid.
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:981)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:944)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:954)
>        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:206)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)
>
> There were quite a few WARNs, mostly related to flushing and taking a long 
> time to write to the edit logs (> 3000ms).
>
> I'm going to see if there are any edge cases in our indexing and loading
> modules that slipped through earlier testing, but if you have any other
> pointers that would be great.
>
> thanks,
> -chris
>
>
>
