Replication has been running for about a week now, and a lot of data has
flowed to our slave cluster in that time. I'm now running the verifyrep MR
job over a 1-hour window from a couple of days ago (which should be fully
replicated), and I'm seeing a small number of "BADROWS". Spot-checking a
few of them, the rows are present on both clusters and have the same
values, but a single cell in the row is off by 1 ms.

For instance, the log reports this error:
java.lang.Exception: This result was different:
keyvalues={01e581745c6a43aba01adf105af4e4a92013071015/data:!\xDF\xE0\x01/1373470622986/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:&s\xC0\x01/1373470923084/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:+\x07\xA0\x01/1373471223717/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:/\x9B\x80\x01/1373471523316/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:4/`\x01/1373471822913/Put/vlen=8}
compared to
keyvalues={01e581745c6a43aba01adf105af4e4a92013071015/data:!\xDF\xE0\x01/1373470622986/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:&s\xC0\x01/1373470923084/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:+\x07\xA0\x01/1373471223716/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:/\x9B\x80\x01/1373471523316/Put/vlen=8,
01e581745c6a43aba01adf105af4e4a92013071015/data:4/`\x01/1373471822913/Put/vlen=8}

Diffing the two results narrows the difference down to a single cell:
01e581745c6a43aba01adf105af4e4a92013071015/data:+\x07\xA0\x01/1373471223717/Put/vlen=8
compared to
01e581745c6a43aba01adf105af4e4a92013071015/data:+\x07\xA0\x01/1373471223716/Put/vlen=8

I'm assuming that the value before "/Put" is the cell's timestamp, which
means that the copies are off by 1ms.
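
For reference, here is a minimal Python sketch of that comparison. It assumes each cell in the log prints as row/family:qualifier/timestamp/type/vlen=N, so that the timestamp is the third field from the right; that is my reading of the output, not something I have confirmed. The helper names parse_cell and timestamp_skews are made up for illustration, and I don't know which of the two results comes from which cluster, so they are just called first and second.

```python
def parse_cell(cell):
    """Split 'row/family:qualifier/timestamp/type/vlen=N' from the right.

    Qualifiers can themselves contain '/', so everything left of the
    last three '/'-separated fields is kept as one coordinate string.
    """
    coords, ts, kind, vlen = cell.strip().rstrip(",.").rsplit("/", 3)
    return coords, int(ts), kind, vlen


def timestamp_skews(first_cells, second_cells):
    """Return (coords, delta_ms) for cells present in both lists whose
    assumed timestamp field differs."""
    second = {}
    for cell in second_cells:
        coords, ts, _, _ = parse_cell(cell)
        second[coords] = ts
    skews = []
    for cell in first_cells:
        coords, ts, _, _ = parse_cell(cell)
        if coords in second and second[coords] != ts:
            skews.append((coords, ts - second[coords]))
    return skews


# Two cells from each side of the logged row, copied as raw text.
first = [
    r"01e581745c6a43aba01adf105af4e4a92013071015/data:+\x07\xA0\x01/1373471223717/Put/vlen=8",
    r"01e581745c6a43aba01adf105af4e4a92013071015/data:/\x9B\x80\x01/1373471523316/Put/vlen=8",
]
second = [
    r"01e581745c6a43aba01adf105af4e4a92013071015/data:+\x07\xA0\x01/1373471223716/Put/vlen=8",
    r"01e581745c6a43aba01adf105af4e4a92013071015/data:/\x9B\x80\x01/1373471523316/Put/vlen=8",
]

for coords, delta_ms in timestamp_skews(first, second):
    print(coords, "differs by", delta_ms, "ms")
```

Splitting from the right is deliberate: several of the qualifiers above contain a "/" of their own, so splitting from the left would misplace the timestamp field.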

Any idea what could cause this? So far (the job is still running), the
problem seems rare (about 0.05% of rows).

Thanks,
Patrick
