Just in the interest of stating the obvious: Don't write a tool that scans 
through all the data in MySQL (or HBase) and then looks up each individual row 
(or even batches of rows) in the other store. That is very inefficient if you 
have a lot of data.

Do it like a merge-join instead: Get sorted results from MySQL such that they 
sort in the same order as the HBase key. Then read through those results and 
the HBase scan at the same time, advancing both sides together. That way you 
need only a single scan on both sides (per table).

-- Lars



________________________________
 From: tobe <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Tuesday, August 12, 2014 7:01 AM
Subject: Re: How to verify data from MySQL and HBase
 

Thank @JM for the detailed explanation.

I totally agree with you and we're developing an internal tool to do it.
It's not so general because we have to write the sql and generate the row
key manually. But it works and easy to understand. I would like to share
with anybody who also needs it.






On Tue, Aug 12, 2014 at 8:45 PM, Jean-Marc Spaggiari <
[email protected]> wrote:

> Hi Tobe,
>
> Thing is, your data in HBase might be organize very differently than in
> MySQL. Have you denormlized some of it? Have you used some Avro containers
> into an HBase cell? Have you do any cleanup? Or enrichment? At the end, it
> my be very different that what is stored into MySQL. There is not any easy
> process to validate that migration as been done correctly between the two
> databases because tools don't know the transformations you applied. I think
> you will have to build someone internally which will do this validation
> because not such tool exist today.
>
> You can use the row counters to count the rows, but will that mean content
> is correct? No. Calculate a CRC on the cells value? Not even, because you
> might have denormalized.
>
> Sorry, but you will have to do some coding here I think.
>
> JM
>
> JM
>
>
> 2014-08-12 3:49 GMT-04:00 tobe <[email protected]>:
>
> > Thanks for replaying. @Serega
> >
> > I know MySQL and HBase are reliable. What I want to validate is the date
> in
> > both MySQL and HBase. The upper-stream application and unexpected
> operation
> > may make it inconsistent.
> >
> > I'm also wondering could sqoop validate the values from each database?
> > There's RowCountValidator but it's not suitable for us.
> >
> >
> > On Tue, Aug 12, 2014 at 3:23 PM, Serega Sheypak <
> [email protected]>
> > wrote:
> >
> > > you should design resilient ETL-processes. Also introduce post-ETL
> > checks.
> > > There is no need to test MySQL or HBase. They are already tested.
> > > See this:
> > >
> >
> http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
> > > Pretty old, but gives basic ideas. Nothing changed from that time.
> > >
> > >
> > > 2014-08-12 7:55 GMT+04:00 tobe <[email protected]>:
> > >
> > > > Most of our users migrated their data form MySQL to HBase. Before
> they
> > > > totally trust HBase, they use MySQL and HBase at the same time.
> > Sometimes
> > > > the data is inconsistent because they use it incorrectly or maybe
> > > there're
> > > > bugs of HBase. Anyway, we have to make sure the data from MySQL and
> > HBase
> > > > is consistent.
> > > >
> > > > So how can we do that? Write a simple script or is there any general
> > > > method?
> > > >
> > >
> >
>

Reply via email to