Hi Colin,

Does your table contain some really large rows? I ran into some errors copying a table whose rows have 400K columns. I have not verified the copied content myself, but I was shocked to hear that you are missing data with CopyTable.
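In case it helps to compare notes, the invocation I have been testing looks roughly like the following (the ZooKeeper quorum here is a placeholder; running the class with no arguments prints the options your version supports):

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=zk1,zk2,zk3:2181:/hbase ADMd5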
On Wed, Aug 13, 2014 at 9:00 AM, Colin Kincaid Williams <disc...@uw.edu> wrote:

> It appears that there is a bug in the copytable operation. We are missing a
> large amount of data after copying between clusters. I don't know if I can
> provide a sample data set, but I can try to dig up some details. One of our
> developers rewrote the operation using another library, and is testing his
> copy now.
>
> On Sun, Aug 10, 2014 at 11:24 PM, anil gupta <anilgupt...@gmail.com> wrote:
>
> > Hi Colin,
> >
> > We also hit the scenario where, after copying table "A" from cluster 1 to
> > cluster 2, the size of the HDFS files differed between the clusters. We
> > too assumed the sizes should be equal, hence we ran the verifyRep job.
> > I don't know the reason behind this discrepancy, but I wanted to share it
> > so that you know you are not the only one facing it.
> >
> > ~Anil
> >
> > On Sun, Aug 10, 2014 at 2:21 PM, Colin Kincaid Williams <disc...@uw.edu> wrote:
> >
> > > By the way, I have copied the table across clusters, with the tables
> > > configured the same. The source cluster has an underlying ext2
> > > filesystem, while the dest cluster has an underlying ext4 filesystem.
> > > The counts are the same for the tables. Would the filesystem account for
> > > the difference in directory size?
> > >
> > > [root@clusterA_ext2 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
> > > dus: DEPRECATED: Please use 'du -s' instead.
> > > 225.9g /a_d
> > >
> > > [root@clusterB_ext4 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
> > > dus: DEPRECATED: Please use 'du -s' instead.
> > > 172.8g /a_d
> > >
> > > On Sun, Aug 10, 2014 at 4:17 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> > >
> > > > HBASE-11715 <https://issues.apache.org/jira/browse/HBASE-11715> opened.
> > > >
> > > > 2014-08-10 7:12 GMT-04:00 Jean-Marc Spaggiari <jean-m...@spaggiari.org>:
> > > >
> > > > > +1 too for a tool to produce a hash of a table: one hash per region,
> > > > > or, as Lars said, one hash per range. You define the number of
> > > > > buckets you want and run the MR job, which produces a list of hashes
> > > > > you can compare between the 2 clusters. Might be pretty simple to
> > > > > do. The more buckets you define, the lower the risk of a hash
> > > > > collision. We could even have a global hash plus one hash per
> > > > > bucket, and other options...
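> > > > > In the meantime, a crude per-range check can be scripted against
> > > > > the shell. This is only a sketch: the start/stop rows are examples,
> > > > > and the grep assumes the default scan output, where every cell
> > > > > prints on one line containing "column=":
> > > > >
> > > > > echo "scan 'ADMd5', {STARTROW => 'A', STOPROW => 'B'}" | hbase shell | grep 'column=' | md5sum
> > > > >
> > > > > Run the same command against both clusters and compare the digests.
> > > > > The grep drops banners and timing lines that would otherwise skew
> > > > > the hash, and since CopyTable preserves timestamps, matching data
> > > > > should produce matching digests.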
> > > > > 2014-08-10 1:59 GMT-04:00 anil gupta <anilgupt...@gmail.com>:
> > > > >
> > > > > > +1 for a MerkleTree or range-hash based implementation. We had a
> > > > > > table with 1 billion records. We ran verifyRep for that table
> > > > > > across two data centers, and it took close to 1 week to finish.
> > > > > > It seems that at present VerifyRep compares every row byte by
> > > > > > byte.
> > > > > >
> > > > > > On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <la...@apache.org> wrote:
> > > > > >
> > > > > > > VerifyReplication is something you could use. It's not
> > > > > > > replication specific, just named that way because it was
> > > > > > > initially conceived as a tool to verify that replication is
> > > > > > > working correctly. Unfortunately it will need to ship all data
> > > > > > > from the remote cluster, which is quite inefficient. I think we
> > > > > > > should include a better way with HBase, maybe using Merkle
> > > > > > > trees, or at least hashes of ranges, and compare those.
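> > > > > > > For reference, the invocation is along these lines (the remote
> > > > > > > cluster must already be set up as a replication peer; the peer
> > > > > > > id and table name here are only examples):
> > > > > > >
> > > > > > > hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 1 ADMd5
> > > > > > >
> > > > > > > The job should report GOODROWS/BADROWS counters when it
> > > > > > > completes.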
> > > > > > >
> > > > > > > -- Lars
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Colin Kincaid Williams <disc...@uw.edu>
> > > > > > > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > > > > > > Sent: Saturday, August 9, 2014 2:28 PM
> > > > > > > Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.
> > > > > > >
> > > > > > > Hi Everybody,
> > > > > > >
> > > > > > > I do wish to upgrade to a more recent hbase soon. However, the
> > > > > > > choice isn't entirely mine. Does anybody know how to verify the
> > > > > > > contents between tables across clusters after a copytable
> > > > > > > operation? I see replication.VerifyReplication, but that seems
> > > > > > > replication specific. Maybe I should have begun with
> > > > > > > replication in the first place...
> > > > > > >
> > > > > > > On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi Colin,
> > > > > > > >
> > > > > > > > You might want to consider upgrading. The current stable
> > > > > > > > version is 0.98.4 (soon .5).
> > > > > > > >
> > > > > > > > Even just going to 0.94 will give a lot of new features,
> > > > > > > > stability, and performance. 0.92.x can be upgraded to 0.94.x
> > > > > > > > without any downtime and without any upgrade steps necessary.
> > > > > > > > For an upgrade to 0.98 and later, you'd need some downtime
> > > > > > > > and would also have to execute an upgrade step.
> > > > > > > >
> > > > > > > > -- Lars
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Colin Kincaid Williams <disc...@uw.edu>
> > > > > > > > To: user@hbase.apache.org
> > > > > > > > Sent: Friday, August 8, 2014 1:16 PM
> > > > > > > > Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.
> > > > > > > >
> > > > > > > > Not in the hbase shell I have:
> > > > > > > >
> > > > > > > > hbase version
> > > > > > > > 14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
> > > > > > > > 14/08/08 14:16:08 INFO util.VersionInfo: Subversion file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3 -r Unknown
> > > > > > > > 14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26 17:11:38 PST 2013
> > > > > > > >
> > > > > > > > On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Using a simplified version of your command, I saw the
> > > > > > > > > following in the shell output (you may have noticed as
> > > > > > > > > well):
> > > > > > > > >
> > > > > > > > > An argument ignored (unknown or overridden): BLOOMFILTER
> > > > > > > > > An argument ignored (unknown or overridden): VERSIONS
> > > > > > > > > 0 row(s) in 2.1110 seconds
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <disc...@uw.edu> wrote:
> > > > > > > > >
> > > > > > > > > > I have discovered the error. I made a mistake regarding
> > > > > > > > > > the compression and the bloom filter: the new table
> > > > > > > > > > doesn't have them enabled, and the old one does. However,
> > > > > > > > > > I'm wondering how I can create tables with splits, bloom
> > > > > > > > > > filter, and compression enabled. Shouldn't the following
> > > > > > > > > > command return an error?
> > > > > > > > > >
> > > > > > > > > > hbase(main):001:0> create 'ADMd5','a',{
> > > > > > > > > > hbase(main):002:1* BLOOMFILTER => 'ROW',
> > > > > > > > > > hbase(main):003:1* VERSIONS => '1',
> > > > > > > > > > hbase(main):004:1* COMPRESSION => 'SNAPPY',
> > > > > > > > > > hbase(main):005:1* MIN_VERSIONS => '0',
> > > > > > > > > > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > > > > > > > > > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> > > > > > > > > > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> > > > > > > > > > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> > > > > > > > > > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> > > > > > > > > > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > > > > > > > > > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> > > > > > > > > > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> > > > > > > > > > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> > > > > > > > > > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> > > > > > > > > > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > > > > > > > > > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> > > > > > > > > > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> > > > > > > > > > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> > > > > > > > > > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> > > > > > > > > > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> > > > > > > > > > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> > > > > > > > > > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> > > > > > > > > > 0 row(s) in 1.8010 seconds
> > > > > > > > > >
> > > > > > > > > > hbase(main):024:0> describe 'ADMd5'
> > > > > > > > > > DESCRIPTION ENABLED
> > > > > > > > > > {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER
> > > > > > > > > > => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3',
> > > > > > > > > > COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL =>
> > > > > > > > > > '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> > > > > > > > > > BLOCKCACHE => 'true'}]} true
> > > > > > > > > > 1 row(s) in 0.0420 seconds
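> > > > > > > > > > Looking at it again, I suspect the bare 'a' argument
> > > > > > > > > > creates the family with defaults, and the trailing hash
> > > > > > > > > > is then parsed as table-level options, which would
> > > > > > > > > > explain the family settings being silently ignored.
> > > > > > > > > > Wrapping the options in a NAME map seems to stick (a
> > > > > > > > > > sketch against my 0.92 shell, with the split list
> > > > > > > > > > truncated here; untested elsewhere):
> > > > > > > > > >
> > > > > > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW',
> > > > > > > > > > VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS =>
> > > > > > > > > > '0'}, {SPLITS => ['/++ASUZm4u7YsTcF/VtK6Q==',
> > > > > > > > > > '/zyuFR1VmhJyF4rbWsFnEg==']}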
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Colin,
> > > > > > > > > > >
> > > > > > > > > > > Just to make sure: is table A on the source cluster and
> > > > > > > > > > > not compressed, and table B on the destination cluster
> > > > > > > > > > > and SNAPPY compressed? Is that correct? Then the ratio
> > > > > > > > > > > should be the opposite. Are you able to du -h from
> > > > > > > > > > > hadoop to see whether all regions are evenly bigger or
> > > > > > > > > > > whether anything else is wrong?
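> > > > > > > > > > > Something like the following, assuming the table
> > > > > > > > > > > directory sits directly under the rootdir you quoted
> > > > > > > > > > > (each region is a subdirectory, so this lists
> > > > > > > > > > > per-region sizes):
> > > > > > > > > > >
> > > > > > > > > > > sudo -u hdfs hadoop fs -du -h /a_d/ADMd5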
The funny thing is the source > > > table > > > > >> is > > > > >> > >> > smaller > > > > >> > >> > > > than the destination. > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez < > > > > >> > >> > este...@cloudera.com> > > > > >> > >> > > > wrote: > > > > >> > >> > > > > > > > >> > >> > > > > Hi Colin, > > > > >> > >> > > > > > > > > >> > >> > > > > Have you verified if the content of /a_d includes > WALs > > > > and/or > > > > >> > the > > > > >> > >> > > content > > > > >> > >> > > > > of the snapshots or the HBase archive? have you tried > > to > > > > >> major > > > > >> > >> > compact > > > > >> > >> > > > > table B? does it makes any difference? > > > > >> > >> > > > > > > > > >> > >> > > > > regards, > > > > >> > >> > > > > esteban. > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > -- > > > > >> > >> > > > > Cloudera, Inc. > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid > Williams > > < > > > > >> > >> > disc...@uw.edu > > > > >> > >> > > > > > > > >> > >> > > > > wrote: > > > > >> > >> > > > > > > > > >> > >> > > > > > I used the copy table command to copy a database > > > between > > > > >> the > > > > >> > >> > original > > > > >> > >> > > > > > cluster A and a new cluster B. I have noticed that > > the > > > > >> > rootdir is > > > > >> > >> > > > larger > > > > >> > >> > > > > > than 2X the size of the original. I am trying to > > > account > > > > >> for > > > > >> > >> such a > > > > >> > >> > > > large > > > > >> > >> > > > > > difference. The following are some details about > the > > > > table. > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > > >> > >> > > > > > I'm trying to figure out why my copied table is > more > > > than > > > > >> 2X > > > > >> > the > > > > >> > >> > size > > > > >> > >> > > > of > > > > >> > >> > > > > > the original table. Could the bloomfilter itself > > > account > > > > >> for > > > > >> > >> this? 
> > > > > > > > > > > > >
> > > > > > > > > > > > > regards,
> > > > > > > > > > > > > esteban.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Cloudera, Inc.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <disc...@uw.edu> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I used the copytable command to copy a table
> > > > > > > > > > > > > > between the original cluster A and a new cluster
> > > > > > > > > > > > > > B. I have noticed that the rootdir is larger than
> > > > > > > > > > > > > > 2X the size of the original, and I am trying to
> > > > > > > > > > > > > > account for such a large difference. Could the
> > > > > > > > > > > > > > bloomfilter itself account for this? The
> > > > > > > > > > > > > > following are some details about the table.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The guide I used as a reference:
> > > > > > > > > > > > > > http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Supposedly the original command used to create
> > > > > > > > > > > > > > the table on cluster A:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW',
> > > > > > > > > > > > > > VERSIONS => '1', COMPRESSION => 'SNAPPY',
> > > > > > > > > > > > > > MIN_VERSIONS => '0'}
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > How I created the target table on cluster B:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > create 'ADMd5','a',{
> > > > > > > > > > > > > > BLOOMFILTER => 'ROW',
> > > > > > > > > > > > > > VERSIONS => '1',
> > > > > > > > > > > > > > COMPRESSION => 'SNAPPY',
> > > > > > > > > > > > > > MIN_VERSIONS => '0',
> > > > > > > > > > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > > > > > > > > > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
> > > > > > > > > > > > > > '0sZYnBd83ul58d1O8I2JnA==',
> > > > > > > > > > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
> > > > > > > > > > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
> > > > > > > > > > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > > > > > > > > > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
> > > > > > > > > > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
> > > > > > > > > > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
> > > > > > > > > > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
> > > > > > > > > > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > > > > > > > > > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
> > > > > > > > > > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
> > > > > > > > > > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
> > > > > > > > > > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
> > > > > > > > > > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
> > > > > > > > > > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
> > > > > > > > > > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > How the tables now appear in the hbase shell:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > table A:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > describe 'ADMd5'
> > > > > > > > > > > > > > DESCRIPTION ENABLED
> > > > > > > > > > > > > > {NAME => 'ADMd5', FAMILIES => [{NAME => 'a',
> > > > > > > > > > > > > > BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
> > > > > > > > > > > > > > VERSIONS => '3', COMPRESSION => 'NONE',
> > > > > > > > > > > > > > MIN_VERSIONS => '0', TTL => '2147483647',
> > > > > > > > > > > > > > BLOCKSIZE => '65536', IN_MEMORY => 'false',
> > > > > > > > > > > > > > BLOCKCACHE => 'true'}]} true
> > > > > > > > > > > > > > 1 row(s) in 0.0370 seconds
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > table B:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > hbase(main):003:0> describe 'ADMd5'
> > > > > > > > > > > > > > DESCRIPTION ENABLED
> > > > > > > > > > > > > > {NAME => 'ADMd5', FAMILIES => [{NAME => 'a',
> > > > > > > > > > > > > > BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
> > > > > > > > > > > > > > VERSIONS => '1', COMPRESSION => 'SNAPPY',
> > > > > > > > > > > > > > MIN_VERSIONS => '0', TTL => '2147483647',
> > > > > > > > > > > > > > BLOCKSIZE => '65536', IN_MEMORY => 'false',
> > > > > > > > > > > > > > BLOCKCACHE => 'true'}]} true
> > > > > > > > > > > > > > 1 row(s) in 0.0280 seconds
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The containing folder size in hdfs:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > table A:
> > > > > > > > > > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > > > > > > > > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > > > > > > > > > > > 227.4g /a_d
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > table B:
> > > > > > > > > > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > > > > > > > > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > > > > > > > > > > > 501.0g /a_d
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
> > > > > >
> > > > > > --
> > > > > > Thanks & Regards,
> > > > > > Anil Gupta
> >
> > --
> > Thanks & Regards,
> > Anil Gupta