Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

anil gupta Sun, 10 Aug 2014 23:25:34 -0700

Hi Colin,

We also faced a scenario where, after copying table "A" from cluster 1 to
cluster 2, the size of the HDFS files on the two clusters was not equal. We
also assumed that it should be equal, hence we ran the VerifyRep job.
I don't know the reason behind this discrepancy, but I just wanted to
share this so that you know you are not the only one facing it.
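
To illustrate the range-hash idea proposed earlier in this thread, here is a
rough Python sketch. It is not an existing HBase tool: the function names
(bucket_digests, differing_buckets), the sample rows, and the split keys are
made up for illustration, and a real version would presumably run as an MR
job on each cluster so that only the digests need to be shipped and compared.

```python
import hashlib
from bisect import bisect_right


def bucket_digests(rows, split_keys):
    """Return one SHA-256 hex digest per key range.

    rows: iterable of (row_key, value) byte pairs, sorted by row_key.
    split_keys: sorted list of byte keys; N split keys define N + 1 buckets.
    """
    hashers = [hashlib.sha256() for _ in range(len(split_keys) + 1)]
    for key, value in rows:
        # A key equal to a split key falls into the bucket that starts there.
        h = hashers[bisect_right(split_keys, key)]
        # Length-prefix each field so (b"ab", b"c") and (b"a", b"bc") differ.
        for field in (key, value):
            h.update(len(field).to_bytes(4, "big"))
            h.update(field)
    return [h.hexdigest() for h in hashers]


def differing_buckets(digests_a, digests_b):
    """Return the indexes of buckets whose digests do not match."""
    return [i for i, (a, b) in enumerate(zip(digests_a, digests_b)) if a != b]


# Hypothetical sample data standing in for the same table on two clusters.
source = [(b"a1", b"x"), (b"m7", b"y"), (b"z9", b"z")]
copy = [(b"a1", b"x"), (b"m7", b"CHANGED"), (b"z9", b"z")]
splits = [b"h", b"t"]

print(differing_buckets(bucket_digests(source, splits),
                        bucket_digests(copy, splits)))
```

Only the buckets that differ would then need a row-by-row comparison, which
is where the saving over a full VerifyRep run would come from.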


~Anil


On Sun, Aug 10, 2014 at 2:21 PM, Colin Kincaid Williams <disc...@uw.edu>
wrote:

> By the way, I have copied the table across clusters, with the tables
> configured the same. The source cluster has an underlying ext2 filesystem,
> while the dest cluster has an underlying ext4 filesystem. The counts are
> the same for the tables. Will the filesystem account for the difference in
> directory size?
>
> [root@clusterA_ext2 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
> dus: DEPRECATED: Please use 'du -s' instead.
> 225.9g  /a_d
>
>
> [root@clusterB_ext4 ~]#  sudo -u hdfs hadoop fs -dus -h /a_d/
> dus: DEPRECATED: Please use 'du -s' instead.
> 172.8g  /a_d
>
>
>
> On Sun, Aug 10, 2014 at 4:17 AM, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
>
> > HBASE-11715 <https://issues.apache.org/jira/browse/HBASE-11715> opened.
> >
> >
> > 2014-08-10 7:12 GMT-04:00 Jean-Marc Spaggiari <jean-m...@spaggiari.org>:
> >
> > > +1 too for a tool to produce a hash of a table. Like, one hash per
> > > region, or as Lars said, one hash per range. You define the number of
> > > buckets you want, run the MR job, which produces a list of hashes, and
> > > compare that from the 2 clusters. Might be pretty simple to do. The
> > > more buckets you define, the lower the risk of a hash collision. We
> > > can even have a global hash and one hash per bucket, and other
> > > options...
> > >
> > >
> > > 2014-08-10 1:59 GMT-04:00 anil gupta <anilgupt...@gmail.com>:
> > >
> > >> +1 for MerkleTree or Range Hash based implementation. We had a table
> > >> with 1 billion records. We ran VerifyRep for that table across two
> > >> data centers and it took close to 1 week to finish. It seems that, at
> > >> present, VerifyRep compares every row byte by byte.
> > >>
> > >>
> > >> On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <la...@apache.org> wrote:
> > >>
> > >> > VerifyReplication is something you could use. It's not replication
> > >> > specific, just named that way because it was initially conceived as
> > >> > a tool to verify that replication is working correctly.
> > >> > Unfortunately it will need to ship all data from the remote cluster,
> > >> > which is quite inefficient.
> > >> > I think we should include a better way with HBase, maybe using
> > >> > Merkle trees, or at least hashes of ranges, and compare those.
> > >> >
> > >> > -- Lars
> > >> >
> > >> >
> > >> >
> > >> > ________________________________
> > >> >  From: Colin Kincaid Williams <disc...@uw.edu>
> > >> > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > >> > Sent: Saturday, August 9, 2014 2:28 PM
> > >> > Subject: Re: Large discrepancy in hdfs hbase rootdir size after
> > >> > copytable operation.
> > >> >
> > >> >
> > >> >
> > >> > Hi Everybody,
> > >> >
> > >> > I do wish to upgrade to a more recent hbase soon. However, the choice
> > >> > isn't entirely mine. Does anybody know how to verify the contents
> > >> > between tables across clusters after a copytable operation?
> > >> > I see replication.VerifyReplication, but that seems replication
> > >> > specific. Maybe I should have begun with replication in the first
> > >> > place...
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org> wrote:
> > >> >
> > >> > Hi Colin,
> > >> > >
> > >> > >You might want to consider upgrading. The current stable version is
> > >> > >0.98.4 (soon .5).
> > >> > >
> > >> > >Even just going to 0.94 will give a lot of new features, stability,
> > >> > >and performance.
> > >> > >0.92.x can be upgraded to 0.94.x without any downtime and without
> > >> > >any upgrade steps necessary.
> > >> > >For an upgrade to 0.98 and later you'd need some downtime and also
> > >> > >have to execute an upgrade step.
> > >> > >
> > >> > >
> > >> > >-- Lars
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >----- Original Message -----
> > >> > >From: Colin Kincaid Williams <disc...@uw.edu>
> > >> > >To: user@hbase.apache.org
> > >> > >Cc:
> > >> > >Sent: Friday, August 8, 2014 1:16 PM
> > >> > >Subject: Re: Large discrepancy in hdfs hbase rootdir size after
> > >> > >copytable operation.
> > >> > >
> > >> > >Not in the hbase shell I have:
> > >> > >
> > >> > >hbase version
> > >> > >14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
> > >> > >14/08/08 14:16:08 INFO util.VersionInfo: Subversion file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3 -r Unknown
> > >> > >14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26 17:11:38 PST 2013
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >> > >
> > >> > >> Using a simplified version of your command, I saw the following in
> > >> > >> the shell output (you may have noticed as well):
> > >> > >>
> > >> > >> An argument ignored (unknown or overridden): BLOOMFILTER
> > >> > >> An argument ignored (unknown or overridden): VERSIONS
> > >> > >> 0 row(s) in 2.1110 seconds
> > >> > >>
> > >> > >> Cheers
> > >> > >>
> > >> > >>
> > >> > >> On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <disc...@uw.edu>
> > >> > >> wrote:
> > >> > >>
> > >> > >> > I have discovered the error. I made a mistake regarding the
> > >> > >> > compression and the bloom filter. The new table doesn't have them
> > >> > >> > enabled, and the old does. However, I'm wondering how I can create
> > >> > >> > tables with splits, bloom filter, and compression enabled.
> > >> > >> > Shouldn't the following command return an error?
> > >> > >> >
> > >> > >> > hbase(main):001:0> create 'ADMd5','a',{
> > >> > >> >
> > >> > >> > hbase(main):002:1* BLOOMFILTER => 'ROW',
> > >> > >> > hbase(main):003:1* VERSIONS => '1',
> > >> > >> > hbase(main):004:1* COMPRESSION => 'SNAPPY',
> > >> > >> > hbase(main):005:1* MIN_VERSIONS => '0',
> > >> > >> > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > >> > >> > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> > >> > >> > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> > >> > >> > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> > >> > >> > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> > >> > >> > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > >> > >> > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> > >> > >> > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> > >> > >> > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> > >> > >> > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> > >> > >> > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > >> > >> > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> > >> > >> > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> > >> > >> > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> > >> > >> > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> > >> > >> > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> > >> > >> > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> > >> > >> > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> > >> > >> > 0 row(s) in 1.8010 seconds
> > >> > >> >
> > >> > >> > hbase(main):024:0> describe 'ADMd5'
> > >> > >> > DESCRIPTION                                        ENABLED
> > >> > >> >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'NONE',
> > >> > >> >  REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
> > >> > >> >  MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > >> > >> >  IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}      true
> > >> > >> >
> > >> > >> > 1 row(s) in 0.0420 seconds
> > >> > >> >
> > >> > >> >
> > >> > >> >
> > >> > >> > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org>
> > >> > >> > wrote:
> > >> > >> >
> > >> > >> > > Hi Colin,
> > >> > >> > >
> > >> > >> > > Just to make sure.
> > >> > >> > >
> > >> > >> > > Is table A from the source cluster and not compressed, and
> > >> > >> > > table B in the destination cluster and SNAPPY compressed? Is that
> > >> > >> > > correct? Then the ratio should be the opposite. Are you able to
> > >> > >> > > du -h from hadoop to see if all regions are evenly bigger or if
> > >> > >> > > anything else is wrong?
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <disc...@uw.edu>:
> > >> > >> > >
> > >> > >> > > > I haven't yet tried to major compact table B. I will look up
> > >> > >> > > > some documentation on WALs and snapshots to find this
> > >> > >> > > > information in the hdfs filesystem tomorrow. Could it be caused
> > >> > >> > > > by the bloomfilter existing on table B, but not table A? The
> > >> > >> > > > funny thing is the source table is smaller than the destination.
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <este...@cloudera.com>
> > >> > >> > > > wrote:
> > >> > >> > > >
> > >> > >> > > > > Hi Colin,
> > >> > >> > > > >
> > >> > >> > > > > Have you verified if the content of /a_d includes WALs and/or
> > >> > >> > > > > the content of the snapshots or the HBase archive? Have you
> > >> > >> > > > > tried to major compact table B? Does it make any difference?
> > >> > >> > > > >
> > >> > >> > > > > regards,
> > >> > >> > > > > esteban.
> > >> > >> > > > >
> > >> > >> > > > >
> > >> > >> > > > >
> > >> > >> > > > > --
> > >> > >> > > > > Cloudera, Inc.
> > >> > >> > > > >
> > >> > >> > > > >
> > >> > >> > > > >
> > >> > >> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <disc...@uw.edu>
> > >> > >> > > > > wrote:
> > >> > >> > > > >
> > >> > >> > > > > > I used the copy table command to copy a database between the
> > >> > >> > > > > > original cluster A and a new cluster B. I have noticed that the
> > >> > >> > > > > > rootdir is larger than 2X the size of the original. I am trying
> > >> > >> > > > > > to account for such a large difference. The following are some
> > >> > >> > > > > > details about the table.
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > I'm trying to figure out why my copied table is more than 2X
> > >> > >> > > > > > the size of the original table. Could the bloomfilter itself
> > >> > >> > > > > > account for this?
> > >> > >> > > > > >
> > >> > >> > > > > > The guide I used as a reference:
> > >> > >> > > > > >
> > >> > >> > > > > > http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > Supposedly the original command used to create the table on
> > >> > >> > > > > > cluster A:
> > >> > >> > > > > >
> > >> > >> > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS => '1',
> > >> > >> > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > How I created the target table on cluster B:
> > >> > >> > > > > >
> > >> > >> > > > > > create 'ADMd5','a',{
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > BLOOMFILTER => 'ROW',
> > >> > >> > > > > > VERSIONS => '1',
> > >> > >> > > > > > COMPRESSION => 'SNAPPY',
> > >> > >> > > > > > MIN_VERSIONS => '0',
> > >> > >> > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > >> > >> > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
> > >> > >> > > > > > '0sZYnBd83ul58d1O8I2JnA==',
> > >> > >> > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
> > >> > >> > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
> > >> > >> > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > >> > >> > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
> > >> > >> > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
> > >> > >> > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
> > >> > >> > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
> > >> > >> > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > >> > >> > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
> > >> > >> > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
> > >> > >> > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
> > >> > >> > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
> > >> > >> > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
> > >> > >> > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
> > >> > >> > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > How the tables now appear in hbase shell:
> > >> > >> > > > > >
> > >> > >> > > > > > table A:
> > >> > >> > > > > >
> > >> > >> > > > > > describe 'ADMd5'
> > >> > >> > > > > > DESCRIPTION                                        ENABLED
> > >> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'NONE',
> > >> > >> > > > > >  REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
> > >> > >> > > > > >  MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > >> > >> > > > > >  IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}      true
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > 1 row(s) in 0.0370 seconds
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > table B:
> > >> > >> > > > > >
> > >> > >> > > > > > hbase(main):003:0> describe 'ADMd5'
> > >> > >> > > > > > DESCRIPTION                                        ENABLED
> > >> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'ROW',
> > >> > >> > > > > >  REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY',
> > >> > >> > > > > >  MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > >> > >> > > > > >  IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}      true
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > 1 row(s) in 0.0280 seconds
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > The size of the containing folder in hdfs:
> > >> > >> > > > > > table A:
> > >> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > >> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > >> > >> > > > > > 227.4g  /a_d
> > >> > >> > > > > >
> > >> > >> > > > > > table B:
> > >> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > >> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > >> > >> > > > > > 501.0g  /a_d
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
> > >> > >> > > > > >
> > >> > >> > > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Thanks & Regards,
> > >> Anil Gupta
> > >>
> > >
> > >
> >
>



-- 
Thanks & Regards,
Anil Gupta
