Michael,

Your solution seems to work.  My keys are not evenly dispersed, so I
couldn't use the createTable-with-N-regions signature, but I was able to
sample one data set and come up with a halfway reasonable set of starting
split keys.  I used those to create a new table with 20 splits.  HBase
appears to be much happier now; it's continuing to re-split my initial
tables and distribute the load better.
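
In case it's useful to anyone else, the gist of what I did looks roughly
like this (a minimal sketch against the 0.89/0.90 client API, assuming the
createTable(HTableDescriptor, byte[][]) overload is present in this build;
the split keys below are placeholders, not my real sampled keys):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("mytable");
        desc.addFamily(new HColumnDescriptor("metrics"));

        // Boundary keys sampled from one of the input data sets.  With 20
        // split keys the table starts out with 21 regions, so the bulk load
        // output gets spread across the cluster instead of landing in one
        // giant region.
        byte[][] splitKeys = new byte[][] {
            Bytes.toBytes("ad_format160x600..."),  // placeholder
            Bytes.toBytes("ad_format300x250..."),  // placeholder
            Bytes.toBytes("ad_format728x90...")    // ...and so on, 20 total
        };

        admin.createTable(desc, splitKeys);
      }
    }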

thanks,
Mike


On Tue, Jan 4, 2011 at 9:09 AM, Michael Segel <[email protected]> wrote:

>
>
> Marc,
>
> Just an idea.
> Can you create your table with N regions, with null as the start key and
> the largest possible key value as the end key?
> Talking with ssechrist on IRC, he pointed me to this API:
>
> http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable%28org.apache.hadoop.hbase.HTableDescriptor,%20byte[],%20byte[],%20int%29
>
> And looking at the createTable() methods....
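>
> Something along these lines -- just a rough, untested sketch against that
> API (the boundary keys and column family below are placeholders):
>
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.HColumnDescriptor;
>     import org.apache.hadoop.hbase.HTableDescriptor;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class CreateWithNRegions {
>       public static void main(String[] args) throws Exception {
>         HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>         HTableDescriptor desc = new HTableDescriptor("mytable");
>         desc.addFamily(new HColumnDescriptor("metrics"));
>         // Ask for 20 regions with boundaries spaced evenly between the
>         // smallest and largest key values you expect (placeholders here).
>         admin.createTable(desc, Bytes.toBytes("a"), Bytes.toBytes("z"), 20);
>       }
>     }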
>
> HTH
>
> -Mike
>
> > From: [email protected]
> > Date: Tue, 4 Jan 2011 06:03:21 -0800
> > Subject: hbase bulk load / table split
> > To: [email protected]
> >
> > I've made some good progress using the HBase Bulk Load Tool with HBase
> > 0.89.20100924+28.
> >
> > My initial implementation did not have importtsv do compression, and it
> > ran directly on the HBase cluster's Hadoop.  It's been working ok for a
> > while (but slow, because of limited resources).
> >
> > My next implementation, as discussed in another thread, has compression
> > settings turned on for importtsv (thanks, Lars).  And I am running the
> > importtsv on a remote cluster and then distcp'ing (thanks, Todd) the
> > results to the HBase cluster for the completebulkload step.
> >
> > I'm trying this out with a fresh (empty) HBase table.  So, the first run
> > of importtsv takes a very long time, because the table only has one
> > region, so it starts only one Reducer.
> >
> >    - Bulk load into a new table
> >    - About 20 GB of data (compressed with gzip)
> >    - Created one massive region
> >
> > It seemed to complete successfully.  But we are seeing some intermittent
> > errors (missing blocks and such).
> >
> > > Could not obtain block: blk_-5944324410280250477_429443
> > > file=/hbase/mytable/7c2b09e1ef8c4984732f362d7876305c/metrics/7947729174003011436
> >
> > The initial region seems to have split once, but I'm not sure the split
> > completed, since the key ranges overlap and the storeFileSizeMB seems to
> > be about as big as it started out.  My theory is that the initial load is
> > too large for a region, and the split either failed or is still in
> > progress.
> >
> >  Both on the same Region Server:
> >
> > > mytable,ad_format728x90site_category2advertiser14563countrysepublisher2e03ab73-b234-4413-bcee-6183a99bd840starttime1293897600,1294094158507.2360f0a03e2566c72ea1a07c40f5f296.
> > > stores=2, storefiles=1075, storefileSizeMB=19230, memstoreSizeMB=0, storefileIndexSizeMB=784
> > > --
> > > mytable,,1294094158507.33b1e47c5fb004aa801b0bd88ce8322d.
> > > stores=2, storefiles=1083, storefileSizeMB=19546, memstoreSizeMB=0, storefileIndexSizeMB=796
> >
> > Another new table on this same HBase cluster, loaded around the same
> > time, has already split into 69 regions (storefileSizeMB 200 - 400 each).
> > This one was loaded in smaller chunks with importtsv running directly on
> > the HBase cluster, but also with compression on.
> >
> > Now that I've gotten all the background down, here are my questions:
> >
> >    1. Is it still working on the split?  Any way to monitor progress?
> >    2. Can I force more splits?  (Sketch of what I have in mind below.)
> >    3. Should I have done something first to avoid having the bulk load
> >    create one big region?
> >    4. Would it be easier to split if my initial bulkload was not gzip
> >    compressed?
> >    5. Am I looking in the wrong place entirely for this issue?
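> >
> > For 2., I'm guessing I could ask for splits by hand with something like
> > the following (an untested sketch against the HBaseAdmin API in this
> > build; I don't know whether it helps if a split is already in progress):
> >
> >     import org.apache.hadoop.hbase.HBaseConfiguration;
> >     import org.apache.hadoop.hbase.client.HBaseAdmin;
> >
> >     public class ForceSplit {
> >       public static void main(String[] args) throws Exception {
> >         HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
> >         // Passing a table name requests a split of every region in the
> >         // table; passing a full region name targets just that region.
> >         admin.split("mytable");
> >       }
> >     }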
> >
> > thanks,
> > Marc
>
