Heh. Thanks for the links. I already read the Do's and Don'ts :-). The video's volume is rather low.
I am already using lzo as my compression method. My regions are set to 30GB in resident memory.

On Sat, Mar 31, 2012 at 1:19 PM, Marcos Ortiz <[email protected]> wrote:

> Well, doing some calculations, you have 18 TB of data, divided in 9200
> regions, you have approximately 2.4 GB per region. Is this correct?
>
> Well, my first advice is that you have to disable the automatic split
> mechanism in HBase. It is better to do this manually; otherwise you will
> have an insane number of regions in a short time.
>
> The second is to enable compression (Gzip, LZO, Snappy) across your whole
> HBase cluster. This gives you less data to work with and less network
> overhead.
>
> Omer, one of the software engineers at the LA Hadoop User Group, gave an
> excellent talk about HBase called "HBase Do's and Don'ts". I recommend
> that you watch this talk.
>
> See the post first on Cloudera's blog:
> http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
>
> - Video
> http://www.meetup.com/LA-HUG/pages/Video_from_April_13th_HBASE_DO%27S_and_DON%27TS/
>
> On 3/31/2012 5:33 AM, Rita wrote:
>
>> I have close to 9200 regions. Is there an example I can follow? Or are
>> there tools to do this already?
>>
>> On Fri, Mar 30, 2012 at 10:11 AM, Marcos Ortiz <[email protected]> wrote:
>>
>> On 03/30/2012 04:54 AM, Rita wrote:
>>
>>> Thanks for the responses. I am using 0.90.4-cdh3. I exported the table
>>> using hbase exporter. Yes, the previous table still exists but on a
>>> different cluster. My region servers are large, close to 12GB in size.
>>
>> What is the total number of your regions?
>>
>>> I want to understand more about HFiles. We export the table as a
>>> series of HFiles and then import them in?
>>
>> Yes. The simplest way to do this is using the TableOutputFormat, but
>> if you use the HFileOutputFormat instead, the process will be more
>> efficient, because this feature (bulk loads) uses less CPU and
>> network. With a MapReduce job, you prepare your data using the
>> HFileOutputFormat (Hadoop's TotalOrderPartitioner class is used to
>> partition the map output into disjoint ranges of the key space,
>> corresponding to the key ranges of the regions in the table).
>>
>>> What is the difference between that and the regular MR export job?
>>
>> The main difference from regular MR jobs is the output: instead of
>> using the classic output formats like TextOutputFormat,
>> MultipleOutputFormat, SequenceFileOutputFormat, etc., you will use
>> the HFileOutputFormat, which writes the native data file type for
>> HBase (HFile).
>>
>>> The idea sounds good because it sounds simple on the surface :-)
>>>
>>> On Fri, Mar 30, 2012 at 12:08 AM, Stack <[email protected]> wrote:
>>>
>>>> On Thu, Mar 29, 2012 at 7:57 PM, Rita <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am importing a 40+ billion row table which I exported several
>>>>> months ago. The data size is close to 18TB on hdfs (3x replication).
>>>>
>>>> Does the table from back then still exist? Or do you remember what
>>>> the key spread was like? Could you precreate the old table?
>>>>
>>>>> My problem is when I try to import it with mapreduce it takes a few
>>>>> days -- which is ok -- however when the job fails for whatever
>>>>> reason, I have to restart everything. Is it possible to import the
>>>>> table in chunks like, import 1/3, 2/3, and then finally 3/3 of the
>>>>> table?
>>>>
>>>> Yeah. Funny how the plug gets pulled on the rack when the three day
>>>> job is at the end, 95% done.
>>>>
>>>>> Btw, the job creates close to 150k mapper jobs, that's a problem
>>>>> waiting to happen :-)
>>>>
>>>> Are you running 0.92? If not, you should, and go for bigger
>>>> regions. 10G?
>>>>
>>>> St.Ack
>>
>> --
>> Marcos Luis Ortíz Valmaseda (@marcosluis2186)
>> Data Engineer at UCI
>> http://marcosluis2186.posterous.com
>> http://www.uci.cu/
>>
>> --
>> --- Get your facts first, then you can distort them as you please.--
>
> --
> Marcos Luis Ortíz Valmaseda (@marcosluis2186)
> Data Engineer at UCI
> http://marcosluis2186.posterous.com
>
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSITY OF INFORMATICS
> SCIENCES...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci

--
--- Get your facts first, then you can distort them as you please.--
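
For anyone following the region-size arithmetic in this thread, here is a rough back-of-the-envelope sketch. It is not an HBase tool, just plain Python; the function names are made up for illustration, and it assumes decimal units (1 TB = 1000 GB). Since the original post leaves it unclear whether the 18TB figure includes the 3x HDFS replication, replication is a parameter so both readings can be checked.

```python
import math

# Back-of-the-envelope region sizing using the numbers quoted in this thread.
# Assumptions (not from HBase itself): decimal units, illustrative names.

def avg_region_size_gb(total_tb, regions, replication=1):
    """Average data per region in GB, optionally discounting HDFS replication."""
    if regions <= 0 or replication <= 0:
        raise ValueError("regions and replication must be positive")
    return total_tb * 1000.0 / replication / regions

def regions_needed(total_tb, target_region_gb, replication=1):
    """Number of regions if every region were filled to target_region_gb."""
    if target_region_gb <= 0 or replication <= 0:
        raise ValueError("target_region_gb and replication must be positive")
    logical_gb = total_tb * 1000.0 / replication
    return math.ceil(logical_gb / target_region_gb)

if __name__ == "__main__":
    # 18 TB over 9200 regions, reading 18 TB as the logical data size.
    print(round(avg_region_size_gb(18, 9200), 2))
    # Same numbers, reading 18 TB as already including 3x replication.
    print(round(avg_region_size_gb(18, 9200, replication=3), 2))
    # Region count if regions were grown to the 10G suggested above.
    print(regions_needed(18, 10), regions_needed(18, 10, replication=3))
```

Under these assumptions the average comes out to roughly 2 GB per region, or well under 1 GB if the 18TB already counts the three replicas, and the 10G suggestion would bring the region count down to the low thousands or hundreds.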
