Thank you so much for pointing out the mistake sir.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Mon, Jan 21, 2013 at 12:06 PM, Anoop Sam John <[email protected]> wrote:

> @Mohammad
> As he is using HFileOutputFormat, there is no put call happening on
> HTable. In this case the MR will create the HFiles directly without using
> the normal HBase write path. Then later using HRS API the HFiles are loaded
> to the table regions.
> In this case the number of reducers will be that of the table regions. So
> Austin you can check with proper presplit of table.
>
> -Anoop-
> ________________________________________
> From: Mohammad Tariq [[email protected]]
> Sent: Monday, January 21, 2013 12:01 PM
> To: [email protected]
> Subject: Re: Loading data, hbase slower than Hive?
>
> Apart from this you can have some additional tweaks to improve
> put performance. Like, creating pre-split tables, making use of
> put(List<Put> puts) instead of normal put etc.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
> On Mon, Jan 21, 2013 at 11:46 AM, Austin Chungath <[email protected]> wrote:
>
> > Anoop,
> >
> > I am using HFileOutputFormat. I am doing nothing but splitting the data
> > from each row by the delimiter and sending it into their respective
> > columns.
> > Is there some kind of preprocessing or steps that I should do before this?
> > As suggested I will look into the above solutions and let you guys know
> > what the problem was. I might have to rethink the Rowkey design.
> >
> > Regards,
> > Austin.
> >
> > On Mon, Jan 21, 2013 at 11:24 AM, Anoop Sam John <[email protected]> wrote:
> >
> > > Austin,
> > > You are using HFileOutputFormat or TableOutputFormat?
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: Austin Chungath [[email protected]]
> > > Sent: Monday, January 21, 2013 11:15 AM
> > > To: [email protected]
> > > Subject: Re: Loading data, hbase slower than Hive?
> > >
> > > Thank you Tariq.
> > > I will let you know how things went after I implement these suggestions.
> > >
> > > Regards,
> > > Austin
> > >
> > > On Sun, Jan 20, 2013 at 2:42 AM, Mohammad Tariq <[email protected]> wrote:
> > >
> > > > Hello Austin,
> > > >
> > > > I am sorry for the late response.
> > > >
> > > > Asaf has made a very valid point. Rowkey design is very crucial.
> > > > Especially if the data is gonna be sequential (timeseries kinda thing).
> > > > You may end up with hotspotting problem. Use pre-split tables
> > > > or hash the keys to avoid that. It'll also allow you to fetch the
> > > > results faster.
> > > >
> > > > Warm Regards,
> > > > Tariq
> > > > https://mtariq.jux.com/
> > > > cloudfront.blogspot.com
> > > >
> > > > On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika <[email protected]> wrote:
> > > >
> > > > > Start by telling us your row key design.
> > > > > Check for pre splitting your table regions.
> > > > > I managed to get to 25 MB/sec write throughput in HBase using 1 region
> > > > > server. If your data is evenly spread you can get around 7 times that
> > > > > in a 10 region server environment. Should mean that 1 gig should take
> > > > > 4 sec.
> > > > >
> > > > > On Friday, January 18, 2013, praveenesh kumar wrote:
> > > > >
> > > > > > Hey,
> > > > > > Can someone throw some pointers on what would be the best practice
> > > > > > for bulk imports in HBase?
> > > > > > That would be really helpful.
> > > > > >
> > > > > > Regards,
> > > > > > Praveenesh
> > > > > >
> > > > > > On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq <[email protected]> wrote:
> > > > > >
> > > > > > > Just to add to whatever all the heavyweights have said above, your
> > > > > > > MR job may not be as efficient as the MR job corresponding to your
> > > > > > > Hive query. You can enhance the performance by setting the mapred
> > > > > > > config parameters wisely and by tuning your MR job.
> > > > > > >
> > > > > > > Warm Regards,
> > > > > > > Tariq
> > > > > > > https://mtariq.jux.com/
> > > > > > > cloudfront.blogspot.com
> > > > > > >
> > > > > > > On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hive is more for batch and HBase is for more of real time data.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Ram
> > > > > > > >
> > > > > > > > On Thu, Jan 17, 2013 at 10:30 PM, Anoop John <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > In case of Hive data insertion means placing the file under
> > > > > > > > > table path in HDFS. HBase needs to read the data and convert it
> > > > > > > > > into its format (HFiles). MR is doing this work. So this makes
> > > > > > > > > it clear that HBase will be slower. :)
> > > > > > > > > As Michael said the read operation...
> > > > > > > > >
> > > > > > > > > -Anoop-
> > > > > > > > >
> > > > > > > > > On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > > Problem: Hive took 6 mins to load a data set, HBase took
> > > > > > > > > > 1 hr 14 mins.
> > > > > > > > > > It's a 20 GB data set, approx 230 million records. The data is
> > > > > > > > > > in HDFS, single text file. The cluster is 11 nodes, 8 cores.
> > > > > > > > > >
> > > > > > > > > > I loaded this in Hive, partitioned by date and bucketed into
> > > > > > > > > > 32 and sorted. Time taken is 6 mins.
> > > > > > > > > >
> > > > > > > > > > I loaded the same data into HBase, in the same cluster by
> > > > > > > > > > writing a map reduce code. It took 1 hr 14 mins. The cluster
> > > > > > > > > > wasn't running anything else and assuming that the code that I
> > > > > > > > > > wrote is good enough, what is it that makes HBase slower than
> > > > > > > > > > Hive in loading the data?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Austin

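A note for anyone landing on this thread later: the "pre-split the table or hash the keys" advice can be sketched roughly as below. This is only an illustration in plain Python, not HBase client code and not what Austin's job actually did. The names `NUM_BUCKETS`, `salted_key` and `split_points` are made up for the example, and the bucket count of 10 is an assumption loosely borrowed from the "10 region servers" figure mentioned in the thread. The idea is that a short hash-derived prefix spreads sequential keys (timestamps, for instance) across buckets, and the table is pre-split on those same prefixes so each bucket falls into its own region.

```python
import hashlib
from collections import Counter
from typing import List

# Assumption for illustration: one bucket per region server (must be <= 100
# so the prefix always fits in two digits).
NUM_BUCKETS = 10


def salted_key(row_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a stable two-digit bucket derived from its hash."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{bucket:02d}|{row_key}"


def split_points(buckets: int = NUM_BUCKETS) -> List[str]:
    """Region boundaries: one split before every bucket except the first."""
    return [f"{b:02d}|" for b in range(1, buckets)]


if __name__ == "__main__":
    # Sequential, timestamp-like keys that would otherwise hit one region.
    keys = [f"2013-01-17T00:00:{i:02d}" for i in range(60)]
    spread = Counter(salted_key(k)[:2] for k in keys)
    print("split points:", split_points())
    print("rows per bucket:", sorted(spread.items()))
```

The split points would be handed to the table at creation time, and the bulk-load job would emit `salted_key(...)` instead of the raw key. One trade-off worth keeping in mind: with a salted prefix, a range scan over the original keys has to be issued once per bucket and the results merged, so this suits write-heavy loads better than scan-heavy ones.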