Re: Loading data, hbase slower than Hive?

Mohammad Tariq Sun, 20 Jan 2013 22:41:08 -0800

Thank you so much for pointing out the mistake, sir.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com



On Mon, Jan 21, 2013 at 12:06 PM, Anoop Sam John <[email protected]> wrote:

> @Mohammad
> As he is using HFileOutputFormat, there is no put call happening on
> HTable. In this case the MR job creates the HFiles directly, without using
> the normal HBase write path. The HFiles are then loaded into the table
> regions using the HRS API.
> In this case the number of reducers equals the number of table regions. So
> Austin, you can check with a proper pre-split of the table.
>
> -Anoop-
> ________________________________________
> From: Mohammad Tariq [[email protected]]
> Sent: Monday, January 21, 2013 12:01 PM
> To: [email protected]
> Subject: Re: Loading data, hbase slower than Hive?
>
> Apart from this, there are some additional tweaks that can improve
> put performance, such as creating pre-split tables and using
> put(List<Put> puts) instead of the single-row put.
>
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Mon, Jan 21, 2013 at 11:46 AM, Austin Chungath <[email protected]
> >wrote:
>
> > Anoop,
> >
> > I am using HFileOutputFormat. I am doing nothing but splitting each row
> > by the delimiter and sending the fields into their respective columns.
> > Is there some kind of preprocessing or steps that I should do before this?
> > As suggested I will look into the above solutions and let you guys know
> > what the problem was. I might have to rethink the rowkey design.
> >
> > Regards,
> > Austin.
> >
> > On Mon, Jan 21, 2013 at 11:24 AM, Anoop Sam John <[email protected]>
> > wrote:
> >
> > > Austin,
> > >         Are you using HFileOutputFormat or TableOutputFormat?
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: Austin Chungath [[email protected]]
> > > Sent: Monday, January 21, 2013 11:15 AM
> > > To: [email protected]
> > > Subject: Re: Loading data, hbase slower than Hive?
> > >
> > > Thank you Tariq.
> > > I will let you know how things went after I implement these suggestions.
> > >
> > > Regards,
> > > Austin
> > >
> > > On Sun, Jan 20, 2013 at 2:42 AM, Mohammad Tariq <[email protected]>
> > > wrote:
> > >
> > > > Hello Austin,
> > > >
> > > >           I am sorry for the late response.
> > > >
> > > > Asaf has made a very valid point. Rowkey design is crucial,
> > > > especially if the data is going to be sequential (time series, for
> > > > example). You may end up with a hotspotting problem. Use pre-split
> > > > tables or hash the keys to avoid that. It will also let you fetch the
> > > > results faster.
> > > >
> > > > Warm Regards,
> > > > Tariq
> > > > https://mtariq.jux.com/
> > > > cloudfront.blogspot.com
> > > >
> > > >
> > > > On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika <[email protected]>
> > > > wrote:
> > > >
> > > > > Start by telling us your row key design.
> > > > > Look into pre-splitting your table regions.
> > > > > I managed to get 25 MB/sec write throughput in HBase using 1 region
> > > > > server. If your data is evenly spread you can get around 7 times that
> > > > > in a 10 region server environment. That should mean that 1 GB takes
> > > > > 4 sec.
> > > > >
> > > > >
> > > > > On Friday, January 18, 2013, praveenesh kumar wrote:
> > > > >
> > > > > > Hey,
> > > > > > Can someone give some pointers on the best practice for bulk
> > > > > > imports in HBase?
> > > > > > That would be really helpful.
> > > > > >
> > > > > > Regards,
> > > > > > Praveenesh
> > > > > >
> > > > > > On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Just to add to whatever all the heavyweights have said above,
> > > > > > > your MR job may not be as efficient as the MR job corresponding
> > > > > > > to your Hive query. You can enhance the performance by setting
> > > > > > > the mapred config parameters wisely and by tuning your MR job.
> > > > > > >
> > > > > > > Warm Regards,
> > > > > > > Tariq
> > > > > > > https://mtariq.jux.com/
> > > > > > > cloudfront.blogspot.com
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan <
> > > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > > > Hive is more for batch and HBase is more for real-time data.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Ram
> > > > > > > >
> > > > > > > > On Thu, Jan 17, 2013 at 10:30 PM, Anoop John <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > In case of Hive, data insertion means placing the file
> > > > > > > > > under the table path in HDFS. HBase needs to read the data
> > > > > > > > > and convert it into its own format (HFiles). MR is doing
> > > > > > > > > this work. So this makes it clear that HBase will be
> > > > > > > > > slower. :)  As Michael said the read operation...
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -Anoop-
> > > > > > > > >
> > > > > > > > > On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > >   Hi,
> > > > > > > > > > Problem: Hive took 6 mins to load a data set; HBase took
> > > > > > > > > > 1 hr 14 mins.
> > > > > > > > > > It's a 20 GB data set, approx. 230 million records. The
> > > > > > > > > > data is in HDFS, in a single text file. The cluster is
> > > > > > > > > > 11 nodes, 8 cores.
> > > > > > > > > >
> > > > > > > > > > I loaded this into Hive, partitioned by date, bucketed
> > > > > > > > > > into 32 and sorted. Time taken was 6 mins.
> > > > > > > > > >
> > > > > > > > > > I loaded the same data into HBase, on the same cluster,
> > > > > > > > > > by writing a MapReduce job. It took 1 hr 14 mins. The
> > > > > > > > > > cluster wasn't running anything else. Assuming that the
> > > > > > > > > > code I wrote is good enough, what is it that makes HBase
> > > > > > > > > > slower than Hive in loading the data?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Austin
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
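For reference, the throughput figures quoted in the thread above can be checked with a few lines of arithmetic. This is only a rough sketch: it takes Asaf's 25 MB/sec and "around 7 times" figures and Austin's 20 GB / 1 hr 14 mins at face value, and it assumes binary gigabytes (1 GB = 1024 MB), which the thread does not specify. With those assumptions the estimate for 1 GB comes out nearer 6 sec than 4 sec, and the observed load ran at well under a twentieth of the estimated cluster rate.

```python
# Figures quoted in the thread.
PER_SERVER_MB_S = 25        # Asaf: write throughput with 1 region server
SCALING_FACTOR = 7          # Asaf: "around 7 times that" with 10 region servers
DATASET_GB = 20             # Austin: size of the data set
OBSERVED_SECONDS = 74 * 60  # Austin: HBase load took 1 hr 14 mins

MB_PER_GB = 1024            # assumption: binary gigabytes


def load_seconds(size_gb, throughput_mb_s):
    """Seconds needed to write size_gb at a sustained rate in MB/sec."""
    return size_gb * MB_PER_GB / throughput_mb_s


cluster_mb_s = PER_SERVER_MB_S * SCALING_FACTOR
one_gb_seconds = load_seconds(1, cluster_mb_s)
dataset_seconds = load_seconds(DATASET_GB, cluster_mb_s)
observed_mb_s = DATASET_GB * MB_PER_GB / OBSERVED_SECONDS

print(f"estimated cluster rate: {cluster_mb_s} MB/sec")
print(f"estimated time for 1 GB: {one_gb_seconds:.1f} sec")
print(f"estimated time for {DATASET_GB} GB: {dataset_seconds:.0f} sec")
print(f"observed rate: {observed_mb_s:.1f} MB/sec")
```

The gap between the estimated and observed rates is consistent with the advice in the thread that row key design and region pre-splitting are the first things to check.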

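The "pre-split the table or hash the keys" advice in the thread can be illustrated with a small, library-free sketch. Everything here is hypothetical and not taken from the thread: the bucket count, the two-digit prefix format, the `|` separator and the function names are illustrative choices, and MD5 is used only as a convenient stable hash. The idea is that a hash-derived prefix spreads sequential keys across buckets, and the bucket prefixes double as split points when the table is created.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical: roughly one bucket per initial region


def salted_key(row_key, buckets=NUM_BUCKETS):
    """Prefix row_key with a two-digit bucket derived from its hash.

    The prefix is stable for a given key, so the same row always lands
    in the same bucket. Two digits limits this sketch to 100 buckets.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{bucket:02d}|{row_key}"


def split_points(buckets=NUM_BUCKETS):
    """Return the buckets - 1 boundary keys that separate the buckets.

    Passing these as split keys at table creation would give one region
    per bucket, since every salted key starts with its bucket prefix.
    """
    return [f"{b:02d}" for b in range(1, buckets)]


if __name__ == "__main__":
    # Sequential, time-ordered keys no longer sort next to each other.
    for i in range(5):
        print(salted_key(f"2013-01-17#{i:09d}"))
    print(split_points(4))
```

One trade-off to keep in mind with this kind of salting: a range scan over the original key order has to be issued once per bucket and merged, so it suits workloads dominated by writes and point lookups.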