It would be of great value if HBaseWD could avoid a full table scan. I haven't
spent much time with Phoenix, but I am very interested in what Vladimir said,
especially that Phoenix is faster than any M/R job.
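
For illustration, here is a rough sketch of the idea (plain HBase client API,
not HBaseWD's actual classes; the bucket count, table name, and key layout
below are assumptions): with a bucket prefix on every row key, a date range
becomes one bounded scan per bucket instead of a single full table scan.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedDateRangeScan {

    // Hypothetical bucket count; HBaseWD's prefix distributors work on the same principle.
    private static final int NUM_BUCKETS = 16;

    // Write path: prepend a bucket byte derived from the natural key (e.g. date + id),
    // so rows with consecutive dates spread across regions instead of one region server.
    static byte[] toDistributedKey(byte[] originalKey) {
        int bucket = (Bytes.hashCode(originalKey) & 0x7fffffff) % NUM_BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket }, originalKey);
    }

    // Read path: a date range turns into NUM_BUCKETS bounded scans, one per prefix,
    // rather than a single full table scan.
    static List<Result> scanDateRange(HTable table, String startDate, String stopDate)
            throws IOException {
        List<Result> results = new ArrayList<Result>();
        for (int b = 0; b < NUM_BUCKETS; b++) {
            byte[] prefix = new byte[] { (byte) b };
            Scan scan = new Scan(Bytes.add(prefix, Bytes.toBytes(startDate)),
                                 Bytes.add(prefix, Bytes.toBytes(stopDate)));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    results.add(r);
                }
            } finally {
                scanner.close();
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");   // hypothetical table name
        // Rows for 2014-04-01 and 2014-04-02 (the stop key is exclusive).
        scanDateRange(table, "2014-04-01", "2014-04-03");
        table.close();
    }
}

If the sematext post linked later in this thread is accurate, HBaseWD wraps
essentially this pattern (its distributed scanner merges the per-prefix scans),
so the cost is a handful of bounded scans rather than a read of the whole table.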


2014-04-24 1:33 GMT+08:00 Vladimir Rodionov <[email protected]>:

> I suggest you use Phoenix:
>
> 1. It does all these salting things automatically
> 2. Does not require M/R
> 3. Faster than any M/R job
> 4. Allows you to use SQL to access your data.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: Tao Xiao [[email protected]]
> Sent: Wednesday, April 23, 2014 8:09 AM
> To: user
> Subject: Re: How to get specified rows and avoid full table scanning?
>
> Hi all,
>
> Thank you all for your replies. After examining HBaseWD, OpenTSDB and
> Phoenix, I feel HBaseWD should meet my requirements.
>
> My use case is as follows: tens of millions of rows are appended to a table,
> and each row has a date property, say 2014-04-01. I will submit a MapReduce
> job whose input is the rows from certain specified days, so I need to filter
> out rows from any other days. If the date is stored as part of the row key,
> I hope I can use a scan specifying the start and stop key. At the same time,
> measures should be taken to prevent the hot spot problem, because time-series
> row keys naturally tend to be stored contiguously.
>
> HBaseWD avoids the hot spot problem by decorating the original row key with
> a prefix.
>
> HBaseWD also makes it possible for a MapReduce job to process data of a
> specified range (by creating a scan instance and passing it a *startKey* and
> a *stopKey*), *but I'm not sure whether this would trigger a full table
> scan*.
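>
> For illustration, a minimal sketch of a bounded-scan MapReduce setup (plain
> HBase API; the table name and mapper are placeholders, and if the keys carry
> an HBaseWD prefix the single range below would have to be expanded to one
> range per prefix):
>
> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> import org.apache.hadoop.hbase.mapreduce.TableMapper;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
>
> public class DateRangeMrJob {
>
>   // Placeholder mapper: just counts the rows handed to it by the bounded scan.
>   static class CountMapper extends TableMapper<NullWritable, NullWritable> {
>     @Override
>     protected void map(ImmutableBytesWritable key, Result value, Context context)
>         throws IOException, InterruptedException {
>       context.getCounter("job", "rows").increment(1);
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     Job job = Job.getInstance(conf, "rows-for-2014-04-01-and-02");
>     job.setJarByClass(DateRangeMrJob.class);
>
>     // Only the regions overlapping [startRow, stopRow) are read.
>     Scan scan = new Scan(Bytes.toBytes("2014-04-01"), Bytes.toBytes("2014-04-03"));
>     scan.setCaching(500);
>     scan.setCacheBlocks(false); // recommended for MapReduce scans
>
>     TableMapReduceUtil.initTableMapperJob(
>         "events", scan, CountMapper.class,   // "events" is a placeholder table name
>         NullWritable.class, NullWritable.class, job);
>     job.setOutputFormatClass(NullOutputFormat.class);
>     job.setNumReduceTasks(0);
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }
>
> Only the regions overlapping the start/stop rows are read, so a bounded scan
> by itself is not a full table scan.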
>
>
>
>
>
> 2014-04-22 2:05 GMT+08:00 James Taylor <[email protected]>:
>
> > Tao,
> > Just wanted to give you a couple of relevant pointers to Apache Phoenix
> > for your particular problem:
> > - Preventing hotspotting by salting your table:
> > http://phoenix.incubator.apache.org/salted.html
> > - Pig Integration for your map/reduce job:
> > http://phoenix.incubator.apache.org/pig_integration.html
> >
> > What kind of processing will you be doing in your map-reduce job? FWIW,
> > Phoenix will allow you to run SQL queries directly over your data, so that
> > might be an alternative for some of the processing you need to do.
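> >
> > For illustration, a minimal sketch of that over Phoenix's JDBC driver (the
> > table, column names, and bucket count are hypothetical; SALT_BUCKETS is the
> > salting option described at the link above):
> >
> > import java.sql.Connection;
> > import java.sql.DriverManager;
> > import java.sql.PreparedStatement;
> > import java.sql.ResultSet;
> >
> > public class PhoenixSaltedExample {
> >   public static void main(String[] args) throws Exception {
> >     // Assumes the Phoenix client jar is on the classpath; the ZK quorum is an example.
> >     Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
> >
> >     // SALT_BUCKETS makes Phoenix prepend a hashed salt byte to every row key,
> >     // spreading the sequential, date-leading keys across region servers.
> >     conn.createStatement().execute(
> >         "CREATE TABLE IF NOT EXISTS EVENTS ("
> >       + " EVENT_DATE DATE NOT NULL,"
> >       + " EVENT_ID BIGINT NOT NULL,"
> >       + " PAYLOAD VARCHAR,"
> >       + " CONSTRAINT PK PRIMARY KEY (EVENT_DATE, EVENT_ID)"
> >       + ") SALT_BUCKETS = 16");
> >
> >     // Phoenix turns the date-range filter into bounded scans over the salt
> >     // buckets, so the query does not fall back to a full table scan.
> >     PreparedStatement ps = conn.prepareStatement(
> >         "SELECT COUNT(*) FROM EVENTS WHERE EVENT_DATE >= ? AND EVENT_DATE < ?");
> >     ps.setDate(1, java.sql.Date.valueOf("2014-04-01"));
> >     ps.setDate(2, java.sql.Date.valueOf("2014-04-03"));
> >     ResultSet rs = ps.executeQuery();
> >     if (rs.next()) {
> >       System.out.println("rows: " + rs.getLong(1));
> >     }
> >     conn.close();
> >   }
> > }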
> >
> > Thanks,
> > James
> >
> >
> > On Mon, Apr 21, 2014 at 9:20 AM, Jean-Marc Spaggiari <
> > [email protected]> wrote:
> >
> > > Hi Tao,
> > >
> > > also, if you are thinking about time series, you can take a look at
> > > OpenTSDB: http://opentsdb.net/
> > >
> > > JM
> > >
> > >
> > > 2014-04-21 11:56 GMT-04:00 Ted Yu <[email protected]>:
> > >
> > > > There are several alternatives, one of which is HBaseWD:
> > > >
> > > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > > >
> > > > You can also take a look at Phoenix.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > On Mon, Apr 21, 2014 at 8:04 AM, Tao Xiao <[email protected]>
> > > > wrote:
> > > >
> > > > > I have a big table and rows will be added to this table each day. I want to
> > > > > run a MapReduce job over this table and select rows from several days as the
> > > > > job's input data. How can I achieve this?
> > > > >
> > > > > If I prefix the rowkey with the date, I can easily select one day's data as
> > > > > the job's input, but this will cause a hot spot problem because hundreds of
> > > > > millions of rows will be added to this table each day and the data will
> > > > > probably go to a single region server. A secondary index would be good for
> > > > > queries but not good for a batch processing job.
> > > > >
> > > > > Are there any other ways?
> > > > >
> > > > > Are there any other frameworks which can achieve this goal more easily?
> > > > > Shark? Stinger? HSearch?
> > > > >
> > > >
> > >
> >
>
