It would be of great value if HBaseWD could avoid a full table scan. I haven't spent much time with Phoenix, but I am very interested in what Vladimir said, especially the claim that Phoenix is faster than any M/R job.
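For reference, here is a minimal sketch of why a bucket-prefixed (salted) key scheme, of the kind discussed in this thread, does not by itself force a full table scan: a date range over the original keys can be served by one bounded scan per bucket. This is plain Python and does not use the HBaseWD or HBase APIs. The bucket count, the MD5-based bucket choice, the `date|id` key layout and all function names are assumptions made only for illustration.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical bucket count, chosen only for this sketch


def bucket_of(original_key: bytes, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a bucket deterministically from a hash of the original key."""
    return hashlib.md5(original_key).digest()[0] % num_buckets


def salted_key(original_key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the original key with a one-byte bucket id to spread writes."""
    return bytes([bucket_of(original_key, num_buckets)]) + original_key


def scan_ranges(start_key: bytes, stop_key: bytes,
                num_buckets: int = NUM_BUCKETS):
    """Return one [start, stop) range per bucket for the original key range."""
    return [(bytes([b]) + start_key, bytes([b]) + stop_key)
            for b in range(num_buckets)]


def in_any_range(key: bytes, ranges) -> bool:
    """True if the salted key falls inside at least one per-bucket range."""
    return any(start <= key < stop for start, stop in ranges)


# Rows dated 2014-04-01 through 2014-04-03 (stop key is exclusive).
ranges = scan_ranges(b"2014-04-01", b"2014-04-04")
wanted = salted_key(b"2014-04-02|event-000123")
unwanted = salted_key(b"2014-04-05|event-000456")
print(len(ranges), in_any_range(wanted, ranges), in_any_range(unwanted, ranges))
```

The cost is NUM_BUCKETS bounded scans instead of one, and each of them touches only the requested dates within its bucket. Whether a given library issues the scans this way is something to confirm in its own documentation.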
2014-04-24 1:33 GMT+08:00 Vladimir Rodionov <[email protected]>:

> I suggest you try Phoenix:
>
> 1. It does all these salting things automatically
> 2. Does not require M/R
> 3. Faster than any M/R job
> 4. Allows you to use SQL to access your data.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: Tao Xiao [[email protected]]
> Sent: Wednesday, April 23, 2014 8:09 AM
> To: user
> Subject: Re: How to get specified rows and avoid full table scanning?
>
> Hi all,
>
> Thank you all for your replies. After examining HBaseWD, OpenTSDB and
> Phoenix, I feel HBaseWD should meet my requirements.
>
> My business is as follows: Tens of millions of rows are appended to a
> table and each row has a date property, say 2014-04-01. I will submit a
> MapReduce job, whose input is some days of rows from that table, so I need
> to filter out rows of days other than what I specify. If the date is stored
> as part of the row key, I hope I can use a scan specifying the start and
> end key. At the same time measures should be taken to prevent the hot spot
> problem from happening, because naturally time-series row keys tend to be
> stored contiguously.
>
> HBaseWD avoids the problem of hot spot by decorating the original row key
> with a prefix.
>
> HBaseWD also makes it possible for a MapReduce job to process data of a
> specified range (by creating a scan instance and passing it a *startKey* and a
> *stopKey*), *but I'm not sure whether this would trigger a full table
> scan*.
>
> 2014-04-22 2:05 GMT+08:00 James Taylor <[email protected]>:
>
> > Tao,
> > Just wanted to give you a couple of relevant pointers to Apache Phoenix for
> > your particular problem:
> > - Preventing hotspotting by salting your table:
> > http://phoenix.incubator.apache.org/salted.html
> > - Pig Integration for your map/reduce job:
> > http://phoenix.incubator.apache.org/pig_integration.html
> >
> > What kind of processing will you be doing in your map-reduce job? FWIW,
> > Phoenix will allow you to run SQL queries directly over your data, so that
> > might be an alternative for some of the processing you need to do.
> >
> > Thanks,
> > James
> >
> >
> > On Mon, Apr 21, 2014 at 9:20 AM, Jean-Marc Spaggiari <[email protected]> wrote:
> >
> > > Hi Tao,
> > >
> > > also, if you are thinking about time series, you can take a look at OpenTSDB
> > > http://opentsdb.net/
> > >
> > > JM
> > >
> > >
> > > 2014-04-21 11:56 GMT-04:00 Ted Yu <[email protected]>:
> > >
> > > > There're several alternatives.
> > > > One of which is HBaseWD :
> > > >
> > > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > > >
> > > > You can also take a look at Phoenix.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > On Mon, Apr 21, 2014 at 8:04 AM, Tao Xiao <[email protected]> wrote:
> > > >
> > > > > I have a big table and rows will be added to this table each day. I wanna
> > > > > run a MapReduce job over this table and select rows of several days as the
> > > > > job's input data. How can I achieve this?
> > > > >
> > > > > If I prefix the rowkey with the date, I can easily select one day's data as
> > > > > the job's input, but this will cause a hot spot problem because hundreds of
> > > > > millions of rows will be added to this table each day and the data will
> > > > > probably go to a single region server.
> > > > > A secondary index would be good for
> > > > > queries but not good for a batch processing job.
> > > > >
> > > > > Are there any other ways?
> > > > >
> > > > > Are there any other frameworks which can achieve this goal more easily?
> > > > > Shark? Stinger? HSearch?
