Hi James,

Great to hear that Phoenix supports this kind of table schema. Seems like
I'll go ahead with bucketing for time-series data.

Thanks,
Anil


On Tue, Sep 24, 2013 at 9:36 AM, James Taylor <[email protected]> wrote:

> Hey Anil,
> The solution you've described is the best we've found for Phoenix (inspired
> by the work of Alex at Sematext).
> You can do all of this in a few lines of SQL:
>
> CREATE TABLE event_data(
>     who VARCHAR, type SMALLINT, id BIGINT, when DATE, payload VARBINARY,
>     CONSTRAINT pk PRIMARY KEY (who, type, id))
>     IMMUTABLE_ROWS=true;  -- declare the event table as having immutable rows
> CREATE INDEX event_data_index ON event_data(when, type, who)
>     INCLUDE(payload)
>     SALT_BUCKETS=10;  -- salt the index since it'll create write hotspots otherwise
>
> The following query would display the event count per type across all
> users over the last week. It would automatically use the index:
>
> SELECT type, count(*) FROM event_data
> WHERE when > CURRENT_DATE() - 7 GROUP BY type;
>
> The following query would display the event count by type for a particular
> user. It would
> automatically use the data table:
>
> SELECT who, type, count(*) FROM event_data WHERE who = ? GROUP BY who,
> type;
>
> As far as the read cost associated with reading from a salted table, we've
> found that in most cases it actually performs better, because you get
> better parallelization. The case where it performs worse is a selective
> query that returns a smallish set of rows that would normally fall in the
> same block. There you end up reading an entire block for each row, whereas
> on an unsalted table those rows would be neighbors in the same block.
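The per-bucket fan-out behind that parallelization can be sketched as follows. This is a minimal illustration, not Phoenix internals; the 1-byte-salt/8-byte-timestamp key layout and the helper name are assumptions, with the bucket count matching SALT_BUCKETS=10 from the DDL above:

```python
import struct

SALT_BUCKETS = 10  # matches SALT_BUCKETS=10 in the index DDL

def scan_ranges(start_ts, end_ts, buckets=SALT_BUCKETS):
    """One (start_key, end_key) pair per salt bucket. Each range touches
    only the region(s) holding that bucket, so the client can issue all
    of them in parallel -- this is where the speedup comes from."""
    ranges = []
    for b in range(buckets):
        # 1-byte salt prefix + 8-byte big-endian timestamp (an assumed layout)
        start_key = bytes([b]) + struct.pack(">q", start_ts)
        end_key = bytes([b]) + struct.pack(">q", end_ts)
        ranges.append((start_key, end_key))
    return ranges

# A single logical time-range query becomes 10 parallelizable scans:
ranges = scan_ranges(1_379_000_000, 1_379_604_800)
assert len(ranges) == SALT_BUCKETS
```

The flip side James mentions also falls out of this sketch: a selective query must still open all ten ranges, each possibly pulling a whole block for a single row.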
>
> HTH,
>
> James
>
> On Tue, Sep 24, 2013 at 8:12 AM, anil gupta <[email protected]> wrote:
>
> > Inline
> >
> > On Mon, Sep 23, 2013 at 6:15 PM, Shahab Yunus <[email protected]> wrote:
> >
> > > Yeah, I saw that. In fact, that is why I recommended it to you, as I
> > > couldn't infer from your email whether you had already gone through
> > > that source or not.
> >
> > Yes, I was aware of that article, but my read pattern is slightly
> > different from it. We are using HBase as the data source for a RESTful
> > service. Even if my range scan finds 400 rows within a specified time
> > range, I only return the top 20 for one REST request. So if I do
> > bucketing (let's say bucket=10), then I will need to fetch 20 results
> > from each bucket and then do a merge sort on the client side to return
> > the final 20. You can assume that I need to return the 20 rows sorted
> > by timestamp.
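The client-side merge described here can be sketched in a few lines. A hypothetical helper, assuming each bucket's scan already returns its rows sorted by timestamp (ascending, for simplicity):

```python
import heapq

def top_n_across_buckets(per_bucket_rows, n=20):
    """per_bucket_rows: one sorted list of timestamps per bucket.
    heapq.merge streams the k sorted lists into one sorted sequence,
    from which we keep only the global top n."""
    return list(heapq.merge(*per_bucket_rows))[:n]

# Toy data: 3 buckets, each sorted ascending by timestamp.
buckets = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(top_n_across_buckets(buckets, n=5))  # [1, 2, 3, 4, 5]
```

Fetching n rows from each bucket guarantees the true top n are among the candidates, at the cost of reading up to n * buckets rows per request.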
> >
>
>
> >
> > > A source that did the exact same thing and discusses it in much more
> > > detail, with concerns aligning with yours (in fact, I think some of
> > > the authors/creators of that link/group are members of this community
> > > as well.)
> >
> > Do you know what the outcome of their experiment was? Do you have any
> > link for that? Thanks for your time and help.
> >
> >
> > >
> > > Regards,
> > > Shahab
> > >
> > >
> > > On Mon, Sep 23, 2013 at 8:41 PM, anil gupta <[email protected]> wrote:
> > >
> > > > Hi Shahab,
> > > >
> > > > If you read my solution carefully, I am already doing that.
> > > >
> > > > Thanks,
> > > > Anil Gupta
> > > >
> > > >
> > > > On Mon, Sep 23, 2013 at 3:51 PM, Shahab Yunus <[email protected]> wrote:
> > > >
> > > > >
> > > > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > > > >
> > > > > Here you can find the discussion, trade-offs, and working
> > > > > code/API (even for M/R) about this and the approach you are
> > > > > trying out.
> > > > >
> > > > > Regards,
> > > > > Shahab
> > > > >
> > > > >
> > > > > On Mon, Sep 23, 2013 at 5:41 PM, anil gupta <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I have a secondary index (inverted index) table with a rowkey
> > > > > > based on the timestamp of an event. Assume the rowkey is
> > > > > > <TimeStamp in Epoch>. I also store some extra (apart from the
> > > > > > main_table rowkey) columns in that table for doing filtering.
> > > > > >
> > > > > > The requirement is to do range-based scans on the basis of the
> > > > > > time of the event, hence the index with this rowkey. I cannot
> > > > > > use a hashing or MD5-digest solution because then I cannot do
> > > > > > range-based scans. And I already have an index like OpenTSDB in
> > > > > > another table for the same dataset. (I have many secondary
> > > > > > indexes for the same data set.)
> > > > > >
> > > > > > Problem: when we increase the write workload during stress
> > > > > > tests, the time secondary index becomes a bottleneck due to the
> > > > > > famous region hotspotting problem.
> > > > > > Solution: I am thinking of adding a prefix of { (<TimeStamp in
> > > > > > Epoch> % 10) = bucket } to the rowkey. Then my rowkey will
> > > > > > become: <Bucket><TimeStamp in Epoch>
> > > > > > By using the above rowkey I can at least alleviate the *WRITE*
> > > > > > problem. (I don't think the problem can be fixed permanently
> > > > > > because of the use-case requirements. I would love to be proven
> > > > > > wrong.) However, with the above rowkey, when I want to *READ*
> > > > > > data, every single range scan has to read data from 10
> > > > > > different regions. This extra load on reads is scaring me a bit.
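The proposed <Bucket><TimeStamp in Epoch> key can be sketched as follows; the 1-byte prefix and 8-byte big-endian timestamp widths are assumptions for illustration, and the helper name is hypothetical:

```python
import struct

NUM_BUCKETS = 10

def bucketed_rowkey(epoch_ts):
    """Prefix the epoch timestamp with (timestamp % NUM_BUCKETS)."""
    bucket = epoch_ts % NUM_BUCKETS
    # 1-byte bucket prefix + 8-byte big-endian timestamp keeps rows
    # within a bucket sorted by time, while spreading sequential
    # timestamps across NUM_BUCKETS regions on write.
    return struct.pack(">B", bucket) + struct.pack(">q", epoch_ts)

# Ten sequential timestamps land in ten different buckets, so no single
# region takes the whole write burst:
keys = [bucketed_rowkey(t) for t in range(1_379_000_000, 1_379_000_010)]
assert len({k[0] for k in keys}) == NUM_BUCKETS
```

The read cost described above follows directly: a time-range scan must be reissued once per bucket prefix, i.e. against up to 10 regions.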
> > > > > >
> > > > > > I am wondering if anyone has a better suggestion/approach to
> > > > > > solve this problem given the constraints I have. Looking for
> > > > > > feedback from the community.
> > > > > >
> > > > > > --
> > > > > > Thanks & Regards,
> > > > > > Anil Gupta
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> > > >
> > >
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
> >
>



-- 
Thanks & Regards,
Anil Gupta
