As I said already, this is not a "how to design time-series data in HBase"
kind of question.

Usually, to fight heavy skew in follower/following counts (some users have
1M followers, others follow 1M users), one identifies the top X users who
have more than N followers and the top Y users who follow more than M users.
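
A minimal sketch of that classification step, assuming the follow graph is
available as (follower, followee) pairs. The function name `classify` and
the parameter names are hypothetical, chosen only for illustration; the
thresholds correspond to N and M above.

```python
from collections import defaultdict

def classify(follow_edges, n_followers, m_following):
    """Split users into group X (many followers) and group Y (follow many).

    follow_edges: iterable of (follower, followee) pairs (assumed input shape).
    """
    followers = defaultdict(set)   # user -> set of users who follow them
    following = defaultdict(set)   # user -> set of users they follow
    for follower, followee in follow_edges:
        followers[followee].add(follower)
        following[follower].add(followee)

    # Group X: more than N followers. Group Y: follows more than M users.
    group_x = {u for u, f in followers.items() if len(f) > n_followers}
    group_y = {u for u, f in following.items() if len(f) > m_following}
    return group_x, group_y
```

In practice the counts would come from counters maintained alongside the
follow graph rather than from a full pass over every edge.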

For X: we keep all updates in memory for the last N days and
distribute/replicate them across the cluster.
For Y: for every user who tweets, we check his followers, and if some of
them belong to group Y, we do an in-place update for each of those followers.

1. All users not in group Y always read updates one by one from the data
store.
2. Users in group Y always have all updates indexed and read them using
one scan operation (in HBase lingo).

3. Users in group X cache their tweets in a fast in-memory store with
replication: they have an extreme # of followers and their tweets are the
hottest.
4. Users not in group X (99%) store tweets directly in HBase. In addition,
if some of their followers are in group Y, those followers' indexes are
updated (tweets are stored directly in the Y user's record).
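
The write and read paths in points 1-4 can be sketched roughly as follows.
This is an in-memory illustration only: plain dictionaries stand in for
HBase and for the replicated memory store, and the class and method names
(`FeedStore`, `post`, `feed`) are hypothetical. It also assumes that group Y
readers merge their index with the hot cache for any X authors they follow,
which the points above do not spell out.

```python
import heapq
from collections import defaultdict

class FeedStore:
    def __init__(self, followers, group_x, group_y):
        self.followers = followers            # author -> set of follower ids
        self.following = defaultdict(set)     # reader -> set of authors
        for author, fans in followers.items():
            for fan in fans:
                self.following[fan].add(author)
        self.group_x = group_x
        self.group_y = group_y
        self.hot_cache = defaultdict(list)    # stand-in for replicated memory store
        self.tweets = defaultdict(list)       # stand-in for per-author rows in HBase
        self.y_index = defaultdict(list)      # stand-in for per-Y-user index rows

    def post(self, author, ts, text):
        entry = (ts, author, text)
        if author in self.group_x:
            # Point 3: hot authors go to the memory store only.
            self.hot_cache[author].append(entry)
            return
        # Point 4: everyone else writes to the data store...
        self.tweets[author].append(entry)
        # ...and updates the index of each follower that is in group Y.
        for fan in self.followers.get(author, ()):
            if fan in self.group_y:
                self.y_index[fan].append(entry)

    def feed(self, user, limit=10):
        authors = self.following[user]
        hot = [self.hot_cache[a] for a in authors if a in self.group_x]
        if user in self.group_y:
            # Point 2: one "scan" over the precomputed index (plus hot cache).
            sources = [self.y_index[user]] + hot
        else:
            # Point 1: read each followed author one by one.
            sources = [self.tweets[a] for a in authors if a not in self.group_x] + hot
        merged = [entry for src in sources for entry in src]
        return heapq.nlargest(limit, merged)  # newest first by timestamp
```

The trade-off is the usual one: group Y pays at write time (index updates)
to make reads cheap, while everyone else pays at read time.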

If you can implement something using only HBase - great, let us know ;)

-Vlad

On Wed, Jul 1, 2015 at 4:09 PM, Sleiman Jneidi <[email protected]>
wrote:

> Thanks Stack, looks like a good read.
> Vladimir, I called it time-series because (ordering by time/ filtering by
> the tweet owner) is the goal. To answer your questions, let's for now assume
> that it's not as massive as Twitter because otherwise it will be very
> complicated as you mentioned. So
>
> 1. How many updates per second in the system? We never mutate data, we
> write 500 tweets/sec.
> 2. How many users? 10000
> 3. Average # of followers per user? 250 users.
>
> Even with these modest numbers, it is still tricky to make the schema
> highly optimised for reads. Any thoughts?
> Thanks.
>
>
> On Wed, Jul 1, 2015 at 11:36 PM, Vladimir Rodionov <[email protected]
> >
> wrote:
>
> > That is not a time-series modeling issue per se ... You can't come up
> > with anything until you get the basic performance/load SLA numbers
> >
> > 1. How many updates per second in the system?
> > 2. How many users?
> > 3. Average # of followers per user with percentiles up to 99.9%
> >
> > Twitter's architecture to support user-follower relationships is not
> > based on a single data store and is much more complex. Therefore, I
> > think, in this case everything will depend on ## 1. 2. 3.
> >
> > Scale matters.
> >
> > -Vlad
> >
> >
> > On Wed, Jul 1, 2015 at 2:17 PM, Stack <[email protected]> wrote:
> >
> > > To add to Amandeep's pointer, this one is good for concerns modeling
> > > timeseries:
> > > https://cloud.google.com/bigtable/pdf/CloudBigtableTimeSeries.pdf
> > >
> > > St.Ack
> > >
> > > On Wed, Jul 1, 2015 at 11:53 AM, Sleiman Jneidi <
> > [email protected]>
> > > wrote:
> > >
> > > > Hello everyone, I am working on a schema design for a time series
> > > > database. Something very similar to Twitter where people can follow
> > > > each other and see their posts. I've looked at opentsdb but I think my
> > > > problem is more complicated because I don't have the leading
> > > > "metricid" in the row key. I've made several attempts so far but I am
> > > > not happy with the performance.
> > > >
> > > > 1. Md5(user)+timestamp. The problem with this is that when I want to
> > > > query the feed, I have to do a scan with the highest user
> > > > (alphabetical order) and the lowest and then add a column filter.
> > > > Getting the next batch is hard.
> > > >
> > > > 2. Md5(user)+day and then put the posts of the day in the columns
> > > > with timestamp in the qualifier name. Not optimal, getting the next
> > > > batch is hard.
> > > >
> > > > So... What do you guys think? Any ideas for making this efficient or
> > > > possible?
> > > >
> > > > Thanks for your time in reading this.
> > > > Sleiman
> > > >
> > >
> >
>
