As I said already, this is not a "how to design time-series data in HBase" kind of question.

Usually, to fight skew in follower/following counts (some users have 1M followers, others follow 1M users), one identifies the top X users who have more than N followers and the top Y users who follow more than M users.

For X: we keep all updates for the last N days in memory and distribute/replicate them across the cluster.
For Y: for every user who tweets, we check his followers, and if some of them belong to group Y, we do an in-place update of that follower's record.

1. All users not in group Y always read updates one by one from the data store.
2. Users in group Y always have all updates indexed and read them with a single scan operation (in HBase lingo).
3. Users in group X cache their tweets in a fast in-memory store with replication - they have an extreme # of followers and their tweets are the hottest.
4. Users not in group X (99%) store tweets directly in HBase; in addition, if some of their followers are in group Y, those followers' indexes are updated (the tweets are stored directly in the Y user's record).

If you can implement something using only HBase - great, let us know ;)

-Vlad

On Wed, Jul 1, 2015 at 4:09 PM, Sleiman Jneidi <[email protected]> wrote:

> Thanks Stack, looks like a good read.
> Vladimir, I called it time-series because (ordering by time / filtering by
> the tweet owner) is the goal. To answer your questions, let's for now assume
> that it's not as massive as Twitter, because otherwise it will be very
> complicated, as you mentioned. So
>
> 1. How many updates per second in the system? We never mutate data, we
> write 500 tweets/sec.
> 2. How many users? 10000
> 3. Average # of followers per user? 250 users.
>
> Even with these modest numbers, the schema is still tricky to be highly
> optimised for reads. Any thoughts?
> Thanks.
>
>
> On Wed, Jul 1, 2015 at 11:36 PM, Vladimir Rodionov <[email protected]>
> wrote:
>
> > That is not a time-series modeling issue per se ... You can't come up
> > with anything until you get the basic performance/load SLA numbers
> >
> > 1. How many updates per second in the system?
> > 2. How many users?
> > 3. Average # of followers per user with percentiles up to 99.9%
> >
> > Twitter's architecture to support user-follower relationships is not
> > based on a single data store and is much more complex. Therefore, I
> > think, in this case everything will depend on ## 1. 2. 3.
> >
> > Scale matters.
> >
> > -Vlad
> >
> >
> > On Wed, Jul 1, 2015 at 2:17 PM, Stack <[email protected]> wrote:
> >
> > > To add to Amandeep's pointer, this one is good for concerns modeling
> > > timeseries:
> > > https://cloud.google.com/bigtable/pdf/CloudBigtableTimeSeries.pdf
> > >
> > > St.Ack
> > >
> > > On Wed, Jul 1, 2015 at 11:53 AM, Sleiman Jneidi <[email protected]>
> > > wrote:
> > >
> > > > Hello everyone, I am working on a schema design for a time series
> > > > database. Something very similar to Twitter where people can follow
> > > > each other and see their posts. I've looked at opentsdb but I think
> > > > my problem is more complicated because I don't have the leading
> > > > "metricid" in the row key. I've made several attempts so far but I
> > > > am not happy with the performance.
> > > >
> > > > 1. Md5(user)+timestamp. The problem with this is that when I want to
> > > > query the feed, I have to do a scan with the highest user
> > > > (alphabetical order) and the lowest and then add a column filter.
> > > > Getting the next batch is hard.
> > > >
> > > > 2. Md5(user)+day and then put the posts of the day in the columns
> > > > with the timestamp in the qualifier name. Not optimal, getting the
> > > > next batch is hard.
> > > >
> > > > So... What do you guys think? Any ideas for making this efficient or
> > > > possible?
> > > >
> > > > Thanks for your time in reading this.
> > > > Sleiman
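
For illustration only, here is a rough sketch of the group X / group Y scheme described at the top of this thread. It is plain Python with in-memory dictionaries standing in for HBase and for the replicated cache, so it shows the write and read paths rather than any real HBase API. The class name `FeedStore`, its method names and the two threshold parameters are hypothetical. It also assumes that tweets from every author, including group X, are pushed into the indexes of group Y followers, which the description above leaves open.

```python
from collections import defaultdict


class FeedStore:
    """Toy model of the X/Y split; dicts stand in for HBase and the cache."""

    def __init__(self, followers, n_followers, m_following):
        # followers: author -> set of users who follow that author
        self.followers = {u: set(f) for u, f in followers.items()}
        # Invert the graph: reader -> set of authors the reader follows
        self.following = defaultdict(set)
        for author, fans in self.followers.items():
            for fan in fans:
                self.following[fan].add(author)
        # Group X: authors with more than N followers
        self.group_x = {u for u, f in self.followers.items()
                        if len(f) > n_followers}
        # Group Y: readers who follow more than M users
        self.group_y = {u for u, f in self.following.items()
                        if len(f) > m_following}
        self.hot_cache = defaultdict(list)   # stand-in for the in-memory store
        self.tweets = defaultdict(list)      # stand-in for the HBase tweet table
        self.feed_index = defaultdict(list)  # per-reader index for group Y

    def post(self, author, ts, text):
        tweet = (ts, author, text)
        if author in self.group_x:
            self.hot_cache[author].append(tweet)   # item 3: hot authors
        else:
            self.tweets[author].append(tweet)      # item 4: everyone else
        # Assumption: every author's tweet is also written into the
        # record of each follower that belongs to group Y.
        for fan in self.followers.get(author, ()):
            if fan in self.group_y:
                self.feed_index[fan].append(tweet)

    def read_feed(self, reader, limit=20):
        if reader in self.group_y:
            # Item 2: one "scan" over the reader's own index
            items = list(self.feed_index.get(reader, ()))
        else:
            # Item 1: read each followed author's updates one by one
            items = []
            for author in self.following.get(reader, ()):
                source = (self.hot_cache if author in self.group_x
                          else self.tweets)
                items.extend(source.get(author, ()))
        # Newest first
        return sorted(items, reverse=True)[:limit]
```

The point of the split is that the expensive fan-out on write only happens for the few readers in group Y, and the expensive fan-in on read only touches the cache for the few authors in group X. With the numbers quoted in this thread (10000 users, about 250 followers each) both groups may well be empty, in which case the sketch degenerates to the plain one-by-one read path.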
