Yes, you are correct, event3 never emits for the time "10:07".
The proper result table is, as you mention:

========================
event1 | event2
event2 | event3
event3 |

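To double-check the table, here is a minimal Python sketch that simulates the begin/after emits in memory. It is not real Hadoop or HBase code: the function names (map_event, reduce_minute, run) are made up for illustration, times are assumed to be "HH:MM" strings on the same day, and two events beginning in the same minute will not see each other.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%H:%M"  # assumed time format, minute granularity


def map_event(event_id, time_str, window):
    """Emit one 'begin' pair at the event time and 'after' pairs for the
    window - 1 minutes before it, as in the T=5 example from the thread."""
    t = datetime.strptime(time_str, FMT)
    yield time_str, ("begin", event_id)
    for back in range(1, window):
        yield (t - timedelta(minutes=back)).strftime(FMT), ("after", event_id)


def reduce_minute(values):
    """For one minute key, pair every event that begins there with the
    events whose window covers that minute."""
    begins = [event for tag, event in values if tag == "begin"]
    afters = [event for tag, event in values if tag == "after"]
    for event in begins:
        yield event, afters


def run(events, window):
    """Group the map output by key, then reduce; returns the second table."""
    grouped = defaultdict(list)
    for event_id, time_str in events:
        for key, value in map_event(event_id, time_str, window):
            grouped[key].append(value)
    table = {}
    for key in sorted(grouped):
        for event, afters in reduce_minute(grouped[key]):
            table[event] = afters
    return table


events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
print(run(events, window=5))
# {'event1': ['event2'], 'event2': ['event3'], 'event3': []}
```

With window=7 the same sketch returns event1 | event2, event3, which is the table from the earlier example.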
I guess I was thinking about the old example (T=7). :)

On Thu, Jan 31, 2013 at 12:39 PM, Oleg Ruchovets <[email protected]> wrote:

> Hi Rodrigo,
>
> That is just a GREAT idea :-) !!!
>
> But how did you get the final result:
>
> ========================
> event1 | event2, event3
> event2 | event3
> event3 |
>
> I tried to simulate it and didn't get event1 | event2, event3:
>
> (10:03, [*after*, event1])
> (10:04, [*after*, event1])
> (10:05, [*after*, event1])
> (10:06, [*after*, event1]), (10:06, [*after*, event2])
> (10:07, [*begin*, event1]), (10:07, [*after*, event2])
> (10:08, [*after*, event2]), (10:08, [*after*, event3])
> (10:09, [*after*, event2]), (10:09, [*after*, event3])
> (10:10, [*begin*, event2]), (10:10, [*after*, event3])
> (10:11, [*after*, event3])
> (10:12, [*begin*, event3])
>
> Thanks
> Oleg.
>
> On Thu, Jan 31, 2013 at 4:34 PM, Rodrigo Ribeiro <[email protected]> wrote:
>
> > Hi,
> > The Map and Reduce steps that you mention are the same as what I had in mind.
> >
> > > How should I work with this table? Should I scan the main table row by
> > > row, and for every row get the event time and, based on that time, query
> > > the second table?
> > >
> > > In case I do so, I still need to execute 50 million requests?
> > >
> > > Maybe I need to work only with the second table. How do I know what to
> > > query (scan)?
> >
> > Yes, using that approach you need to query both tables for each eventId you
> > need to look up.
> >
> > I thought about something else right now; I think it'll be better for your
> > use case.
> > You could distinguish the events that begin at a time from those that are
> > after it when you emit them.
> > For the example using T=5, the emits would be:
> >
> > For event1 in the map phase they will be (10:07, [*begin*, event1]),
> > (10:06, [*after*, event1]), (10:05, [*after*, event1]),
> > (10:04, [*after*, event1]), (10:03, [*after*, event1]).
> > For event2 in the map phase they will be (10:10, [*begin*, event2]),
> > (10:09, [*after*, event2]), (10:08, [*after*, event2]),
> > (10:07, [*after*, event2]), (10:06, [*after*, event2]).
> > For event3 in the map phase they will be (10:12, [*begin*, event3]),
> > (10:11, [*after*, event3]), (10:10, [*after*, event3]),
> > (10:09, [*after*, event3]), (10:08, [*after*, event3]).
> >
> > So the reduce step knows exactly who began at a given time and who is in
> > the window of time after it.
> >
> > The reduce step for key "10:07" would receive { [*begin*, event1],
> > [*after*, event2], [*after*, event3] },
> > so you know that event1 began at this time and events 2 and 3 are in its
> > window of time, and you save that to a second table.
> >
> > The reduce step for key "10:06" would receive { [*after*, event1],
> > [*after*, event2] }.
> > No event began at this time, so there is nothing to save.
> >
> > After all this, you get a second table that I believe contains exactly
> > what you want:
> >
> > eventid | events_window_time
> > ========================
> > event1 | event2, event3
> > event2 | event3
> > event3 |
> >
> > Let me know if I'm not being clear.
> >
> > On Thu, Jan 31, 2013 at 10:52 AM, Oleg Ruchovets <[email protected]> wrote:
> >
> > > Hi Rodrigo,
> > > As usual you have very interesting ideas! :-)
> > >
> > > I am not sure that I understand exactly what you mean, so I will try to
> > > simulate it:
> > > Suppose we have these events in the MAIN table:
> > > event1 | 10:07
> > > event2 | 10:10
> > > event3 | 10:12
> > > Time window T=5 minutes.
> > >
> > > =================on map================ :
> > >
> > > What should I emit for event1 and event2?
> > >
> > > For event1 in the map phase it will be (10:07, event1), (10:06, event1),
> > > (10:05, event1), (10:04, event1), (10:03, event1).
> > > For event2 in the map phase it will be (10:10, event2), (10:09, event2),
> > > (10:08, event2), (10:07, event2), (10:06, event2).
> > > For event3 in the map phase it will be (10:12, event3), (10:11, event3),
> > > (10:10, event3), (10:09, event3), (10:08, event3).
> > >
> > > I calculate T=5 steps back from the event time. Is it correct?
> > >
> > > ==================on reduce =========:
> > >
> > > 10:03|event1
> > > 10:04|event1
> > > 10:05|event1
> > > 10:06|event1,event2
> > > 10:07|event1,event2
> > > 10:08|event2,event3
> > > 10:09|event2,event3
> > > 10:10|event2,event3
> > > 10:11|event3
> > > 10:12|event3
> > >
> > > This output will be written to the second table. Is it correct?
> > >
> > > =============================================
> > >
> > > How should I work with this table? Should I scan the main table row by
> > > row, and for every row get the event time and, based on that time, query
> > > the second table?
> > >
> > > In case I do so, I still need to execute 50 million requests?
> > >
> > > Maybe I need to work only with the second table. How do I know what to
> > > query (scan)?
> > >
> > > I am sure I simply don't understand well what your approach to the
> > > solution is.
> > >
> > > Please explain.
> > >
> > > Thanks
> > > Oleg.
> > >
> > > On Wed, Jan 30, 2013 at 8:34 PM, Rodrigo Ribeiro <[email protected]> wrote:
> > >
> > > > There is another option.
> > > > You could do a MapReduce job that, for each row from the main table,
> > > > emits all the times at which it would be in the window of time.
> > > > For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ...
> > > > {"10:00": event1} (and also "10:07" if you want to know those that
> > > > happen in the same minute too).
> > > > And in the Reduce step you aggregate and save in another table all
> > > > events that are in the window of a given time.
> > > >
> > > > For:
> > > > event_id | time
> > > > =============
> > > > event1 | 10:07
> > > > event2 | 10:10
> > > > event3 | 10:12
> > > >
> > > > The result table would look like:
> > > > time | events
> > > > 10:00 | event1
> > > > 10:01 | event1
> > > > 10:02 | event1
> > > > 10:03 | event1,event2
> > > > 10:04 | event1,event2
> > > > 10:05 | event1,event2,event3
> > > > 10:06 | event1,event2,event3
> > > > 10:07 | event2,event3
> > > > 10:08 | event2,event3
> > > > ...
> > > >
> > > > So, knowing the time when an event happens, you can get the list of
> > > > events after it.
> > > > For event1, we only look in this table for the key "10:07".
> > > >
> > > > Sorry for any typos, writing in a bit of a hurry.
> > > >
> > > > On Wed, Jan 30, 2013 at 6:57 AM, Oleg Ruchovets <[email protected]> wrote:
> > > >
> > > > > Hi Rodrigo.
> > > > > Using the solution with 2 tables: one main and one as index.
> > > > > I have ~50 million records. In my case I need to scan the whole table,
> > > > > and as a result I will have 50 million scans, and it will kill all
> > > > > performance.
> > > > >
> > > > > Is there any other approach to model my use case using HBase?
> > > > >
> > > > > Thanks
> > > > > Oleg.
> > > > >
> > > > > On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <[email protected]> wrote:
> > > > >
> > > > > > In the approach that I mentioned, you would need a table to retrieve
> > > > > > the time of a certain event (if this information can be retrieved in
> > > > > > another way, you may ignore this table).
> > > > > > It would be like you posted:
> > > > > > event_id | time
> > > > > > =============
> > > > > > event1 | 10:07
> > > > > > event2 | 10:10
> > > > > > event3 | 10:12
> > > > > > event4 | 10:20
> > > > > >
> > > > > > And a secondary table would be like:
> > > > > > rowkey
> > > > > > ===========
> > > > > > 10:07:event1
> > > > > > 10:10:event2
> > > > > > 10:12:event3
> > > > > > 10:20:event4
> > > > > >
> > > > > > That way, for your first example, you first retrieve the time of
> > > > > > "event1" from the main table, and then scan starting from its position
> > > > > > in the secondary table ("10:07:event1") until the end of the window.
> > > > > > In this case (T=7) the scan will range over ["10:07:event1", "10:15").
> > > > > >
> > > > > > As Michel Segel mentioned, there is a hotspot problem on insertion
> > > > > > using this approach alone.
> > > > > > Using multiple buckets (the bucket could be a hash of the eventId)
> > > > > > would distribute it better, but requires scanning all buckets of the
> > > > > > second table to get all events in the window of time.
> > > > > >
> > > > > > Assuming you use 3 buckets, it would look like:
> > > > > > rowkey
> > > > > > ===========
> > > > > > *1_*10:07:event1
> > > > > > *2_*10:10:event2
> > > > > > *3_*10:12:event3
> > > > > > *2_*10:20:event4
> > > > > >
> > > > > > The scans would be: ["*1*_10:07:event1", "1_10:15"),
> > > > > > ["*2*_10:07:event1", "2_10:15"), and ["*3*_10:07:event1", "3_10:15");
> > > > > > you can then combine the results.
> > > > > >
> > > > > > Hope it helps.
> > > > > >
> > > > > > On Mon, Jan 28, 2013 at 12:49 PM, Oleg Ruchovets <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Rodrigo.
> > > > > > > Can you please explain your solution in more detail? You said that
> > > > > > > I will have another table. How many tables will I have? Will I have
> > > > > > > 2 tables?
> > > > > > > What will be the schema of the tables?
> > > > > > >
> > > > > > > Let me explain what I am trying to achieve:
> > > > > > > I have ~50 million records like {time|event}. I want to put the data
> > > > > > > in HBase in such a way that I get:
> > > > > > > events of time X and all events that came after event X during
> > > > > > > T minutes (for example, during 7 minutes).
> > > > > > > So I will be able to scan the whole table and get groups like:
> > > > > > >
> > > > > > > {event1:10:02} corresponds to events {event2:10:03}, {event3:10:05},
> > > > > > > {event4:10:06}
> > > > > > > {event2:10:30} corresponds to events {events5:10:32}, {event3:10:33},
> > > > > > > {event3:10:36}.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Oleg.
> > > > > > >
> > > > > > > On Mon, Jan 28, 2013 at 5:17 PM, Rodrigo Ribeiro <[email protected]> wrote:
> > > > > > >
> > > > > > > > You can use another table as an index, using a rowkey like
> > > > > > > > '{time}:{event_id}', and then scan in the range ["10:07", "10:15").
> > > > > > > >
> > > > > > > > On Mon, Jan 28, 2013 at 10:06 AM, Oleg Ruchovets <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I have such row data structure:
> > > > > > > > >
> > > > > > > > > event_id | time
> > > > > > > > > =============
> > > > > > > > > event1 | 10:07
> > > > > > > > > event2 | 10:10
> > > > > > > > > event3 | 10:12
> > > > > > > > >
> > > > > > > > > event4 | 10:20
> > > > > > > > > event5 | 10:23
> > > > > > > > > event6 | 10:25
> > > > > > > > >
> > > > > > > > > The number of records is 50-100 million.
> > > > > > > > >
> > > > > > > > > Question:
> > > > > > > > >
> > > > > > > > > I need to find the group of events starting from eventX that fall
> > > > > > > > > into the time window bucket = T.
> > > > > > > > >
> > > > > > > > > For example, if T=7 minutes:
> > > > > > > > > Starting from event1 - {event1, event2, event3} were detected
> > > > > > > > > during 7 minutes.
> > > > > > > > >
> > > > > > > > > Starting from event2 - {event2, event3} were detected during
> > > > > > > > > 7 minutes.
> > > > > > > > >
> > > > > > > > > Starting from event4 - {event4, event5, event6} were detected
> > > > > > > > > during 7 minutes.
> > > > > > > > > Is there a way to model the data in HBase to get this?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > *Rodrigo Pereira Ribeiro*
> > > > > > > > Software Developer
> > > > > > > > www.jusbrasil.com.br
> > > > > >
> > > > > > --
> > > > > >
> > > > > > *Rodrigo Pereira Ribeiro*
> > > > > > Software Developer
> > > > > > T (71) 3033-6371
> > > > > > C (71) 8612-5847
> > > > > > [email protected]
> > > > > > www.jusbrasil.com.br
> > > >
> > > > --
> > > >
> > > > *Rodrigo Pereira Ribeiro*
> > > > Software Developer
> > > > www.jusbrasil.com.br
> >
> > --
> >
> > *Rodrigo Pereira Ribeiro*
> > Software Developer
> > www.jusbrasil.com.br

--
*Rodrigo Pereira Ribeiro*
Software Developer
www.jusbrasil.com.br
