Re: how to model data based on "time bucket"

Rodrigo Ribeiro Thu, 31 Jan 2013 07:52:12 -0800

Yes, you are correct: event3 never emits for the time "10:07".
The proper result table is, as you mention:
eventid | events_window_time
========================
event1 | event2
event2 | event3
event3 |


I guess I was thinking about the old example (T=7). :)
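
To make the corrected result easy to check, here is a minimal in-memory sketch of the tagged-emit idea in plain Python. It is not a real MapReduce job and does not touch HBase; it assumes minute resolution, times that do not cross midnight, and helper names (to_minutes, map_event, reduce_bucket, build_window_table) that are made up for illustration.

```python
from collections import defaultdict

T = 5  # window size in minutes, as in the T=5 example


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def map_event(event_id, hhmm, window=T):
    """Emit (time, (tag, event_id)): 'begin' for the event's own minute,
    'after' for each of the window - 1 minutes before it."""
    start = to_minutes(hhmm)
    yield start, ("begin", event_id)
    for back in range(1, window):
        yield start - back, ("after", event_id)


def reduce_bucket(values):
    """For one time key, return {begin_event: [after_events]}.
    A key where no event began produces no rows."""
    begins = sorted(e for tag, e in values if tag == "begin")
    afters = sorted(e for tag, e in values if tag == "after")
    return {b: afters for b in begins}


def build_window_table(events, window=T):
    buckets = defaultdict(list)  # stands in for the shuffle phase
    for event_id, hhmm in events:
        for key, value in map_event(event_id, hhmm, window):
            buckets[key].append(value)
    table = {}
    for key in sorted(buckets):
        table.update(reduce_bucket(buckets[key]))
    return table


events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
table = build_window_table(events)
for event_id, window_events in table.items():
    print(event_id, "|", ", ".join(window_events))
```

With T=5 this prints the three rows of the table above; calling build_window_table(events, window=7) gives event1 | event2, event3 again, which is where the old table came from. Note that in this sketch two events beginning in the same minute would not list each other, since both are tagged "begin" for that key.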

On Thu, Jan 31, 2013 at 12:39 PM, Oleg Ruchovets <[email protected]> wrote:

> Hi Rodrigo ,
>
>   That is just a GREAT idea :-) !!!
>
>  But how did you get this final result:
>
> ========================
> event1 | event2, event3
> event2 | event3
> event3 |
> I tried to simulate it and didn't get event1 | event2, event3
>
>
>    (10:03, [*after*, event1])
>    (10:04, [*after*, event1])
>    (10:05, [*after*, event1])
>    (10:06, [*after*, event1]), (10:06, [*after*, event2])
>    (10:07, [*begin*, event1]), (10:07, [*after*, event2])
>    (10:08, [*after*, event2]), (10:08, [*after*, event3])
>    (10:09, [*after*, event2]), (10:09, [*after*, event3])
>    (10:10, [*begin*, event2]), (10:10, [*after*, event3])
>    (10:11, [*after*, event3])
>    (10:12, [*begin*, event3])
>
> Thanks
> Oleg.
>
>
>
>
> On Thu, Jan 31, 2013 at 4:34 PM, Rodrigo Ribeiro <[email protected]> wrote:
>
> > Hi,
> > The Map and Reduce steps that you mention are the same as what I had in mind.
> >
> > > How should I work with this table? Do I have to scan the main table row
> > > by row, and for every row get the event time and, based on that time,
> > > query the second table?
> > >
> > >     If I do so, do I still need to execute 50 million requests?
> > >
> > > Maybe I need to work only with the second table. How do I know what to
> > > query (scan)?
> >
> >
> > Yes, using that approach you need to query both tables for each eventId
> > you need to look up.
> >
> > I just thought of something else that I think will be better for your
> > use case.
> > When you emit, you could distinguish the event that begins at a given time
> > from the events that fall in the window after that time.
> > For the example using T=5, the emits would be:
> >
> > For event1 in map phase will be (10:07, [*begin*, event1]),
> > (10:06, [*after*, event1]), (10:05, [*after*, event1]),
> > (10:04, [*after*, event1]), (10:03, [*after*, event1]).
> > For event2 in map phase will be (10:10, [*begin*, event2]),
> > (10:09, [*after*, event2]), (10:08, [*after*, event2]),
> > (10:07, [*after*, event2]), (10:06, [*after*, event2]).
> > For event3 in map phase will be (10:12, [*begin*, event3]),
> > (10:11, [*after*, event3]), (10:10, [*after*, event3]),
> > (10:09, [*after*, event3]), (10:08, [*after*, event3]).
> >
> >
> > So the reduce step knows exactly which event began at a given time and
> > which events fall in the window of time after it.
> >
> > The reduce step for key "10:07" would receive { [*begin*, event1],
> > [*after*, event2], [*after*, event3] }.
> > So you know that event1 began at this time and that events 2 and 3 are in
> > its window of time, and you save that to a second table.
> >
> > The reduce step for key "10:06" would receive { [*after*, event1],
> > [*after*, event2] }.
> > No event began at this time, so there is nothing to save.
> >
> > After all this, you get a second table that I believe contains exactly
> > what you want:
> > eventid | events_window_time
> > ========================
> > event1  | event2, event3
> > event2  | event3
> > event3  |
> >
> > Let me know if I'm not being clear.
> >
> > On Thu, Jan 31, 2013 at 10:52 AM, Oleg Ruchovets <[email protected]> wrote:
> >
> > > Hi Rodrigo ,
> > >   As usual you have very interesting ideas! :-)
> > >
> > > I am not sure that I understand exactly what you mean, so I will try to
> > > simulate it:
> > >      Suppose we have such events in MAIN Table:
> > >             event1 | 10:07
> > >             event2 | 10:10
> > >             event3 | 10:12
> > >      Time window T=5 minutes.
> > >
> > > =================on  map================ :
> > >
> > > What should I emit for event1 and event2?
> > >
> > > For event1 in map phase will be (10:07, event1), (10:06, event1),
> > > (10:05, event1), (10:04, event1), (10:03, event1).
> > > For event2 in map phase will be (10:10, event2), (10:09, event2),
> > > (10:08, event2), (10:07, event2), (10:06, event2).
> > > For event3 in map phase will be (10:12, event3), (10:11, event3),
> > > (10:10, event3), (10:09, event3), (10:08, event3).
> > >
> > > I calculate T=5 steps back from the event time. Is that correct?
> > >
> > > ==================on reduce =========:
> > >
> > > 10:03|event1
> > > 10:04|event1
> > > 10:05|event1
> > > 10:06|event1,event2
> > > 10:07|event1,event2
> > > 10:08|event2,event3
> > > 10:09|event2,event3
> > > 10:10|event2,event3
> > > 10:11|event3
> > > 10:12|event3
> > >
> > > This output will be written to the second table. Is that correct?
> > >
> > > =============================================
> > >
> > > How should I work with this table? Do I have to scan the main table row
> > > by row, and for every row get the event time and, based on that time,
> > > query the second table?
> > >
> > >     If I do so, do I still need to execute 50 million requests?
> > >
> > > Maybe I need to work only with the second table. How do I know what to
> > > query (scan)?
> > >
> > > I am sure I simply don't understand your approach to the solution well.
> > >
> > > Please explain.
> > >
> > > Thanks
> > > Oleg.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Jan 30, 2013 at 8:34 PM, Rodrigo Ribeiro <[email protected]> wrote:
> > >
> > > > There is another option.
> > > > You could do a MapReduce job that, for each row from the main table,
> > > > emits all the times whose window it would fall in.
> > > > For example, "event1" would emit {"10:06": event1}, {"10:05": event1}
> > > > ... {"10:00": event1} (and also "10:07" if you want to know those that
> > > > happen in the same minute too).
> > > > And in the Reduce step you aggregate and save in another table all
> > > > events that are in the window of a given time.
> > > >
> > > > For:
> > > > event_id | time
> > > > =============
> > > > event1 | 10:07
> > > > event2 | 10:10
> > > > event3 | 10:12
> > > >
> > > > The result table would look like:
> > > > time   | events
> > > > 10:00 | event1
> > > > 10:01 | event1
> > > > 10:02 | event1
> > > > 10:03 | event1,event2
> > > > 10:04 | event1,event2
> > > > 10:05 | event1,event2,event3
> > > > 10:06 | event1,event2,event3
> > > > 10:07 | event2,event3
> > > > 10:08 | event2,event3
> > > > ...
> > > >
> > > > So, knowing the time when an event happens, you can get the list of
> > > > events after it.
> > > > For event1, we only look in this table for the key "10:07".
> > > >
> > > > Sorry for any typos, I'm writing in a bit of a hurry.
> > > >
> > > > On Wed, Jan 30, 2013 at 6:57 AM, Oleg Ruchovets <[email protected]> wrote:
> > > >
> > > > > Hi Rodrigo.
> > > > >     Using the solution with 2 tables: one main and one as index.
> > > > > I have ~50 million records. In my case I need to scan the whole table,
> > > > > and as a result I will have 50 million scans, which will kill
> > > > > performance.
> > > > >
> > > > > Is there any other approach to model my use case using HBase?
> > > > >
> > > > > Thanks
> > > > > Oleg.
> > > > >
> > > > >
> > > > > On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <[email protected]> wrote:
> > > > >
> > > > > > In the approach that I mentioned, you would need a table to retrieve
> > > > > > the time of a certain event (if this information can be retrieved
> > > > > > another way, you may ignore this table). It would be like you posted:
> > > > > > event_id | time
> > > > > > =============
> > > > > > event1 | 10:07
> > > > > > event2 | 10:10
> > > > > > event3 | 10:12
> > > > > > event4 | 10:20
> > > > > >
> > > > > > And a secondary table would be like:
> > > > > > rowkey
> > > > > > ===========
> > > > > > 10:07:event1
> > > > > > 10:10:event2
> > > > > > 10:12:event3
> > > > > > 10:20:event4
> > > > > >
> > > > > > That way, for your first example, you first retrieve the time of
> > > > > > "event1" from the main table, and then scan starting from its position
> > > > > > in the secondary table ("10:07:event1") until the end of the window.
> > > > > > In this case (T=7) the scan range will be ["10:07:event1", "10:15").
> > > > > >
> > > > > > As Michel Segel mentioned, there is a hotspot problem on insertion
> > > > > > when using this approach alone.
> > > > > > Using multiple buckets (the bucket could be a hash of the eventId)
> > > > > > would distribute the writes better, but requires scanning all buckets
> > > > > > of the second table to get all events in the window of time.
> > > > > >
> > > > > > Assuming you use 3 buckets, it would look like:
> > > > > > rowkey
> > > > > > ===========
> > > > > > *1_*10:07:event1
> > > > > > *2_*10:10:event2
> > > > > > *3_*10:12:event3
> > > > > > *2_*10:20:event4
> > > > > >
> > > > > > The scans would be: ["*1*_10:07:event1", "1_10:15"),
> > > > > > ["*2*_10:07:event1", "2_10:15"), and ["*3*_10:07:event1", "3_10:15");
> > > > > > you can then combine the results.
> > > > > >
> > > > > > Hope it helps.
> > > > > >
> > > > > > On Mon, Jan 28, 2013 at 12:49 PM, Oleg Ruchovets <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Rodrigo.
> > > > > > >   Can you please explain your solution in more detail? You said that
> > > > > > > I will have another table. How many tables will I have? Will I have 2
> > > > > > > tables? What will be the schema of the tables?
> > > > > > >
> > > > > > > Let me explain what I am trying to achieve:
> > > > > > >     I have ~50 million records like {time|event}. I want to put the
> > > > > > > data in HBase in such a way that I store:
> > > > > > >     the event at time X and all events that came after event X within
> > > > > > > T minutes (for example, within 7 minutes).
> > > > > > > So I will be able to scan the whole table and get groups like:
> > > > > > >
> > > > > > >   {event1:10:02} corresponds to events {event2:10:03}, {event3:10:05},
> > > > > > > {event4:10:06}
> > > > > > >   {event2:10:30} corresponds to events {events5:10:32}, {event3:10:33},
> > > > > > > {event3:10:36}.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Oleg.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jan 28, 2013 at 5:17 PM, Rodrigo Ribeiro <[email protected]> wrote:
> > > > > > >
> > > > > > > > You can use another table as an index, using a rowkey like
> > > > > > > > '{time}:{event_id}', and then scan in the range ["10:07", "10:15").
> > > > > > > >
> > > > > > > > On Mon, Jan 28, 2013 at 10:06 AM, Oleg Ruchovets <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi ,
> > > > > > > > >
> > > > > > > > > I have the following row data structure:
> > > > > > > > >
> > > > > > > > > event_id | time
> > > > > > > > > =============
> > > > > > > > > event1 | 10:07
> > > > > > > > > event2 | 10:10
> > > > > > > > > event3 | 10:12
> > > > > > > > >
> > > > > > > > > event4 | 10:20
> > > > > > > > > event5 | 10:23
> > > > > > > > > event6 | 10:25
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The number of records is 50-100 million.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Question:
> > > > > > > > >
> > > > > > > > > I need to find the group of events that starts from eventX and
> > > > > > > > > falls into the time window bucket = T.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > For example: if T=7 minutes.
> > > > > > > > > Starting from event1: {event1, event2, event3} were detected
> > > > > > > > > during 7 minutes.
> > > > > > > > >
> > > > > > > > > Starting from event2: {event2, event3} were detected during 7
> > > > > > > > > minutes.
> > > > > > > > >
> > > > > > > > > Starting from event4: {event4, event5, event6} were detected
> > > > > > > > > during 7 minutes.
> > > > > > > > > Is there a way to model the data in HBase to get this?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > *Rodrigo Pereira Ribeiro*
> > > > > > > > Software Developer
> > > > > > > > www.jusbrasil.com.br
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > *Rodrigo Pereira Ribeiro*
> > > > > > Software Developer
> > > > > > T (71) 3033-6371
> > > > > > C (71) 8612-5847
> > > > > > [email protected]
> > > > > > www.jusbrasil.com.br
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *Rodrigo Pereira Ribeiro*
> > > > Software Developer
> > > > www.jusbrasil.com.br
> > > >
> > >
> >
> >
> >
> > --
> >
> > *Rodrigo Pereira Ribeiro*
> > Software Developer
> > www.jusbrasil.com.br
> >
>



-- 

*Rodrigo Pereira Ribeiro*
Software Developer
www.jusbrasil.com.br
