Thanks again for all your comments.

Jonathan, would you mind enlightening me on how you would keep track of the
people you need to "eject"? I don't get the min-heap-based tuple part...

Cheers
-Marco

On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <[email protected]> wrote:

> Perhaps I'm misunderstanding your use case, and this depends on the amount
> of data, but you could consider something like this (to avoid exploding the
> data, which could perhaps be unavoidable, but I hate resorting to that if I
> don't have to).
>
> a = foreach yourdata generate student_id, start_time, start_time+duration
> as end_time, course;
> b = group a by course;
> c = foreach b {
>  ord = order a by start_time;
>  generate flatten(yourudf.process(ord)); -- bag of (second, count) tuples
> }
>
> Here is generally what process could do. It would be an accumulator UDF
> that expects tuples sorted on start_time. You basically need a way to know
> who the distinct users are at any moment. Since you want 30s windows, your
> first window will presumably end 30s after the first start_time in your
> data, and you would just tick ahead in 1s steps, writing to a bag tuples of
> (second, # of distinct student_ids). To know when to eject people, you
> could use any number of data structures; a min heap keyed on end_time works
> well. And instead of literally "ticking" ahead, you would grab a new tuple
> (since that is the only thing that changes the number of distinct ids),
> emit the counts for all the seconds between the current time pointer and
> the new tuple's start_time, and at each step check the top of the min heap
> to eject any users whose sessions have expired.
>
> That was a little rambly; I could quickly put together some more reasonable
> pseudocode if that would help. I think the general idea is clear though...
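>
> Roughly, as an untested Java sketch of that loop (the class and method
> names here are made up for illustration; in the real accumulator UDF,
> emit() would append a (second, count) tuple to the output bag instead of
> printing):
>
> import java.util.Comparator;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.PriorityQueue;
>
> public class ConcurrencyCounter {
>     static class Session {
>         final long studentId, startTime, endTime;
>         Session(long id, long start, long end) {
>             studentId = id; startTime = start; endTime = end;
>         }
>     }
>
>     // min heap keyed on end_time: the next session to expire is on top
>     private final PriorityQueue<Session> active =
>         new PriorityQueue<>(Comparator.comparingLong(s -> s.endTime));
>     // reference counts per student_id, so a student with overlapping
>     // sessions is still counted only once
>     private final Map<Long, Integer> refCount = new HashMap<>();
>     private long clock = Long.MIN_VALUE; // current time pointer
>
>     // call once per tuple, in start_time order
>     public void accept(Session s) {
>         if (clock == Long.MIN_VALUE) clock = s.startTime;
>         // emit counts for every second between the pointer and the new
>         // tuple's start, ejecting expired sessions along the way
>         while (clock < s.startTime) {
>             expire(clock);
>             emit(clock, refCount.size()); // # of distinct student_ids
>             clock++;
>         }
>         expire(s.startTime);
>         active.add(s);
>         refCount.merge(s.studentId, 1, Integer::sum);
>     }
>
>     private void expire(long now) {
>         // eject everyone whose end_time has passed
>         while (!active.isEmpty() && active.peek().endTime <= now) {
>             Session done = active.poll();
>             if (refCount.merge(done.studentId, -1, Integer::sum) <= 0)
>                 refCount.remove(done.studentId);
>         }
>     }
>
>     private void emit(long second, int distinctStudents) {
>         // stand-in for writing to the UDF's output bag; for 30s windows
>         // the caller could keep only every 30th second
>         System.out.println(second + "\t" + distinctStudents);
>     }
> }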
>
> 2011/10/31 Guy Bayes <[email protected]>
>
> > Ahh, TV, that explains it.
> >
> > A 12G data file is a bit too big for R unless you sample; not sure if the
> > use case is conducive to sampling?
> >
> > If it is, you could sample it down and structure it in Pig/Hadoop, and
> > then load it into the analytical/visualization tool of choice...
> >
> > Guy
> >
> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[email protected]> wrote:
> >
> > > The data is not about students but about television ;) Regarding the
> > > size: the raw input data size is about 150M, although when I 'explode'
> > > the timeseries it will be around 80x bigger. I guess the average user
> > > duration will be around 40 minutes, which means sampling it at a 30s
> > > interval will increase the size to ~12GB.
> > >
> > > I think that is a size my Hadoop cluster (five nodes, each with 8 cores,
> > > 8GB RAM, and 2TB of disk) should be able to cope with.
> > >
> > > I don't know about R. Are you able to handle 12GB files well in R (of
> > > course it depends on your computer, so assume an average business
> > > computer, e.g. 2-core 2GHz with 4GB RAM)?
> > >
> > > Cheers
> > > -Marco
> > >
> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[email protected]> wrote:
> > >
> > > > If it fits in R, it's trivial: draw a density plot or a histogram,
> > > > about three lines of R code.
> > > >
> > > > That's why I was wondering about the data volume.
> > > >
> > > > His example is students attending classes; if that is really the
> > > > data, it's hard to believe it's super huge?
> > > >
> > > > Guy
> > > >
> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[email protected]> wrote:
> > > >
> > > > > Perhaps another way to approach this problem is to visualize it
> > > > > geometrically.  You have a long series of class session instances,
> > > > > where each class session is like a 1D line segment, beginning and
> > > > > ending at some start/end time.
> > > > >
> > > > > These segments naturally overlap, and I think the question you're
> > > > > asking is equivalent to finding the number of overlaps at every
> > > > > subsegment.
> > > > >
> > > > > To answer this, you want to first break every class session into a
> > > > > full list of subsegments, where a subsegment is created by
> > > > > "breaking" each class session/segment into multiple parts at the
> > > > > start/end point of any other class session.  You can create this
> > > > > full set of subsegments in one pass by comparing pairwise (CROSS)
> > > > > each start/end point with your original list of class sessions.
> > > > >
> > > > > Once you have the full list of "broken" segments, then a final
> > > > > GROUP BY/COUNT(*) will give you the number of overlaps.  Seems like
> > > > > this approach would be faster than the previous one if your class
> > > > > sessions are very long, or there are many overlaps.
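> > > > >
> > > > > A rough, untested Pig sketch of that idea (it assumes a relation
> > > > > sessions with fields course, start_time and end_time, and uses a
> > > > > JOIN on course in place of a full CROSS to limit the blowup):
> > > > >
> > > > > -- every start/end point is a potential cut point
> > > > > starts  = foreach sessions generate course, start_time as cut;
> > > > > ends    = foreach sessions generate course, end_time as cut;
> > > > > allcuts = union starts, ends;
> > > > > cuts    = distinct allcuts;
> > > > >
> > > > > -- pair every cut point with every session in the same course
> > > > > pairs   = join cuts by course, sessions by course;
> > > > >
> > > > > -- a session covers the subsegment that begins at a cut point
> > > > > covered = filter pairs by cuts::cut >= sessions::start_time
> > > > >                      and cuts::cut <  sessions::end_time;
> > > > >
> > > > > -- overlap count for the subsegment starting at each cut point
> > > > > grouped = group covered by (cuts::course, cuts::cut);
> > > > > counts  = foreach grouped generate FLATTEN(group), COUNT(covered);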
> > > > >
> > > > > Norbert
> > > > >
> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[email protected]> wrote:
> > > > >
> > > > > > how big is your dataset?
> > > > > >
> > > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[email protected]> wrote:
> > > > > >
> > > > > > > Thanks Bill and Norbert, that seems like what I was looking
> > > > > > > for. I'm a bit worried about how much data/IO this could
> > > > > > > create. But I'll see ;)
> > > > > > >
> > > > > > > Cheers
> > > > > > > -Marco
> > > > > > >
> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[email protected]> wrote:
> > > > > > >
> > > > > > > > In case what you're looking for is an analysis over the full
> > > > > > > > learning duration, and not just the start interval, then one
> > > > > > > > further insight is that each original record can be
> > > > > > > > transformed into a sequence of records, where the size of the
> > > > > > > > sequence corresponds to the session duration.  In other
> > > > > > > > words, you can use a UDF to "explode" the original record:
> > > > > > > >
> > > > > > > > 1,marco,1319708213,500,math
> > > > > > > >
> > > > > > > > into:
> > > > > > > >
> > > > > > > > 1,marco,1319708190,500,math
> > > > > > > > 1,marco,1319708220,500,math
> > > > > > > > 1,marco,1319708250,500,math
> > > > > > > > 1,marco,1319708280,500,math
> > > > > > > > 1,marco,1319708310,500,math
> > > > > > > > 1,marco,1319708340,500,math
> > > > > > > > 1,marco,1319708370,500,math
> > > > > > > > 1,marco,1319708400,500,math
> > > > > > > > 1,marco,1319708430,500,math
> > > > > > > > 1,marco,1319708460,500,math
> > > > > > > > 1,marco,1319708490,500,math
> > > > > > > > 1,marco,1319708520,500,math
> > > > > > > > 1,marco,1319708550,500,math
> > > > > > > > 1,marco,1319708580,500,math
> > > > > > > > 1,marco,1319708610,500,math
> > > > > > > > 1,marco,1319708640,500,math
> > > > > > > > 1,marco,1319708670,500,math
> > > > > > > > 1,marco,1319708700,500,math
> > > > > > > >
> > > > > > > > and then use Bill's suggestion to group by course, interval.
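> > > > > > > >
> > > > > > > > A hedged sketch of what such a UDF could look like (untested;
> > > > > > > > the class name is made up, and the field positions assume the
> > > > > > > > schema student_id, student_name, start_time, duration,
> > > > > > > > course):
> > > > > > > >
> > > > > > > > import java.io.IOException;
> > > > > > > > import java.util.ArrayList;
> > > > > > > > import org.apache.pig.EvalFunc;
> > > > > > > > import org.apache.pig.data.BagFactory;
> > > > > > > > import org.apache.pig.data.DataBag;
> > > > > > > > import org.apache.pig.data.Tuple;
> > > > > > > > import org.apache.pig.data.TupleFactory;
> > > > > > > >
> > > > > > > > public class ExplodeSession extends EvalFunc<DataBag> {
> > > > > > > >     private static final TupleFactory TF = TupleFactory.getInstance();
> > > > > > > >     private static final BagFactory BF = BagFactory.getInstance();
> > > > > > > >
> > > > > > > >     @Override
> > > > > > > >     public DataBag exec(Tuple input) throws IOException {
> > > > > > > >         if (input == null || input.size() < 5) return null;
> > > > > > > >         long start = ((Number) input.get(2)).longValue();
> > > > > > > >         long duration = ((Number) input.get(3)).longValue();
> > > > > > > >         DataBag out = BF.newDefaultBag();
> > > > > > > >         // round start down to a 30s boundary, then emit one
> > > > > > > >         // copy of the record per 30s tick the session covers
> > > > > > > >         for (long t = (start / 30) * 30; t <= start + duration; t += 30) {
> > > > > > > >             Tuple copy = TF.newTuple(new ArrayList<Object>(input.getAll()));
> > > > > > > >             copy.set(2, t); // overwrite start_time with the tick
> > > > > > > >             out.add(copy);
> > > > > > > >         }
> > > > > > > >         return out;
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > > In the script you would then do something like:
> > > > > > > >
> > > > > > > > exploded = foreach raw generate FLATTEN(ExplodeSession(*));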
> > > > > > > >
> > > > > > > > Norbert
> > > > > > > >
> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[email protected]> wrote:
> > > > > > > > > You can pass your time to a UDF that rounds it down to the
> > > > > > > > > nearest 30 second interval, and then group by course,
> > > > > > > > > interval to get counts for each course, interval.
> > > > > > > > >
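> > > > > > > > > A hedged sketch of that in pure Pig (untested; integer
> > > > > > > > > division already does the rounding, so you may not even
> > > > > > > > > need a UDF, and the nested DISTINCT makes the count
> > > > > > > > > per-student rather than per-record):
> > > > > > > > >
> > > > > > > > > a = load 'sessions' using PigStorage(',') as
> > > > > > > > >         (student_id:long, student_name:chararray,
> > > > > > > > >          start_time:long, duration:long, course:chararray);
> > > > > > > > > -- integer division rounds start_time down to a 30s boundary
> > > > > > > > > b = foreach a generate course,
> > > > > > > > >         (start_time / 30) * 30 as interval, student_id;
> > > > > > > > > c = group b by (course, interval);
> > > > > > > > > d = foreach c {
> > > > > > > > >         ids = distinct b.student_id;
> > > > > > > > >         generate FLATTEN(group), COUNT(ids) as students;
> > > > > > > > > };
> > > > > > > > >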
> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <[email protected]> wrote:
> > > > > > > > >> I have a problem where I don't know how to solve it, or
> > > > > > > > >> whether Pig is even suitable.
> > > > > > > > >>
> > > > > > > > >> I have a schema like this:
> > > > > > > > >>
> > > > > > > > >> student-id,student-name,start-time,duration,course
> > > > > > > > >> 1,marco,1319708213,500,math
> > > > > > > > >> 2,ralf,1319708111,112,english
> > > > > > > > >> 3,greg,1319708321,333,french
> > > > > > > > >> 4,diva,1319708444,80,english
> > > > > > > > >> 5,susanne,1319708123,2000,math
> > > > > > > > >> 1,marco,1319708564,500,french
> > > > > > > > >> 2,ralf,1319708789,123,french
> > > > > > > > >> 7,fred,1319708213,5675,french
> > > > > > > > >> 8,laura,1319708233,123,math
> > > > > > > > >> 10,sab,1319708999,777,math
> > > > > > > > >> 11,fibo,1319708789,565,math
> > > > > > > > >> 6,dan,1319708456,50,english
> > > > > > > > >> 9,marco,1319708123,60,english
> > > > > > > > >> 12,bo,1319708456,345,math
> > > > > > > > >> 1,marco,1319708789,673,math
> > > > > > > > >> ...
> > > > > > > > >> ...
> > > > > > > > >>
> > > > > > > > >> I would like to retrieve a graph (interpolation) over
> > > > > > > > >> time, grouped by course, meaning: how many students are
> > > > > > > > >> learning for a course, based on a 30 sec interval.
> > > > > > > > >> The grouping by course is easy, but from there I have no
> > > > > > > > >> clue how I would achieve the rest. I guess it needs to be
> > > > > > > > >> done via some UDF, or is there a way to do this in Pig? I
> > > > > > > > >> often think that I need a "for loop" or something similar
> > > > > > > > >> in Pig.
> > > > > > > > >>
> > > > > > > > >> Thanks for your help!
> > > > > > > > >> -Marco
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
