Perhaps I'm misunderstanding your use case, and this depends on the amount
of data, but you could consider something like the following. It avoids
exploding the data, which may turn out to be unavoidable, but I hate
resorting to that if I don't have to.

a = foreach yourdata generate student_id, start_time,
    start_time + duration as end_time, course;
b = group a by course;
c = foreach b {
  ord = order a by start_time;
  generate group as course, yourudf.process(ord);
}

Here is generally what process would do. It's an accumulator UDF that
expects its tuples sorted on start_time, and the core problem is knowing
who the distinct users are at any moment. Since you want 30s windows, your
first window will presumably end 30s after the first start_time in your
data; from there you tick ahead in 1s steps, writing (second, # of distinct
student_ids) to an output bag. To know when to eject people you could use
any number of data structures, perhaps a min-heap keyed on end_time. And
instead of literally ticking every second, you grab the next tuple (a new
tuple is the only thing that changes the set of distinct ids) and then do
all the ticking at once: advance the time pointer from its current position
to the new tuple's start_time, and at each second in between, check the
min-heap to eject any users whose sessions have expired before writing out
the count.
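
To make that concrete, here's a rough, untested sketch of just the sweep
logic in plain Java (the Pig Accumulator plumbing and the output bag are
left out, and OverlapSweep / Session / countActive are made-up names):

import java.util.*;

public class OverlapSweep {
    // One session: the student counts as active during [start, end).
    record Session(long studentId, long start, long end) {}

    // Expects sessions sorted by start_time, as the nested ORDER BY
    // guarantees. Returns second -> # of distinct active student_ids;
    // keeping every 30th entry then gives you the 30s windows.
    static SortedMap<Long, Integer> countActive(List<Session> sorted) {
        SortedMap<Long, Integer> counts = new TreeMap<>();
        // min-heap on end_time: tells us whom to eject next
        PriorityQueue<Session> heap =
                new PriorityQueue<>(Comparator.comparingLong(Session::end));
        // reference counts, in case one student has overlapping sessions
        Map<Long, Integer> active = new HashMap<>();
        if (sorted.isEmpty()) return counts;
        long clock = sorted.get(0).start();

        for (Session s : sorted) {
            // tick from the current time pointer up to this tuple's
            // start_time, ejecting expired users at every step
            while (clock < s.start()) {
                evictExpired(heap, active, clock);
                counts.put(clock, active.size());
                clock++;
            }
            heap.add(s);
            active.merge(s.studentId(), 1, Integer::sum);
        }
        while (!heap.isEmpty()) { // drain the tail after the last tuple
            evictExpired(heap, active, clock);
            counts.put(clock, active.size());
            clock++;
        }
        return counts;
    }

    private static void evictExpired(PriorityQueue<Session> heap,
                                     Map<Long, Integer> active, long now) {
        while (!heap.isEmpty() && heap.peek().end() <= now) {
            long id = heap.poll().studentId();
            if (active.merge(id, -1, Integer::sum) == 0) active.remove(id);
        }
    }
}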

That was a little rambly, and the sketch above is untested and only the
core loop, but I think the general idea is clear. I can put together
something more complete if that would help.

2011/10/31 Guy Bayes <[email protected]>

> Ahh, TV, that explains it.
>
> A 12GB data file is a bit too big for R unless you sample; not sure if the
> use case is conducive to sampling?
>
> If it is, you could sample it down and structure it in Pig/Hadoop, then
> load it into the analytical/visualization tool of your choice...
>
> Guy
>
> On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[email protected]> wrote:
>
> > The data is not about students but about television ;) Regarding the
> > size: the raw input data size is about 150MB, although when I 'explode'
> > the timeseries it will be around 80x bigger. I guess the average user
> > duration will be around 40 minutes, which means sampling it at a 30s
> > interval will increase the size by ~12GB.
> >
> > I think that is a size which my Hadoop cluster (five nodes, each with
> > 8 cores, 8GB RAM and a 2TB HD) should be able to cope with.
> >
> > I don't know about R. Are you able to handle 12GB files well in R? (Of
> > course it depends on your computer, so assume an average business
> > machine, e.g. 2-core 2GHz, 4GB RAM.)
> >
> > Cheers
> > -Marco
> >
> > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[email protected]> wrote:
> >
> > > If it fits in R, it's trivial: draw a density plot or a histogram, about
> > > three lines of R code.
> > >
> > > That's why I was wondering about the data volume.
> > >
> > > His example is students attending classes; if that is really the data,
> > > it's hard to believe it's super huge?
> > >
> > > Guy
> > >
> > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[email protected]> wrote:
> > >
> > > > Perhaps another way to approach this problem is to visualize it
> > > > geometrically. You have a long series of class session instances, where
> > > > each class session is like a 1D line segment, beginning/stopping at some
> > > > start/end time.
> > > >
> > > > These segments naturally overlap, and I think the question you're asking
> > > > is equivalent to finding the number of overlaps at every subsegment.
> > > >
> > > > To answer this, you want to first break every class session into a full
> > > > list of subsegments, where a subsegment is created by "breaking" each
> > > > class session/segment into multiple parts at the start/end point of any
> > > > other class session. You can create this full set of subsegments in one
> > > > pass by comparing pairwise (CROSS) each start/end point with your
> > > > original list of class sessions.
> > > >
> > > > Once you have the full list of "broken" segments, then a final GROUP
> > > > BY/COUNT(*) will give you the number of overlaps. Seems like this
> > > > approach would be faster than the previous approach if your class
> > > > sessions are very long, or there are many overlaps.
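> > > >
> > > > A rough Pig sketch of that first pass (untested; relation and field
> > > > names are invented, and a small UDF would still do the final split of
> > > > each session at its interior points):
> > > >
> > > > -- sessions: (course, s, e), with e = start + duration
> > > > sessions = foreach raw generate course, start_time as s,
> > > >     start_time + duration as e;
> > > > -- every boundary point, per course
> > > > starts = foreach sessions generate course, s as pt;
> > > > ends = foreach sessions generate course, e as pt;
> > > > unioned = union starts, ends;
> > > > points = distinct unioned;
> > > > -- pairwise compare: a per-course cross of sessions and points
> > > > paired = join sessions by course, points by course;
> > > > inside = filter paired by points::pt > sessions::s
> > > >     and points::pt < sessions::e;
> > > > -- a UDF then emits one row per subsegment; GROUP those
> > > > -- BY (course, subsegment) and COUNT(*) gives the overlaps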
> > > >
> > > > Norbert
> > > >
> > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[email protected]> wrote:
> > > >
> > > > > how big is your dataset?
> > > > >
> > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[email protected]> wrote:
> > > > >
> > > > > > Thanks Bill and Norbert, that seems like what I was looking for. I'm a
> > > > > > bit worried about how much data/IO this could create. But I'll see ;)
> > > > > >
> > > > > > Cheers
> > > > > > -Marco
> > > > > >
> > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[email protected]> wrote:
> > > > > >
> > > > > > > In case what you're looking for is an analysis over the full learning
> > > > > > > duration, and not just the start interval, then one further insight is
> > > > > > > that each original record can be transformed into a sequence of
> > > > > > > records, where the size of the sequence corresponds to the session
> > > > > > > duration. In other words, you can use a UDF to "explode" the original
> > > > > > > record:
> > > > > > >
> > > > > > > 1,marco,1319708213,500,math
> > > > > > >
> > > > > > > into:
> > > > > > >
> > > > > > > 1,marco,1319708190,500,math
> > > > > > > 1,marco,1319708220,500,math
> > > > > > > 1,marco,1319708250,500,math
> > > > > > > 1,marco,1319708280,500,math
> > > > > > > 1,marco,1319708310,500,math
> > > > > > > 1,marco,1319708340,500,math
> > > > > > > 1,marco,1319708370,500,math
> > > > > > > 1,marco,1319708400,500,math
> > > > > > > 1,marco,1319708430,500,math
> > > > > > > 1,marco,1319708460,500,math
> > > > > > > 1,marco,1319708490,500,math
> > > > > > > 1,marco,1319708520,500,math
> > > > > > > 1,marco,1319708550,500,math
> > > > > > > 1,marco,1319708580,500,math
> > > > > > > 1,marco,1319708610,500,math
> > > > > > > 1,marco,1319708640,500,math
> > > > > > > 1,marco,1319708670,500,math
> > > > > > > 1,marco,1319708700,500,math
> > > > > > >
> > > > > > > and then use Bill's suggestion to group by course, interval.
> > > > > > >
> > > > > > > Norbert
> > > > > > >
> > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[email protected]> wrote:
> > > > > > > > You can pass your time to a UDF that rounds it down to the nearest
> > > > > > > > 30 second interval and then group by course, interval to get counts
> > > > > > > > for each course, interval.
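> > > > > > > >
> > > > > > > > For example (an untested sketch; it assumes the timestamp is an
> > > > > > > > integral epoch-seconds value, and RoundTo30s is a made-up name):
> > > > > > > >
> > > > > > > > import java.io.IOException;
> > > > > > > > import org.apache.pig.EvalFunc;
> > > > > > > > import org.apache.pig.data.Tuple;
> > > > > > > >
> > > > > > > > // floors an epoch-seconds timestamp to its 30s boundary
> > > > > > > > public class RoundTo30s extends EvalFunc<Long> {
> > > > > > > >     @Override
> > > > > > > >     public Long exec(Tuple input) throws IOException {
> > > > > > > >         if (input == null || input.get(0) == null) return null;
> > > > > > > >         long t = ((Number) input.get(0)).longValue();
> > > > > > > >         return (t / 30) * 30;
> > > > > > > >     }
> > > > > > > > }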
> > > > > > > >
> > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <[email protected]> wrote:
> > > > > > > >> I have a problem where I don't know how or if Pig is even suitable
> > > > > > > >> to solve it.
> > > > > > > >>
> > > > > > > >> I have a schema like this:
> > > > > > > >>
> > > > > > > >> student-id,student-name,start-time,duration,course
> > > > > > > >> 1,marco,1319708213,500,math
> > > > > > > >> 2,ralf,1319708111,112,english
> > > > > > > >> 3,greg,1319708321,333,french
> > > > > > > >> 4,diva,1319708444,80,english
> > > > > > > >> 5,susanne,1319708123,2000,math
> > > > > > > >> 1,marco,1319708564,500,french
> > > > > > > >> 2,ralf,1319708789,123,french
> > > > > > > >> 7,fred,1319708213,5675,french
> > > > > > > >> 8,laura,1319708233,123,math
> > > > > > > >> 10,sab,1319708999,777,math
> > > > > > > >> 11,fibo,1319708789,565,math
> > > > > > > >> 6,dan,1319708456,50,english
> > > > > > > >> 9,marco,1319708123,60,english
> > > > > > > >> 12,bo,1319708456,345,math
> > > > > > > >> 1,marco,1319708789,673,math
> > > > > > > >> ...
> > > > > > > >> ...
> > > > > > > >>
> > > > > > > >> I would like to retrieve a graph (interpolation) over time, grouped
> > > > > > > >> by course, meaning how many students are learning for a course,
> > > > > > > >> based on a 30 sec interval.
> > > > > > > >> The grouping by course is easy, but from there I have no clue how I
> > > > > > > >> would achieve the rest. I guess it needs to be achieved via some
> > > > > > > >> UDF, or is there any way to do this in Pig? I often think that I
> > > > > > > >> need a "for loop" or something similar in Pig.
> > > > > > > >>
> > > > > > > >> Thanks for your help!
> > > > > > > >> -Marco
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
