The data is not about students but about television ;) Regarding the size:
the raw input data is about 150 MB, although when I 'explode' the time series
it will be around 80x bigger. I guess the average session duration will be
around 40 minutes, which at a 30 s sampling interval means about 80 rows per
record and brings the total size to ~12 GB.
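
As a quick sanity check of that estimate, here is a minimal Python sketch. It
assumes "150m" means roughly 150 MB and that every exploded row is about as
large as a raw row; both are assumptions, and the variable names are made up.

```python
# Back-of-envelope estimate using the rough figures from this mail.
avg_session_s = 40 * 60   # guessed average session length: 40 minutes
interval_s = 30           # sampling interval in seconds
raw_mb = 150              # raw input size, assumed to be ~150 MB

# Each record is exploded into one row per 30 s interval it covers.
rows_per_session = avg_session_s // interval_s
exploded_mb = raw_mb * rows_per_session

print(rows_per_session)    # 80
print(exploded_mb / 1000)  # 12.0 (GB, decimal units)
```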

I think that is a size my Hadoop cluster (five nodes, each with 8 cores,
8 GB RAM and a 2 TB disk) should be able to cope with.

I don't know about R. Are you able to handle 12 GB
files well in R (of course it depends on your computer, so assume an
average business computer, e.g. 2-core 2 GHz, 4 GB RAM)?

Cheers
-Marco

On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[email protected]> wrote:

> if it fits in R, it's trivial: draw a density plot or a histogram, about
> three lines of R code
>
> that's why I was wondering about the data volume.
>
> His example is students attending classes; if that is really the data, it's
> hard to believe it's super huge?
>
> Guy
>
> On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[email protected]
> >wrote:
>
> > Perhaps another way to approach this problem is to visualize it
> > geometrically.  You have a long series of class session instances, where
> > each class session is like a 1D line segment, beginning/stopping at some
> > start/end time.
> >
> > These segments naturally overlap, and I think the question you're asking
> > is equivalent to finding the number of overlaps at every subsegment.
> >
> > To answer this, you want to first break every class session into a full
> > list of subsegments, where a subsegment is created by "breaking" each
> > class session/segment into multiple parts at the start/end point of any
> > other class session.  You can create this full set of subsegments in one
> > pass by comparing pairwise (CROSS) each start/end point with your original
> > list of class sessions.
> >
> > Once you have the full list of "broken" segments, then a final GROUP
> > BY/COUNT(*) will give you the number of overlaps.  Seems like this
> > approach would be faster than the previous approach if your class sessions
> > are very long, or there are many overlaps.
> >
> > Norbert
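
The subsegment idea quoted above can be sketched in plain Python (not Pig).
`overlap_counts` is a hypothetical helper name. The intervals are the four
english rows from the sample data in this thread, written as offsets from
1319708000 for readability. The inner comparison of every subsegment against
every session plays the role of the CROSS, and the sum plays the role of the
GROUP BY/COUNT(*).

```python
def overlap_counts(sessions):
    """Count overlapping sessions on every subsegment.

    sessions: list of (start, end) pairs with start < end.
    Returns a list of (seg_start, seg_end, count) tuples, one per pair of
    consecutive breakpoints.
    """
    # Every start and end point is a breakpoint.
    points = sorted({p for start, end in sessions for p in (start, end)})
    result = []
    for lo, hi in zip(points, points[1:]):
        # A session covers the subsegment if it starts at or before lo
        # and ends at or after hi.
        n = sum(1 for start, end in sessions if start <= lo and hi <= end)
        result.append((lo, hi, n))
    return result


# (start, start + duration) as offsets from 1319708000
english = [
    (111, 111 + 112),  # ralf
    (444, 444 + 80),   # diva
    (456, 456 + 50),   # dan
    (123, 123 + 60),   # marco
]

for segment in overlap_counts(english):
    print(segment)
```

Subsegments with a count of 0 are gaps in which nobody was in an english
session.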
> >
> > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[email protected]>
> wrote:
> >
> > > how big is your dataset?
> > >
> > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[email protected]>
> wrote:
> > >
> > > > > Thanks Bill and Norbert, that seems like what I was looking for. I'm a
> > > > > bit worried about how much data/IO this could create. But I'll see ;)
> > > >
> > > > Cheers
> > > > -Marco
> > > >
> > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
> > > [email protected]
> > > > >wrote:
> > > >
> > > > > > In case what you're looking for is an analysis over the full
> > > > > > learning duration, and not just the start interval, then one
> > > > > > further insight is that each original record can be transformed
> > > > > > into a sequence of records, where the size of the sequence
> > > > > > corresponds to the session duration.  In other words, you can use
> > > > > > a UDF to "explode" the original record:
> > > > >
> > > > > 1,marco,1319708213,500,math
> > > > >
> > > > > into:
> > > > >
> > > > > 1,marco,1319708190,500,math
> > > > > 1,marco,1319708220,500,math
> > > > > 1,marco,1319708250,500,math
> > > > > 1,marco,1319708280,500,math
> > > > > 1,marco,1319708310,500,math
> > > > > 1,marco,1319708340,500,math
> > > > > 1,marco,1319708370,500,math
> > > > > 1,marco,1319708400,500,math
> > > > > 1,marco,1319708430,500,math
> > > > > 1,marco,1319708460,500,math
> > > > > 1,marco,1319708490,500,math
> > > > > 1,marco,1319708520,500,math
> > > > > 1,marco,1319708550,500,math
> > > > > 1,marco,1319708580,500,math
> > > > > 1,marco,1319708610,500,math
> > > > > 1,marco,1319708640,500,math
> > > > > 1,marco,1319708670,500,math
> > > > > 1,marco,1319708700,500,math
> > > > >
> > > > > and then use Bill's suggestion to group by course, interval.
> > > > >
> > > > > Norbert
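
A minimal Python sketch of the "explode" step quoted above, followed by the
group-by (this is not a Pig UDF; `explode` and the variable names are made up
for illustration). It assumes the last bucket is the one containing
start + duration, which reproduces the 18 rows listed in the example. The
rows are the first three math sessions from the sample data in this thread.

```python
from collections import Counter

INTERVAL = 30  # seconds


def explode(start, duration, interval=INTERVAL):
    """Return the start of every interval bucket a session touches."""
    first = start - start % interval  # bucket containing the session start
    end = start + duration
    last = end - end % interval       # bucket containing the session end
    return list(range(first, last + interval, interval))


buckets = explode(1319708213, 500)
print(len(buckets), buckets[0], buckets[-1])  # 18 1319708190 1319708700

# (student_id, name, start_time, duration, course)
rows = [
    (1, "marco", 1319708213, 500, "math"),
    (5, "susanne", 1319708123, 2000, "math"),
    (8, "laura", 1319708233, 123, "math"),
]

# Group by (course, interval) over the exploded rows.
counts = Counter(
    (course, b)
    for _, _, start, duration, course in rows
    for b in explode(start, duration)
)
print(counts[("math", 1319708220)])  # 3
```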
> > > > >
> > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
> [email protected]>
> > > > > wrote:
> > > > > > > You can pass your time to a UDF that rounds it down to the
> > > > > > > nearest 30 second interval and then group by course, interval to
> > > > > > > get counts for each course, interval.
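
A minimal Python sketch of that rounding-and-grouping step (not Pig; `bucket`
and the variable names are made up for illustration, and the rows are a
handful taken from the sample data further down in this thread). Note that it
only counts sessions by the interval in which they start.

```python
from collections import Counter

INTERVAL = 30  # seconds


def bucket(ts, interval=INTERVAL):
    """Round a Unix timestamp down to the start of its interval."""
    return ts - ts % interval


# (student_id, name, start_time, duration, course)
rows = [
    (1, "marco", 1319708213, 500, "math"),
    (2, "ralf", 1319708111, 112, "english"),
    (5, "susanne", 1319708123, 2000, "math"),
    (8, "laura", 1319708233, 123, "math"),
    (9, "marco", 1319708123, 60, "english"),
]

# Group by (course, interval) and count the session starts in each group.
start_counts = Counter((course, bucket(start)) for _, _, start, _, course in rows)

for key, n in sorted(start_counts.items()):
    print(key, n)
```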
> > > > > >
> > > > > > On Thursday, October 27, 2011, Marco Cadetg <[email protected]>
> > > wrote:
> > > > > > >> I have a problem where I don't know how or if pig is even
> > > > > > >> suitable to solve it.
> > > > > >>
> > > > > >> I have a schema like this:
> > > > > >>
> > > > > >> student-id,student-name,start-time,duration,course
> > > > > >> 1,marco,1319708213,500,math
> > > > > >> 2,ralf,1319708111,112,english
> > > > > >> 3,greg,1319708321,333,french
> > > > > >> 4,diva,1319708444,80,english
> > > > > >> 5,susanne,1319708123,2000,math
> > > > > >> 1,marco,1319708564,500,french
> > > > > >> 2,ralf,1319708789,123,french
> > > > > >> 7,fred,1319708213,5675,french
> > > > > >> 8,laura,1319708233,123,math
> > > > > >> 10,sab,1319708999,777,math
> > > > > >> 11,fibo,1319708789,565,math
> > > > > >> 6,dan,1319708456,50,english
> > > > > >> 9,marco,1319708123,60,english
> > > > > >> 12,bo,1319708456,345,math
> > > > > >> 1,marco,1319708789,673,math
> > > > > >> ...
> > > > > >> ...
> > > > > >>
> > > > > > >> I would like to retrieve a graph (interpolation) over time
> > > > > > >> grouped by course, meaning how many students are learning for a
> > > > > > >> course based on a 30 sec interval.
> > > > > > >> The grouping by course is easy, but from there I've no clue how I
> > > > > > >> would achieve the rest. I guess the rest needs to be achieved via
> > > > > > >> some UDF, or is there any way to do this in pig? I often think
> > > > > > >> that I need a "for loop" or something similar in pig.
> > > > > >>
> > > > > >> Thanks for your help!
> > > > > >> -Marco
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
