ahh TV that explains it

A 12GB data file is a bit too big for R unless you sample; not sure if the
use case is conducive to sampling?

If it is, you could sample it down and structure it in Pig/Hadoop, and then
load it into the analytical/visualization tool of choice...
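
For the sampling step, one approach that doesn't require knowing the file
size up front is reservoir sampling. A minimal Python sketch (the data and
sample size here are made up for illustration):

```python
import random

def reservoir_sample(lines, k, seed=42):
    """Pick k items uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Each new line replaces a kept one with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# Example: keep 3 of 100 synthetic lines.
lines = [f"row-{i}" for i in range(100)]
picked = reservoir_sample(lines, 3)
print(picked)
```

The same idea ports to a streaming job; Pig's SAMPLE operator is simpler
still (a per-row coin flip) if an exact sample size isn't needed.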

Guy

On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[email protected]> wrote:

> The data is not about students but about television ;) Regarding the size:
> the raw input data size is about 150MB, although when I 'explode' the
> timeseries it will be around 80x bigger. I guess the average user duration
> will be around 40 minutes, which means sampling it at a 30s interval will
> increase the size to ~12GB.
>
> I think that is a size which my Hadoop cluster of five 8-core x 8GB x 2TB
> HD machines should be able to cope with.
>
> I don't know about R. Are you able to handle 12GB files well in R (of
> course it depends on your computer, so assume an average business computer,
> e.g. 2-core 2GHz 4GB RAM)?
>
> Cheers
> -Marco
>
> On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[email protected]> wrote:
>
> > If it fits in R, it's trivial: draw a density plot or a histogram, about
> > three lines of R code.
> >
> > That's why I was wondering about the data volume.
> >
> > His example is students attending classes; if that is really the data,
> > it's hard to believe it's super huge?
> >
> > Guy
> >
> > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[email protected]> wrote:
> >
> > > Perhaps another way to approach this problem is to visualize it
> > > geometrically.  You have a long series of class session instances,
> > > where each class session is like a 1D line segment, beginning/stopping
> > > at some start/end time.
> > >
> > > These segments naturally overlap, and I think the question you're
> > > asking is equivalent to finding the number of overlaps at every
> > > subsegment.
> > >
> > > To answer this, you first want to break every class session into a
> > > full list of subsegments, where a subsegment is created by "breaking"
> > > each class session/segment into multiple parts at the start/end point
> > > of any other class session.  You can create this full set of
> > > subsegments in one pass by comparing pairwise (CROSS) each start/end
> > > point with your original list of class sessions.
> > >
> > > Once you have the full list of "broken" segments, a final GROUP
> > > BY/COUNT(*) will give you the number of overlaps.  This approach seems
> > > like it would be faster than the previous one if your class sessions
> > > are very long, or there are many overlaps.
> > >
> > > Norbert
> > >
> > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[email protected]> wrote:
> > >
> > > > how big is your dataset?
> > > >
> > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[email protected]> wrote:
> > > >
> > > > > Thanks Bill and Norbert, that seems like what I was looking for.
> > > > > I'm a bit worried about how much data/IO this could create, but
> > > > > I'll see ;)
> > > > >
> > > > > Cheers
> > > > > -Marco
> > > > >
> > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[email protected]> wrote:
> > > > >
> > > > > > In case what you're looking for is an analysis over the full
> > > > > > learning duration, and not just the start interval, then one
> > > > > > further insight is that each original record can be transformed
> > > > > > into a sequence of records, where the size of the sequence
> > > > > > corresponds to the session duration.  In other words, you can
> > > > > > use a UDF to "explode" the original record:
> > > > > >
> > > > > > 1,marco,1319708213,500,math
> > > > > >
> > > > > > into:
> > > > > >
> > > > > > 1,marco,1319708190,500,math
> > > > > > 1,marco,1319708220,500,math
> > > > > > 1,marco,1319708250,500,math
> > > > > > 1,marco,1319708280,500,math
> > > > > > 1,marco,1319708310,500,math
> > > > > > 1,marco,1319708340,500,math
> > > > > > 1,marco,1319708370,500,math
> > > > > > 1,marco,1319708400,500,math
> > > > > > 1,marco,1319708430,500,math
> > > > > > 1,marco,1319708460,500,math
> > > > > > 1,marco,1319708490,500,math
> > > > > > 1,marco,1319708520,500,math
> > > > > > 1,marco,1319708550,500,math
> > > > > > 1,marco,1319708580,500,math
> > > > > > 1,marco,1319708610,500,math
> > > > > > 1,marco,1319708640,500,math
> > > > > > 1,marco,1319708670,500,math
> > > > > > 1,marco,1319708700,500,math
> > > > > >
> > > > > > and then use Bill's suggestion to group by course, interval.
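
As a plain-Python illustration of that explode step (the real thing would be
a Pig UDF; field order follows the example records above):

```python
def explode(record, interval=30):
    """Turn one (id, name, start, duration, course) record into one row
    per interval tick the session covers, start rounded down to the grid."""
    sid, name, start, duration, course = record
    first = start - start % interval  # round down to nearest 30s boundary
    last = start + duration           # session end time
    return [(sid, name, t, duration, course)
            for t in range(first, last, interval)]

rows = explode((1, "marco", 1319708213, 500, "math"))
print(rows[0])    # (1, 'marco', 1319708190, 500, 'math')
print(len(rows))  # 18
```

After exploding, grouping by (course, timestamp) and counting rows gives the
per-interval student counts.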
> > > > > >
> > > > > > Norbert
> > > > > >
> > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[email protected]> wrote:
> > > > > > > You can pass your time to a UDF that rounds it down to the
> > > > > > > nearest 30-second interval, and then group by (course,
> > > > > > > interval) to get counts for each course and interval.
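
The rounding Bill describes is just integer arithmetic; a small Python
sketch with a toy in-memory group-by standing in for Pig's GROUP ... BY
(the records here are made up for illustration):

```python
from collections import Counter

def to_interval(ts, width=30):
    """Round a Unix timestamp down to the nearest width-second boundary."""
    return ts - ts % width

# Toy (course, start-time) records.
records = [("math", 1319708213), ("math", 1319708214), ("english", 1319708111)]
counts = Counter((course, to_interval(ts)) for course, ts in records)
print(counts[("math", 1319708190)])  # → 2, both math rows share a bucket
```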
> > > > > > >
> > > > > > > On Thursday, October 27, 2011, Marco Cadetg <[email protected]> wrote:
> > > > > > >> I have a problem where I don't know how, or whether, Pig is
> > > > > > >> even suitable to solve it.
> > > > > > >>
> > > > > > >> I have a schema like this:
> > > > > > >>
> > > > > > >> student-id,student-name,start-time,duration,course
> > > > > > >> 1,marco,1319708213,500,math
> > > > > > >> 2,ralf,1319708111,112,english
> > > > > > >> 3,greg,1319708321,333,french
> > > > > > >> 4,diva,1319708444,80,english
> > > > > > >> 5,susanne,1319708123,2000,math
> > > > > > >> 1,marco,1319708564,500,french
> > > > > > >> 2,ralf,1319708789,123,french
> > > > > > >> 7,fred,1319708213,5675,french
> > > > > > >> 8,laura,1319708233,123,math
> > > > > > >> 10,sab,1319708999,777,math
> > > > > > >> 11,fibo,1319708789,565,math
> > > > > > >> 6,dan,1319708456,50,english
> > > > > > >> 9,marco,1319708123,60,english
> > > > > > >> 12,bo,1319708456,345,math
> > > > > > >> 1,marco,1319708789,673,math
> > > > > > >> ...
> > > > > > >> ...
> > > > > > >>
> > > > > > >> I would like to retrieve a graph (interpolation) over time,
> > > > > > >> grouped by course: how many students are learning for a
> > > > > > >> course, based on a 30-second interval. The grouping by course
> > > > > > >> is easy, but from there I have no clue how I would achieve
> > > > > > >> the rest. I guess it needs to be done via some UDF, or is
> > > > > > >> there a way to do this in Pig? I often think that I need a
> > > > > > >> "for loop" or something similar in Pig.
> > > > > > >>
> > > > > > >> Thanks for your help!
> > > > > > >> -Marco
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
