The data is not about students but about television ;) Regarding the size: the raw input data is about 150 MB, although when I 'explode' the timeseries it will be around 80x bigger. I guess the average user duration will be around 40 minutes, which means that sampling it at a 30-second interval will increase the size to roughly 12 GB.
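As a sanity check on those numbers (assuming the 40-minute average duration and 30 s sampling interval above), the blow-up works out like this in a couple of lines of Python:

```python
avg_duration_s = 40 * 60   # assumed average viewing duration: 40 minutes
interval_s = 30            # sampling interval
raw_size_mb = 150          # raw input size, ~150 MB

rows_per_record = avg_duration_s // interval_s        # samples per record
exploded_size_gb = raw_size_mb * rows_per_record / 1024

print(rows_per_record, round(exploded_size_gb, 1))    # 80 -> ~11.7 GB
```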
I think that is a size which my Hadoop cluster with five 8-core x 8GB x 2TB HD
nodes should be able to cope with. I don't know about R. Are you able to handle
12 GB files well in R (of course it depends on your computer, so assume an
average business computer, e.g. 2-core 2GHz 4GB RAM)?

Cheers
-Marco

On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[email protected]> wrote:

> if it fits in R, it's trivial, draw a density plot or a histogram, about
> three lines of R code
>
> which is why I was wondering about the data volume.
>
> His example is students attending classes; if that is really the data,
> it's hard to believe it's super huge?
>
> Guy
>
> On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[email protected]> wrote:
>
> > Perhaps another way to approach this problem is to visualize it
> > geometrically. You have a long series of class session instances, where
> > each class session is like a 1D line segment, beginning/stopping at some
> > start/end time.
> >
> > These segments naturally overlap, and I think the question you're asking
> > is equivalent to finding the number of overlaps at every subsegment.
> >
> > To answer this, you want to first break every class session into a full
> > list of subsegments, where a subsegment is created by "breaking" each
> > class session/segment into multiple parts at the start/end point of any
> > other class session. You can create this full set of subsegments in one
> > pass by comparing pairwise (CROSS) each start/end point with your
> > original list of class sessions.
> >
> > Once you have the full list of "broken" segments, then a final GROUP
> > BY/COUNT(*) will give you the number of overlaps. This approach seems
> > like it would be faster than the previous one if your class sessions are
> > very long, or there are many overlaps.
> >
> > Norbert
> >
> > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[email protected]> wrote:
> >
> > > how big is your dataset?
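For anyone following along, here is a minimal in-memory Python sketch of the subsegment idea Norbert describes (the function name and representation are mine; in Pig the same computation would be the CROSS plus GROUP BY/COUNT(*) he mentions):

```python
def overlap_counts(sessions):
    """Break each [start, end) segment at every other segment's boundary,
    then count how many sessions cover each resulting subsegment."""
    # All distinct start/end points define the subsegment boundaries.
    points = sorted({t for s, e in sessions for t in (s, e)})
    pieces = zip(points, points[1:])
    # A session covers a piece iff it spans the whole piece.
    return [(lo, hi, sum(1 for s, e in sessions if s <= lo and hi <= e))
            for lo, hi in pieces]

# Three overlapping sessions: the middle piece is covered by all three.
print(overlap_counts([(0, 100), (50, 150), (60, 70)]))
# -> [(0, 50, 1), (50, 60, 2), (60, 70, 3), (70, 100, 2), (100, 150, 1)]
```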
> > >
> > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[email protected]> wrote:
> > >
> > > > Thanks Bill and Norbert, that seems like what I was looking for. I'm
> > > > a bit worried about how much data/IO this could create. But I'll
> > > > see ;)
> > > >
> > > > Cheers
> > > > -Marco
> > > >
> > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[email protected]> wrote:
> > > >
> > > > > In case what you're looking for is an analysis over the full
> > > > > learning duration, and not just the start interval, then one
> > > > > further insight is that each original record can be transformed
> > > > > into a sequence of records, where the size of the sequence
> > > > > corresponds to the session duration. In other words, you can use a
> > > > > UDF to "explode" the original record:
> > > > >
> > > > > 1,marco,1319708213,500,math
> > > > >
> > > > > into:
> > > > >
> > > > > 1,marco,1319708190,500,math
> > > > > 1,marco,1319708220,500,math
> > > > > 1,marco,1319708250,500,math
> > > > > 1,marco,1319708280,500,math
> > > > > 1,marco,1319708310,500,math
> > > > > 1,marco,1319708340,500,math
> > > > > 1,marco,1319708370,500,math
> > > > > 1,marco,1319708400,500,math
> > > > > 1,marco,1319708430,500,math
> > > > > 1,marco,1319708460,500,math
> > > > > 1,marco,1319708490,500,math
> > > > > 1,marco,1319708520,500,math
> > > > > 1,marco,1319708550,500,math
> > > > > 1,marco,1319708580,500,math
> > > > > 1,marco,1319708610,500,math
> > > > > 1,marco,1319708640,500,math
> > > > > 1,marco,1319708670,500,math
> > > > > 1,marco,1319708700,500,math
> > > > >
> > > > > and then use Bill's suggestion to group by course, interval.
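A sketch of what such an explode UDF computes, in plain Python for illustration (the function name is mine; it rounds the first and last covered timestamps down to 30-second boundaries, matching the example above):

```python
def explode(start, duration, interval=30):
    """Expand one (start, duration) record into one timestamp per
    interval it touches, rounded down to interval boundaries."""
    first = start - start % interval
    last = (start + duration) - (start + duration) % interval
    return list(range(first, last + 1, interval))

rows = explode(1319708213, 500)
print(rows[0], rows[-1], len(rows))  # 1319708190 1319708700 18
```

In Pig this would be an EvalFunc returning a bag of timestamps, which you then FLATTEN before grouping.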
> > > > >
> > > > > Norbert
> > > > >
> > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[email protected]> wrote:
> > > > >
> > > > > > You can pass your time to a UDF that rounds it down to the
> > > > > > nearest 30-second interval and then group by course, interval to
> > > > > > get counts for each course, interval.
> > > > > >
> > > > > > On Thursday, October 27, 2011, Marco Cadetg <[email protected]> wrote:
> > > > > >
> > > > > >> I have a problem where I don't know how, or if, Pig is even
> > > > > >> suitable to solve it.
> > > > > >>
> > > > > >> I have a schema like this:
> > > > > >>
> > > > > >> student-id,student-name,start-time,duration,course
> > > > > >> 1,marco,1319708213,500,math
> > > > > >> 2,ralf,1319708111,112,english
> > > > > >> 3,greg,1319708321,333,french
> > > > > >> 4,diva,1319708444,80,english
> > > > > >> 5,susanne,1319708123,2000,math
> > > > > >> 1,marco,1319708564,500,french
> > > > > >> 2,ralf,1319708789,123,french
> > > > > >> 7,fred,1319708213,5675,french
> > > > > >> 8,laura,1319708233,123,math
> > > > > >> 10,sab,1319708999,777,math
> > > > > >> 11,fibo,1319708789,565,math
> > > > > >> 6,dan,1319708456,50,english
> > > > > >> 9,marco,1319708123,60,english
> > > > > >> 12,bo,1319708456,345,math
> > > > > >> 1,marco,1319708789,673,math
> > > > > >> ...
> > > > > >> ...
> > > > > >>
> > > > > >> I would like to retrieve a graph (interpolation) over time,
> > > > > >> grouped by course, meaning how many students are learning for a
> > > > > >> course, based on a 30-second interval.
> > > > > >> The grouping by course is easy, but from there I have no clue
> > > > > >> how I would achieve the rest. I guess the rest needs to be
> > > > > >> achieved via some UDF, or is there some way to do this in Pig?
> > > > > >> I often think that I need a "for loop" or something similar in
> > > > > >> Pig.
> > > > > >>
> > > > > >> Thanks for your help!
> > > > > >> -Marco
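Putting Bill's rounding suggestion together with the explode step, the per-(course, interval) counts Marco asks for reduce to a group-and-count. A minimal Python sketch over a few of the sample rows (names are mine; this only illustrates what the Pig UDF plus GROUP BY course, interval would compute):

```python
from collections import Counter

# A few of Marco's sample rows: (id, name, start, duration, course)
records = [
    (1, "marco", 1319708213, 500, "math"),
    (2, "ralf", 1319708111, 112, "english"),
    (5, "susanne", 1319708123, 2000, "math"),
]

def explode(start, duration, interval=30):
    """Every interval boundary covered by [start, start + duration]."""
    first = start - start % interval
    return range(first, start + duration + 1, interval)

# Count viewers per (course, 30 s bucket) -- the GROUP BY course, interval.
counts = Counter((course, t)
                 for _, _, start, dur, course in records
                 for t in explode(start, dur))

print(counts[("math", 1319708220)])  # marco and susanne overlap in math -> 2
```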
