Ahh, TV, that explains it. A 12G data file is a bit too big for R unless you
sample; not sure if the use case is conducive to sampling?

If it is, you could sample it down and structure it in pig/hadoop, and then
load it into the analytical/visualization tool of choice...

Guy
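A minimal Pig Latin sketch of that sampling step, assuming the ~12GB exploded
file already sits in HDFS as comma-separated rows (the path, schema, and
sample fraction below are placeholders, not anything from the thread):

-- Load the exploded records: one row per student, course, and 30s bucket.
sessions = LOAD 'exploded_sessions' USING PigStorage(',')
    AS (student_id:long, name:chararray, bucket:long, duration:long, course:chararray);

-- SAMPLE keeps roughly 1% of the rows; tune the fraction to what R can hold in memory.
sampled = SAMPLE sessions 0.01;

STORE sampled INTO 'sessions_sample_for_r' USING PigStorage(',');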
On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[email protected]> wrote:
> The data is not about students but about television ;) Regarding the size:
> the raw input data size is about 150MB, although when I 'explode' the
> timeseries it will be around 80x bigger. I guess the average user duration
> will be around 40 minutes, which means that sampling it at a 30s interval
> will increase the size by ~12GB.
>
> I think that is a size which my hadoop cluster with five 8-core x 8GB x 2TB
> HD nodes should be able to cope with.
>
> I don't know about R. Are you able to handle 12GB files well in R (of
> course it depends on your computer, so assume an average business computer,
> e.g. 2-core, 2GHz, 4GB RAM)?
>
> Cheers
> -Marco
>
> On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[email protected]> wrote:
> > If it fits in R, it's trivial: draw a density plot or a histogram, about
> > three lines of R code.
> >
> > Which is why I was wondering about the data volume.
> >
> > His example is students attending classes; if that is really the data,
> > hard to believe it's super huge?
> >
> > Guy
> >
> > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[email protected]> wrote:
> > > Perhaps another way to approach this problem is to visualize it
> > > geometrically. You have a long series of class session instances, where
> > > each class session is like a 1D line segment, beginning/stopping at
> > > some start/end time.
> > >
> > > These segments naturally overlap, and I think the question you're
> > > asking is equivalent to finding the number of overlaps at every
> > > subsegment.
> > >
> > > To answer this, you want to first break every class session into a full
> > > list of subsegments, where a subsegment is created by "breaking" each
> > > class session/segment into multiple parts at the start/end point of any
> > > other class session. You can create this full set of subsegments in one
> > > pass by comparing pairwise (CROSS) each start/end point with your
> > > original list of class sessions.
> > >
> > > Once you have the full list of "broken" segments, then a final GROUP
> > > BY/COUNT(*) will give you the number of overlaps. This approach seems
> > > like it would be faster than the previous one if your class sessions
> > > are very long, or there are many overlaps.
> > >
> > > Norbert
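A rough Pig Latin sketch of the segment/overlap idea above; the field names
and paths are invented here, and a JOIN on course stands in for the pairwise
CROSS Norbert mentions, since overlaps only matter within the same course:

raw = LOAD 'sessions.csv' USING PigStorage(',')
    AS (student_id:long, name:chararray, start_ts:long, duration:long, course:chararray);

-- Each session is a segment [start_ts, stop_ts) on the time axis.
sessions = FOREACH raw GENERATE course, start_ts, start_ts + duration AS stop_ts;

-- Every start/end time is a potential break point between subsegments.
starts = FOREACH sessions GENERATE course, start_ts AS t;
stops = FOREACH sessions GENERATE course, stop_ts AS t;
points = UNION starts, stops;
uniq_points = DISTINCT points;

-- Pair every break point with every session of the same course.
paired = JOIN uniq_points BY course, sessions BY course;

-- A break point inside a session starts a subsegment of that session, so the
-- number of sessions covering each point is the overlap count for the
-- subsegment beginning at that point.
covered = FILTER paired BY uniq_points::t >= sessions::start_ts
    AND uniq_points::t < sessions::stop_ts;
grouped = GROUP covered BY (uniq_points::course, uniq_points::t);
overlaps = FOREACH grouped GENERATE FLATTEN(group) AS (course, t),
    COUNT(covered) AS concurrent;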
> > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[email protected]> wrote:
> > > > how big is your dataset?
> > > >
> > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[email protected]> wrote:
> > > > > Thanks Bill and Norbert, that seems like what I was looking for.
> > > > > I'm a bit worried about how much data/IO this could create. But
> > > > > I'll see ;)
> > > > >
> > > > > Cheers
> > > > > -Marco
> > > > >
> > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[email protected]> wrote:
> > > > > > In case what you're looking for is an analysis over the full
> > > > > > learning duration, and not just the start interval, then one
> > > > > > further insight is that each original record can be transformed
> > > > > > into a sequence of records, where the size of the sequence
> > > > > > corresponds to the session duration. In other words, you can use
> > > > > > a UDF to "explode" the original record:
> > > > > >
> > > > > > 1,marco,1319708213,500,math
> > > > > >
> > > > > > into:
> > > > > >
> > > > > > 1,marco,1319708190,500,math
> > > > > > 1,marco,1319708220,500,math
> > > > > > 1,marco,1319708250,500,math
> > > > > > 1,marco,1319708280,500,math
> > > > > > 1,marco,1319708310,500,math
> > > > > > 1,marco,1319708340,500,math
> > > > > > 1,marco,1319708370,500,math
> > > > > > 1,marco,1319708400,500,math
> > > > > > 1,marco,1319708430,500,math
> > > > > > 1,marco,1319708460,500,math
> > > > > > 1,marco,1319708490,500,math
> > > > > > 1,marco,1319708520,500,math
> > > > > > 1,marco,1319708550,500,math
> > > > > > 1,marco,1319708580,500,math
> > > > > > 1,marco,1319708610,500,math
> > > > > > 1,marco,1319708640,500,math
> > > > > > 1,marco,1319708670,500,math
> > > > > > 1,marco,1319708700,500,math
> > > > > >
> > > > > > and then use Bill's suggestion to group by course, interval.
> > > > > >
> > > > > > Norbert
> > > > > >
> > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[email protected]> wrote:
> > > > > > > You can pass your time to a UDF that rounds it down to the
> > > > > > > nearest 30 second interval, and then group by course, interval
> > > > > > > to get counts for each course, interval.
> > > > > > >
> > > > > > > On Thursday, October 27, 2011, Marco Cadetg <[email protected]> wrote:
> > > > > > > > I have a problem where I don't know how or if pig is even
> > > > > > > > suitable to solve it.
> > > > > > > >
> > > > > > > > I have a schema like this:
> > > > > > > >
> > > > > > > > student-id,student-name,start-time,duration,course
> > > > > > > > 1,marco,1319708213,500,math
> > > > > > > > 2,ralf,1319708111,112,english
> > > > > > > > 3,greg,1319708321,333,french
> > > > > > > > 4,diva,1319708444,80,english
> > > > > > > > 5,susanne,1319708123,2000,math
> > > > > > > > 1,marco,1319708564,500,french
> > > > > > > > 2,ralf,1319708789,123,french
> > > > > > > > 7,fred,1319708213,5675,french
> > > > > > > > 8,laura,1319708233,123,math
> > > > > > > > 10,sab,1319708999,777,math
> > > > > > > > 11,fibo,1319708789,565,math
> > > > > > > > 6,dan,1319708456,50,english
> > > > > > > > 9,marco,1319708123,60,english
> > > > > > > > 12,bo,1319708456,345,math
> > > > > > > > 1,marco,1319708789,673,math
> > > > > > > > ...
> > > > > > > > ...
> > > > > > > >
> > > > > > > > I would like to retrieve a graph (interpolation) over time
> > > > > > > > grouped by course, meaning how many students are learning for
> > > > > > > > a course based on a 30 sec interval. The grouping by course is
> > > > > > > > easy, but from there I've no clue how I would achieve the
> > > > > > > > rest. I guess the rest needs to be achieved via some UDF, or
> > > > > > > > is there any way to do this in pig? I often think that I need
> > > > > > > > a "for loop" or something similar in pig.
> > > > > > > >
> > > > > > > > Thanks for your help!
> > > > > > > > -Marco
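Putting Norbert's explode idea and Bill's round-and-group step together, a
rough Pig Latin sketch could look like the following; ExplodeIntervals is a
hypothetical Java UDF (not a Pig built-in) that returns a bag of
30-second-aligned timestamps covering [start, start + duration), and the jar
and file names are placeholders:

REGISTER myudfs.jar;  -- hypothetical jar holding the ExplodeIntervals UDF

raw = LOAD 'sessions.csv' USING PigStorage(',')
    AS (student_id:long, name:chararray, start_ts:long, duration:long, course:chararray);

-- One row per (student, course, 30s bucket) that the session covers, e.g.
-- 1,marco,1319708213,500,math expands to the 18 buckets 1319708190..1319708700.
exploded = FOREACH raw GENERATE student_id, course,
    FLATTEN(myudfs.ExplodeIntervals(start_ts, duration, 30)) AS bucket:long;

-- Active students per course and 30s bucket: the time series to plot.
grouped = GROUP exploded BY (course, bucket);
counts = FOREACH grouped GENERATE FLATTEN(group) AS (course, bucket),
    COUNT(exploded) AS students;

STORE counts INTO 'students_per_course_per_30s' USING PigStorage(',');

Note that COUNT here counts sessions; if the same student could have
overlapping sessions in one course, a nested DISTINCT on student_id inside the
final FOREACH would count students instead.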
