Thanks again for all your comments. Jonathan, would you mind enlightening me on how you would keep track of the people you need to "eject"? I don't get the min-heap-based tuple...
Cheers,
-Marco

On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <[email protected]> wrote:

Perhaps I'm misunderstanding your use case, and this depends on the amount of data, but you could consider something like this (to avoid exploding the data, which may be unavoidable, but I hate resorting to that if I don't have to):

    a = foreach yourdata generate student_id, start_time,
        start_time + duration as end_time, course;
    b = group a by course;
    c = foreach b {
        ord = order a by start_time;
        generate yourudf.process(ord);
    }

Here is generally what process could do. It would be an accumulator UDF that expects tuples sorted on start_time, and you basically need a way to know who the distinct users are. Since you want 30s windows, your first window will presumably end 30s after the first start_time in your data, and you would tick ahead in 1s steps, writing to a bag of (second, # of distinct student_ids). To know when to eject people, you could use any number of data structures... perhaps a min heap keyed on end_time. And instead of literally "ticking" ahead, you would grab a new tuple (since that is the only thing that changes the number of distinct ids) and then do all of the ticking at once: for every second between the current time pointer and the start_time of the new tuple, check the top of the min heap, eject any users that have expired, and write out the count for that second.

That was a little rambly; I could quickly put together some more reasonable pseudocode if that would help. I think the general idea is clear though...
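As a rough Java sketch of that loop (untested and illustrative only; the names are invented, and it assumes sessions of (student_id, start_time, end_time) arriving sorted by start_time). One subtlety: a student can have two overlapping sessions, so it reference-counts ids instead of simply removing them:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;

    // Sketch of the "tick ahead + min heap" idea. Input: sessions sorted by
    // startTime. Output: one (second, distinctStudentCount) pair per second.
    public class DistinctOverTime {

        static class Session {
            final String studentId;
            final long startTime, endTime;
            Session(String studentId, long startTime, long endTime) {
                this.studentId = studentId;
                this.startTime = startTime;
                this.endTime = endTime;
            }
        }

        public static List<long[]> process(List<Session> sorted) {
            // Min heap keyed on endTime: the top is always the next session to expire.
            PriorityQueue<Session> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.endTime, b.endTime));
            // Reference count per student, since one student's sessions may overlap.
            Map<String, Integer> active = new HashMap<>();
            List<long[]> out = new ArrayList<>();

            long clock = sorted.isEmpty() ? 0 : sorted.get(0).startTime;
            for (Session s : sorted) {
                // Tick from the current time pointer up to this tuple's startTime,
                // ejecting expired sessions and recording one count per second.
                for (; clock < s.startTime; clock++) {
                    eject(heap, active, clock);
                    out.add(new long[] { clock, active.size() });
                }
                heap.add(s);
                active.merge(s.studentId, 1, Integer::sum);
            }
            // Drain: keep ticking until every remaining session has expired.
            while (!heap.isEmpty()) {
                eject(heap, active, clock);
                out.add(new long[] { clock, active.size() });
                clock++;
            }
            return out;
        }

        // Pop every session whose endTime has passed and drop its student's refcount.
        private static void eject(PriorityQueue<Session> heap,
                                  Map<String, Integer> active, long now) {
            while (!heap.isEmpty() && heap.peek().endTime <= now) {
                Session done = heap.poll();
                active.merge(done.studentId, -1, Integer::sum);
                if (active.get(done.studentId) == 0) active.remove(done.studentId);
            }
        }
    }

Wrapping this in an actual Pig Accumulator (accumulate() feeding sessions in, getValue() returning the bag of (second, count) tuples) would be mostly boilerplate on top.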
2011/10/31 Guy Bayes <[email protected]>:

Ahh, TV, that explains it.

A 12 GB data file is a bit too big for R unless you sample; not sure if the use case is conducive to sampling? If it is, you could sample it down, structure it in Pig/Hadoop, and then load it into the analytical/visualization tool of choice...

Guy

On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[email protected]> wrote:

The data is not about students but about television ;) Regarding the size: the raw input data is about 150 MB, but when I "explode" the timeseries it will get around 80x bigger. I guess the average user duration will be around 40 minutes, which means sampling it at a 30s interval will grow it to ~12 GB. I think that is a size my Hadoop cluster of five nodes (8-core, 8 GB RAM, 2 TB disk each) should be able to cope with.

I don't know about R. Can it handle 12 GB files well? (Of course it depends on the computer, so assume an average business machine, e.g. 2-core 2 GHz, 4 GB RAM.)

Cheers
-Marco

On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[email protected]> wrote:

If it fits in R, it's trivial: draw a density plot or a histogram, about three lines of R code. That's why I was wondering about the data volume. His example is students attending classes; if that is really the data, it's hard to believe it's super huge?

Guy

On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[email protected]> wrote:

Perhaps another way to approach this problem is to visualize it geometrically. You have a long series of class session instances, where each class session is like a 1D line segment, beginning and ending at some start/end time.

These segments naturally overlap, and I think the question you're asking is equivalent to finding the number of overlaps within every subsegment.

To answer this, you first break every class session into a full list of subsegments, where a subsegment is created by "breaking" each class session/segment into multiple parts at the start/end point of any other class session. You can create this full set of subsegments in one pass by comparing pairwise (CROSS) each start/end point with your original list of class sessions.

Once you have the full list of "broken" segments, a final GROUP BY/COUNT(*) will give you the number of overlaps. This approach seems like it would be faster than the previous one if your class sessions are very long, or if there are many overlaps.

Norbert
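To make the subsegment idea concrete, here is a tiny in-memory Java sketch (illustrative only; in Pig the cut points would come from the CROSS described above, not from a local TreeSet):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeSet;

    // Break each session at every other session's start/end point, then count
    // how many sessions produced each subsegment (the GROUP BY/COUNT(*) step).
    public class OverlapCount {
        public static void main(String[] args) {
            long[][] sessions = { {0, 100}, {50, 150}, {60, 80} };  // (start, end)

            // 1. Every start/end point is a potential cut point.
            TreeSet<Long> cuts = new TreeSet<>();
            for (long[] s : sessions) { cuts.add(s[0]); cuts.add(s[1]); }

            // 2. Break each session into subsegments at the cut points inside it
            //    (this is what the pairwise CROSS comparison produces).
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (long[] s : sessions) {
                List<Long> b = new ArrayList<>(cuts.subSet(s[0], true, s[1], true));
                for (int i = 0; i + 1 < b.size(); i++) {
                    counts.merge(b.get(i) + "-" + b.get(i + 1), 1, Integer::sum);
                }
            }
            // Prints e.g. "60-80 -> 3": three sessions overlap on [60, 80].
            counts.forEach((seg, n) -> System.out.println(seg + " -> " + n));
        }
    }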
On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[email protected]> wrote:

How big is your dataset?

On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[email protected]> wrote:

Thanks Bill and Norbert, that seems like what I was looking for. I'm a bit worried about how much data/IO this could create, but I'll see ;)

Cheers
-Marco

On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[email protected]> wrote:

In case what you're looking for is an analysis over the full learning duration, and not just the start interval, then one further insight is that each original record can be transformed into a sequence of records, where the size of the sequence corresponds to the session duration. In other words, you can use a UDF to "explode" the original record:

    1,marco,1319708213,500,math

into:

    1,marco,1319708190,500,math
    1,marco,1319708220,500,math
    1,marco,1319708250,500,math
    1,marco,1319708280,500,math
    1,marco,1319708310,500,math
    1,marco,1319708340,500,math
    1,marco,1319708370,500,math
    1,marco,1319708400,500,math
    1,marco,1319708430,500,math
    1,marco,1319708460,500,math
    1,marco,1319708490,500,math
    1,marco,1319708520,500,math
    1,marco,1319708550,500,math
    1,marco,1319708580,500,math
    1,marco,1319708610,500,math
    1,marco,1319708640,500,math
    1,marco,1319708670,500,math
    1,marco,1319708700,500,math

and then use Bill's suggestion to group by course, interval.

Norbert
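That explode UDF could look something like the sketch below (untested; the class name and details are mine). It also applies Bill's round-down-to-30-seconds so that ticks from different records land on the same boundaries; for the sample record above it yields exactly the 18 rows shown:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Given (start_time, duration), emit one single-field tuple per 30s tick
    // the session covers, each tick rounded down to a 30s boundary.
    public class ExplodeSession extends EvalFunc<DataBag> {
        private static final long INTERVAL = 30L;

        @Override
        public DataBag exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) return null;
            long start = ((Number) input.get(0)).longValue();
            long duration = ((Number) input.get(1)).longValue();

            DataBag bag = BagFactory.getInstance().newDefaultBag();
            long first = (start / INTERVAL) * INTERVAL;              // round down
            long last = ((start + duration) / INTERVAL) * INTERVAL;
            for (long tick = first; tick <= last; tick += INTERVAL) {
                Tuple t = TupleFactory.getInstance().newTuple(1);
                t.set(0, tick);
                bag.add(t);
            }
            return bag;
        }
    }

FLATTEN the returned bag in a FOREACH and you get one row per (record, tick), ready for the group by course, interval.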
On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[email protected]> wrote:

You can pass your time to a UDF that rounds it down to the nearest 30-second interval and then group by course, interval to get counts for each (course, interval).

On Thursday, October 27, 2011, Marco Cadetg <[email protected]> wrote:

I have a problem where I don't know how, or if, Pig is even suitable to solve it.

I have a schema like this:

    student-id,student-name,start-time,duration,course
    1,marco,1319708213,500,math
    2,ralf,1319708111,112,english
    3,greg,1319708321,333,french
    4,diva,1319708444,80,english
    5,susanne,1319708123,2000,math
    1,marco,1319708564,500,french
    2,ralf,1319708789,123,french
    7,fred,1319708213,5675,french
    8,laura,1319708233,123,math
    10,sab,1319708999,777,math
    11,fibo,1319708789,565,math
    6,dan,1319708456,50,english
    9,marco,1319708123,60,english
    12,bo,1319708456,345,math
    1,marco,1319708789,673,math
    ...
    ...

I would like to retrieve a graph (interpolation) over time, grouped by course: how many students are learning for a course, based on a 30 sec interval. The grouping by course is easy, but from there I have no clue how to achieve the rest. I guess it needs to be done via some UDF, or is there a way to do this in Pig? I often think that I need a "for loop" or something similar in Pig.

Thanks for your help!
-Marco