There is not enough information to help you yet. What is the nature of your data? Does it arrive daily or monthly? Is the 30 days a sliding window or a calendar month?
IMHO the approach should be:

1. When data arrives, find the most recent activity for each user.
2. Store the output to /recent-activity/yyyy/mm/dd/hh
3. When the next batch arrives, read it together with the previously found recent activities and write the output to /recent-activity/yyyy/mm/dd/hh+1

That way you always track the most recent event of each user.

Provide more details and we can think about how to solve your problem. Right now there are more questions than answers.

On 14.08.2013 0:33, "Mike Sukmanowsky" <[email protected]> wrote:

> Hi all,
>
> Trying to produce some data using clickstream logs from Pig that does the
> following:
>
> 1. Pull data for the past 30 days (current period)
> 2. Classify Group A as users who had activity in the current period but
> not 30 days prior to the current period.
> 3. Classify Group B effectively as all {users in current period} -
> {Group A}
>
> To make the example concrete, let's say end date is July 30, 2013.
>
> So Group A users = anyone who had activity from Jul 1 - Jul 30, 2013 but
> did not have activity in Jun 1 - Jun 30.
> Group B users = anyone who had activity from Jul 1 - Jul 30, 2013
> and also had activity in Jun 1 - Jun 30.
>
> I've had some initial thoughts for how to approach this but none of them
> seem great. Any thoughts from the group?
>
> Mike
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY 10018
> p: +1 (416) 953-4248
> e: [email protected]
>
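
To make the suggestion at the top of this reply concrete, here is a minimal Python sketch of the logic (not Pig, and not a tested production job). It assumes events are simple (user_id, day) pairs and that the "recent activity" snapshot is a per-user map of the latest day seen. The names update_last_seen and classify, and the sample users, are hypothetical and only for illustration. The idea: if you keep the latest activity date per user as of the start of the current period, a user active in the current period belongs to Group B when that date falls inside the prior 30 days, and to Group A otherwise.

```python
from datetime import date, timedelta


def update_last_seen(last_seen, events):
    """Merge a batch of (user_id, day) events into a per-user latest-day map."""
    merged = dict(last_seen)  # copy so the previous snapshot is left untouched
    for user, day in events:
        if user not in merged or day > merged[user]:
            merged[user] = day
    return merged


def classify(last_seen_before, current_events, period_start, window_days=30):
    """Split users active in the current period into Group A and Group B.

    last_seen_before: latest activity per user strictly before period_start.
    current_events:   (user_id, day) pairs, assumed already filtered to the
                      current period.
    """
    prior_start = period_start - timedelta(days=window_days)
    group_a, group_b = set(), set()
    for user, _day in current_events:
        prev = last_seen_before.get(user)
        if prev is not None and prior_start <= prev < period_start:
            group_b.add(user)  # also active in the prior window
        else:
            group_a.add(user)  # no activity in the prior window
    return group_a, group_b


# Hypothetical sample data matching the Jul 1 - Jul 30, 2013 example.
history = update_last_seen({}, [
    ("alice", date(2013, 6, 15)),  # active in the prior window
    ("bob", date(2013, 5, 2)),     # last seen before the prior window
])
current = [
    ("alice", date(2013, 7, 3)),
    ("bob", date(2013, 7, 10)),
    ("carol", date(2013, 7, 20)),  # never seen before
]
group_a, group_b = classify(history, current, period_start=date(2013, 7, 1))
print(sorted(group_a), sorted(group_b))
```

With this sample data, bob and carol land in Group A and alice in Group B. In Pig the same shape would likely be a join or cogroup of the current period's users against the stored snapshot, but that depends on the answers to the questions above.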
