I wrote a post on a similar problem with Pig: finding similarity between comic book characters ;)
http://thedatachef.blogspot.com/2011/02/brute-force-graph-crunching-with-pig.html

--jacob
@thedatachef

On Sun, 2011-04-03 at 20:49 +0000, Diallo Mamadou Bobo wrote:
> Hi there.
> As part of our start-up product, we need to compute a "similar user" feature,
> and we've decided to go with Pig for it.
> I've been learning Pig for a few days now and understand how it works.
> To start, here is what the log file looks like:
>
> user    url                     time
> user1   http://someurl.com      1235416
> user1   http://anotherlik.com   1255330
> user2   http://someurl.com      1705012
> user3   http://something.com    1705042
> user3   http://someurl.com      1705042
>
> As the number of users and URLs can be huge, we can't use a brute-force
> approach here, so first we need to find the users that have accessed at
> least one common URL.
>
> The algorithm could be split as below:
>
> 1. Find all users that have accessed some common URLs.
> 2. Generate pair-wise combinations of all users for each resource accessed.
> 3. For each pair and URL, compute the similarity of those users: the
>    similarity depends on the time interval between the accesses (so we need
>    to keep track of the time).
> 4. Sum up the similarity for each pair-URL.
>
> Here is what I've written so far:
>
> A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:bytearray, url:bytearray,
> time:long);
> grouped_pos = GROUP A BY ($1);
>
> I know it is not much yet, but now I don't know how to generate the pairs or
> move further.
> So any help would be appreciated.
>
> Thanks.
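
A minimal sketch of how steps 1 to 4 from the quoted message might look in Pig Latin, for illustration only. It is not the code from the linked post. Assumptions: uid and url are loaded as chararray rather than bytearray, pairs are formed by joining two projections of the log on the URL, and the similarity 1 / (1 + |t1 - t2|) is a made-up placeholder, since the quoted message does not define one. It also reads step 4 as a total per user pair across all shared URLs. All alias and field names other than uid, url and time (logs, visits_a, visits_b, u1, u2, t1, t2, sim, total_sim) are hypothetical.

```pig
-- Load the tab-separated access log (same file and delimiter as in the question).
logs = LOAD 'logs.txt' USING PigStorage('\t')
       AS (uid:chararray, url:chararray, time:long);

-- Two projections of the same relation with distinct field names, so the
-- self-join below has no ambiguous columns.
visits_a = FOREACH logs GENERATE uid AS u1, url AS url_a, time AS t1;
visits_b = FOREACH logs GENERATE uid AS u2, url AS url_b, time AS t2;

-- Steps 1 and 2: joining on the URL yields one row per pair of accesses to the
-- same URL. Users with no URL in common never meet, so this is not a cross
-- product of all users.
joined = JOIN visits_a BY url_a, visits_b BY url_b;

-- Keep each unordered user pair once and drop self-pairs.
pairs = FILTER joined BY u1 < u2;

-- Step 3: placeholder similarity (an assumption, not from the original message):
-- close to 1 when the two accesses are near in time, close to 0 when far apart.
scored = FOREACH pairs GENERATE u1, u2, url_a AS url,
         1.0 / (1.0 + (double)ABS(t1 - t2)) AS sim;

-- Step 4: sum the per-URL similarities for each user pair.
by_pair = GROUP scored BY (u1, u2);
similarity = FOREACH by_pair GENERATE FLATTEN(group) AS (u1, u2),
             SUM(scored.sim) AS total_sim;

DUMP similarity;
```

On the five sample rows from the quoted message, only http://someurl.com is shared, so the pairs produced should be (user1,user2), (user1,user3) and (user2,user3). One caveat of this join-based approach: a URL visited by n users produces on the order of n^2 pair rows, so very popular URLs may need to be capped or filtered before the join.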
