I wrote a post on a similar problem with Pig: finding similarity between
comic book characters ;)

http://thedatachef.blogspot.com/2011/02/brute-force-graph-crunching-with-pig.html

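For the pair-generation step in the question below, one option is to join
the log with itself on url, which yields every pair of users per url, and
then score and sum. This is only a minimal sketch, assuming the three-column
tab-separated log shown below. The similarity formula 1 / (1 + |t1 - t2|),
the alias names, and the output path 'similar_users' are placeholders chosen
for illustration, not part of the original question.

```pig
-- Load the access log; chararray makes the uid comparison below well defined.
A = LOAD 'logs.txt' USING PigStorage('\t')
    AS (uid:chararray, url:chararray, time:long);

-- A second alias over the same data, so the relation can be joined with itself.
B = FOREACH A GENERATE uid, url, time;

-- Self-join on url: one row per (user, user) combination for each shared url.
J = JOIN A BY url, B BY url;

-- Keep each unordered pair once and drop a user paired with itself.
P = FILTER J BY A::uid < B::uid;

-- Hypothetical similarity: closer access times give a higher score.
S = FOREACH P GENERATE
        A::uid AS u1,
        B::uid AS u2,
        A::url AS url,
        1.0 / (1.0 + (double)ABS(A::time - B::time)) AS sim;

-- Sum the similarity for each (pair, url), as in step 4 of the question.
G = GROUP S BY (u1, u2, url);
R = FOREACH G GENERATE FLATTEN(group) AS (u1, u2, url), SUM(S.sim) AS similarity;

STORE R INTO 'similar_users';
```

The join only ever pairs users who share a url, so users with nothing in
common are never compared. A very popular url still produces a number of
pairs that is quadratic in its visitors, so that is the place to watch.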
--jacob
@thedatachef

On Sun, 2011-04-03 at 20:49 +0000, Diallo Mamadou Bobo wrote:
> Hi There.
> As part of our start-up product, we need to compute a "similar users"
> feature, and we've decided to go with Pig for it.
> I've been learning Pig for a few days now and understand how it works.
> So, to start, here is what the log file looks like.
> 
> user          url                             time
> user1         http://someurl.com              1235416
> user1         http://anotherlik.com           1255330
> user2         http://someurl.com              1705012
> user3         http://something.com            1705042
> user3         http://someurl.com              1705042
> 
> As the number of users and urls can be huge, we can't use a brute-force
> approach here, so first we need to find the users that have accessed at
> least one common url.
> 
> The algorithm could be split up as below:
> 
> 1. Find all users that have accessed some common urls.
> 2. Generate the pair-wise combinations of all users for each url accessed.
> 3. For each pair and url, compute the similarity of those users: the
> similarity depends on the time interval between the accesses (so we need to
> keep track of the time).
> 4. Sum up the similarity for each pair-url.
> 
> Here is what I've written so far:
> 
> A = LOAD 'logs.txt' USING PigStorage('\t')
>     AS (uid:bytearray, url:bytearray, time:long);
> grouped_pos = GROUP A BY url;
> 
> I know it is not much yet, but right now I don't know how to generate the
> pairs or how to move further.
> So any help would be appreciated.
> 
> Thanks.

