I would proceed by building an inverted list of the urls, with the (user, time) pairs as elements. Then, assuming there is not too much skew in the urls, use a UDF to compute the pairwise similarity within each url's list. I would also skip the top 1/K most popular urls to ease processing, since a url visited by n users generates on the order of n^2 pairs.
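To make the idea concrete, here is a rough sketch of the inverted-list approach in plain Python rather than Pig, using the sample rows from the original question. The function names, the `scale` parameter and the similarity formula are all hypothetical (the thread does not define how similarity depends on the time interval), and the skipping rule is just one reading of "top 1/K most popular urls".

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the original question: (user, url, time).
LOG = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]


def build_inverted_list(rows):
    """Map each url to the list of (user, time) visits."""
    inverted = defaultdict(list)
    for user, url, time in rows:
        inverted[url].append((user, time))
    return inverted


def time_similarity(t1, t2, scale=100000.0):
    """Hypothetical score: 1.0 for simultaneous visits, decaying as the gap grows."""
    return 1.0 / (1.0 + abs(t1 - t2) / scale)


def pairwise_scores(rows, k=None):
    """Sum the per-url similarity for every pair of users sharing a url.

    If k is given, the len(urls) // k most visited urls are skipped
    (an assumed interpretation of "top 1/K most popular urls").
    """
    inverted = build_inverted_list(rows)
    # Most popular urls first; ties broken by url so the result is deterministic.
    by_popularity = sorted(inverted, key=lambda u: (-len(inverted[u]), u))
    n_skip = len(by_popularity) // k if k else 0
    scores = defaultdict(float)
    for url in by_popularity[n_skip:]:
        # Sorting the visits keeps each pair key ordered as (smaller, larger).
        for (u1, t1), (u2, t2) in combinations(sorted(inverted[url]), 2):
            if u1 == u2:
                continue  # repeated visits by the same user are not a pair
            scores[(u1, u2)] += time_similarity(t1, t2)
    return dict(scores)


if __name__ == "__main__":
    for pair, score in sorted(pairwise_scores(LOG).items()):
        print(pair, round(score, 4))
```

In Pig, the grouping step would correspond to the GROUP BY url, and the inner loop is the part that would live in a UDF applied to each url's bag.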
Not sure Pig is the best candidate for this kind of job though.

--
Gianmarco De Francisci Morales

On Mon, Apr 4, 2011 at 18:33, Dan Brickley <[email protected]> wrote:

> On 4 April 2011 18:17, jacob <[email protected]> wrote:
> > I wrote a post on a similar problem with pig. Finding similarity between
> > comic book characters ;)
> >
> > http://thedatachef.blogspot.com/2011/02/brute-force-graph-crunching-with-pig.html
>
> :)
>
> You're calling out to Ruby for Jaccard; might be worth trying to wire
> up Mahout instead, since Pig's happy (happier?) invoking Java
> methods...
> http://people.apache.org/~isabel/mahout_site/mahout-core/apidocs/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.html
>
> Anyone tried something like that?
>
> Dan
>
> > --jacob
> > @thedatachef
> >
> > On Sun, 2011-04-03 at 20:49 +0000, Diallo Mamadou Bobo wrote:
> >> Hi There.
> >> We need, as part of our start-up product, to compute a "similar user" feature, and we've decided to go with pig for it.
> >> I've been learning pig for a few days now and understand how it works.
> >> So to start, here is what the log file looks like.
> >>
> >> user url time
> >> user1 http://someurl.com 1235416
> >> user1 http://anotherlik.com 1255330
> >> user2 http://someurl.com 1705012
> >> user3 http://something.com 1705042
> >> user3 http://someurl.com 1705042
> >>
> >> As the number of users and urls can be huge, we can't use a brute-force approach here, so first we need to find the users that have accessed at least one common url.
> >>
> >> The algorithm could be split as below:
> >>
> >> 1. Find all users that have accessed some common urls.
> >> 2. Generate pair-wise combinations of all users for each resource accessed.
> >> 3. For each pair and url, compute the similarity of those users: the similarity depends on the time interval between the accesses (so we need to keep track of the time).
> >> 4. Sum up the similarity for each pair-url.
> >>
> >> Here is what I've written so far:
> >>
> >> A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:bytearray, url:bytearray, time:long);
> >> grouped_pos = GROUP A BY ($1);
> >>
> >> I know it is not much yet, but now I don't know how to generate the pairs or move further.
> >> So any help would be appreciated.
> >>
> >> Thanks.
