Re: extract Similar users from logs

Dan Brickley Mon, 04 Apr 2011 13:27:58 -0700

On 4 April 2011 18:17, jacob <[email protected]> wrote:
> I wrote a post on a similar problem with pig. Finding similarity between
> comic book characters ;)
>
> http://thedatachef.blogspot.com/2011/02/brute-force-graph-crunching-with-pig.html


:)

You're calling out to Ruby for Jaccard; might be worth trying to wire
up Mahout instead, since Pig's happy (happier?) invoking Java
methods...  
http://people.apache.org/~isabel/mahout_site/mahout-core/apidocs/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.html

Anyone tried something like that?
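
For reference, the coefficient itself is tiny. Below is a minimal Python
sketch of Jaccard (Tanimoto) similarity over two sets of URLs. It is not
Mahout's implementation, and the function name and sample data are made up
for illustration.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are given similarity 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical sample data, loosely following the log format quoted below.
user1 = {"http://someurl.com", "http://anotherlik.com"}
user3 = {"http://something.com", "http://someurl.com"}
print(jaccard(user1, user3))
```

Here the two users share one URL out of three distinct ones.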

Dan

> --jacob
> @thedatachef
>
> On Sun, 2011-04-03 at 20:49 +0000, Diallo Mamadou Bobo wrote:
>> Hi there.
>> As part of our start-up product, we need to compute a "similar users"
>> feature, and we've decided to go with Pig for it.
>> I've been learning Pig for a few days now and understand how it works.
>> To start, here is what the log file looks like:
>>
>> user          url                                             time
>> user1         http://someurl.com              1235416
>> user1         http://anotherlik.com           1255330
>> user2         http://someurl.com              1705012
>> user3         http://something.com            1705042
>> user3         http://someurl.com              1705042
>>
>> As the number of users and URLs can be huge, we can't use a brute-force
>> approach here, so first we need to find the users that have accessed at
>> least one common URL.
>>
>> The algorithm could be split up as below:
>>
>> 1. Find all users that have accessed some common URLs.
>> 2. Generate pair-wise combinations of all users for each resource accessed.
>> 3. For each pair and URL, compute the similarity of those users: the
>> similarity depends on the time interval between the accesses (so we need to
>> keep track of the time).
>> 4. For each pair, sum up the similarity over the URLs.
>>
>> Here is what I've written so far:
>>
>> A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:bytearray, url:bytearray, 
>> time:long);
>> grouped_pos = GROUP A BY ($1);
>>
>> I know it is not much yet, but now I don't know how to generate the pairs
>> or move further.
>> Any help would be appreciated.
>>
>> Thanks.
>
>
>
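
The four steps in the original question can be sketched in plain Python to
check the logic on a small sample before porting it to Pig. This is only an
illustration under stated assumptions: the thread never defines how the
similarity depends on the time interval, so time_similarity below is a
hypothetical placeholder, and the function names are made up.

```python
from collections import defaultdict
from itertools import combinations


def time_similarity(t1, t2):
    # Assumption: closer accesses score higher. The thread does not define
    # the actual formula, so this is a placeholder.
    return 1.0 / (1.0 + abs(t1 - t2))


def similar_users(records):
    """records: iterable of (user, url, time) tuples."""
    # Step 1: group accesses by URL.
    by_url = defaultdict(list)
    for user, url, time in records:
        by_url[url].append((user, time))

    # Steps 2-4: pair up the users of each URL, score each pair, and sum.
    scores = defaultdict(float)
    for accesses in by_url.values():
        # Sorting keeps each pair in a consistent (smaller, larger) order.
        for (u1, t1), (u2, t2) in combinations(sorted(accesses), 2):
            if u1 != u2:  # skip repeat visits by the same user
                scores[(u1, u2)] += time_similarity(t1, t2)
    return dict(scores)


# The sample rows from the quoted log.
LOG = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]
scores = similar_users(LOG)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

Only users that share a URL ever get paired, which avoids the all-pairs
brute force over the whole user base.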
