I think the in-memory solution will work at that scale. You may have
to increase the heap to 4GB or more (that is, you may have to find a
large-ish machine). But yes, you can probably get perfectly good results by
sampling even a fraction of the input, which definitely fits in memory.
That's the place to start, I think.
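
As a rough illustration of the sampling idea discussed in this thread,
here is a minimal Python sketch (it is not Mahout code). It assumes
membership lines of the form "userID,groupID". The names keep_user and
sample_memberships, and the use of MD5 to bucket user IDs, are
illustrative assumptions only. The point it tries to show is sampling by
user rather than by record, so that every sampled user keeps all of
their group memberships.

```python
import hashlib


def keep_user(user_id: str, one_in: int = 256) -> bool:
    """Return True for roughly 1 in `one_in` users, deterministically.

    Hashing the user ID (an assumed scheme, not from the thread) means the
    same user is always either kept or dropped, across files and runs.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and bucket it.
    bucket = int.from_bytes(digest[:8], "big") % one_in
    return bucket == 0


def sample_memberships(lines, one_in: int = 256):
    """Yield only the 'userID,groupID' lines whose user is in the sample."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, _group_id = line.split(",", 1)
        if keep_user(user_id, one_in):
            yield line
```

The filtered lines could then be written to a smaller file and loaded
in memory as usual. Sampling whole users keeps the group co-occurrence
counts consistent, which sampling individual records would not.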

On Fri, Feb 18, 2011 at 6:41 PM, Radek Maciaszek
<[email protected]> wrote:
> Hi Sean,
>
> Thanks for so many ideas, I will look into these. Unfortunately the amount
> of data we are dealing with is quite substantial. There are about 1000+
> groups and about 40 million users to analyse. Moreover, the business need
> is to eventually have an even bigger number of groups. Each user can belong
> to many groups, so the number of combinations is rather big. In fact this
> number of combinations is so large that I am considering sampling the users
> and only analysing 1 in about 256 of them. So essentially I would have about
> 1000+ groups and about 150k users. Since one user can potentially belong to
> many dozens of groups, this will easily go into millions of records anyway,
> but perhaps it will be lower than the 100M mark you mentioned.
>
> Yesterday I wasn't sure if my existing cluster is big enough for this and
> now I'm tempted to try to do this on one machine. Nice one.
>
> Cheers,
> Radek
>
> On 18 February 2011 15:13, Sean Owen <[email protected]> wrote:
>
>> This looks like a simple collaborative filtering problem, or at least
>> can be solved that way. It's not even recommendation, just an item
>> similarity problem.
>>
>> Users are users and groups are items. You are just computing item-item
>> similarity based on some metric and there are several implemented in
>> the library.
>>
>> Forget Hadoop for now, as I doubt this is anywhere near the scale where
>> you need it. For a quick solution, make a file of "userID,groupID" entries
>> for every membership. Create a FileDataModel on top of it. Then
>> instantiate LogLikelihoodSimilarity on top of that, for example. It
>> will score the "similarity" between any two groups based on
>> membership. The result is between 0 and 1.
>>
>> On Thu, Feb 17, 2011 at 2:34 PM, Radek Maciaszek
>> <[email protected]> wrote:
>> > Hello,
>> >
>> > I have the following problem and I am trying to figure out if using Mahout
>> > is a good idea for this or perhaps there may be a much simpler approach.
>> >
>> > Consider I have users who can belong to many groups:
>> > user1: group1, group2
>> > user2: group2
>> > user3: group2, group3
>> > ... and millions more
>> >
>> > I am trying to find similarities between the groups (not the users). Some
>> > simple similarity metric (e.g. 0-1, close to 0 for not similar at all,
>> > close to 1 for very similar) would be ideal. So essentially I need to
>> > calculate such a metric for every pair of groups.
>> >
>> > Is it something Mahout can help me with?
>> >
>> > Many thanks,
>> > Radek
>> >
>>
>
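
To make the item-similarity idea from the thread concrete, here is a
small self-contained Python sketch of a log-likelihood-ratio score
between groups, computed from (user, group) membership pairs. It is an
illustration under stated assumptions, not the library's implementation:
the function names are hypothetical, and the mapping 1 - 1/(1 + LLR) used
to squeeze the score into the 0-1 range is an assumption here, not
necessarily what the library does internally.

```python
import math
from collections import defaultdict
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: users in both groups, k12: only in the first,
    k21: only in the second, k22: in neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb small negative rounding errors.
    return max(0.0, 2.0 * (row + col - mat))


def group_similarities(pairs):
    """Map each pair of groups to a score in [0, 1).

    `pairs` is an iterable of (user_id, group_id) tuples.
    """
    members = defaultdict(set)
    for user, group in pairs:
        members[group].add(user)
    # Only users that appear in at least one group are counted.
    n_users = len(set().union(*members.values()))
    scores = {}
    for a, b in combinations(sorted(members), 2):
        both = len(members[a] & members[b])
        only_a = len(members[a]) - both
        only_b = len(members[b]) - both
        neither = n_users - both - only_a - only_b
        llr = log_likelihood_ratio(both, only_a, only_b, neither)
        # Assumed squashing of the unbounded ratio into [0, 1).
        scores[(a, b)] = 1.0 - 1.0 / (1.0 + llr)
    return scores
```

Note that the ratio measures how far the overlap departs from what
chance would predict, so two groups that overlap much less than
expected can also receive a high score. With 1000+ groups the loop
covers roughly half a million pairs, which is small enough to run on
one machine.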
