Thank you so much for the suggestions. It took me some time to figure things out, but I believe I have a pretty good grip on what needs to be done now. My dataset is small enough to fit on a single machine, so I am going to use an in-memory implementation rather than Hadoop. As suggested by both Pat and Manuel, I have a table (in the file system) with neighborhoods as rows and amenities as columns. At runtime, I will load only the columns (amenities) that correspond to a selected user and do a UserSimilarity operation between each neighborhood and the one the user resides in. After that, I can use NearestNUserNeighborhood to pick the results.
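A minimal sketch of that 1 x N comparison, written in plain Java rather than against Mahout's classes. It assumes cosine similarity as the measure (Mahout offers several UserSimilarity implementations, and this is not necessarily the one to pick). The class name `NeighborhoodSimilarity`, its method names, and the "Uptown" row are hypothetical and only there for illustration; the Downtown and Midtown counts come from the example table in the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch (no Mahout classes): compare one neighborhood against all others. */
final class NeighborhoodSimilarity {

    /** Cosine similarity computed over the selected amenity columns only. */
    static double cosine(double[] a, double[] b, int[] columns) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int c : columns) {
            dot += a[c] * b[c];
            normA += a[c] * a[c];
            normB += b[c] * b[c];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a row with no counts in the selected columns is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to n neighborhoods most similar to home, best first (1 x N comparisons). */
    static List<String> nearest(Map<String, double[]> table, String home, int[] columns, int n) {
        double[] homeRow = table.get(home);
        if (homeRow == null) {
            throw new IllegalArgumentException("unknown neighborhood: " + home);
        }
        Map<String, Double> scores = new HashMap<>();
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, double[]> e : table.entrySet()) {
            if (e.getKey().equals(home)) {
                continue; // skip the user's own neighborhood
            }
            names.add(e.getKey());
            scores.put(e.getKey(), cosine(homeRow, e.getValue(), columns));
        }
        // Sort candidate names by descending similarity score.
        names.sort(Comparator.comparingDouble((String k) -> scores.get(k)).reversed());
        return new ArrayList<>(names.subList(0, Math.min(n, names.size())));
    }

    public static void main(String[] args) {
        // Columns: 0 = Gym, 1 = Cafe, 2 = Bookstore
        Map<String, double[]> table = new LinkedHashMap<>();
        table.put("Downtown", new double[] {15, 50, 0});
        table.put("Midtown", new double[] {30, 100, 10});
        table.put("Uptown", new double[] {40, 5, 0}); // hypothetical row, not from the thread
        int[] selected = {0, 1}; // only the amenities relevant to the selected user
        System.out.println(nearest(table, "Downtown", selected, 2));
    }
}
```

Only the user's own row is compared against the others, so the work grows with N rather than N x (N-1). Note that cosine similarity ignores the overall size of the counts, so whether it suits this data depends on how the counts are normalized.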
I gather UserSimilarity is the in-memory equivalent of RowSimilarity (Hadoop)? It would be great if someone could confirm it!

Thanks again Pat and Manuel!

Edith

On Wed, Jul 2, 2014 at 4:06 PM, Pat Ferrel wrote:

> If you are looking to recommend a similar neighborhood based on the
> characteristics of some other neighborhood (the user’s current one), you
> wouldn’t use collaborative filtering. This is a metadata recommender based
> on similarity of neighborhoods, not a collection of user preferences.
>
> The easiest and fastest would be to use a search engine but I’ll leave
> that for now since it doesn’t account for feature weights as well.
>
> Create a table like this:
>
> Neighborhood  Gym  Cafe  Bookstore
> Downtown       15    50          0
> Midtown        30   100         10
> …
>
> You will need to convert the row IDs into sequential ints, which Mahout
> uses for IDs. Then read them into a sequenceFile creating a Distributed Row
> Matrix, which has Key - Value pairs. Keys = the integer neighborhood IDs,
> the Value is a Vector (a sort of list) of column integer IDs with the
> counts.
>
> Then run rowsimilarity on the DRM. This is the CLI but there is also a
> Driver you can call from your code.
>
> There are some data prep issues you will have since larger neighborhoods
> will have higher counts. An easy thing to do would be to normalize the
> counts by something like population or physical size so you get cafes per
> resident or per sq mile or some other ratio.
>
> The result of the rowsimilarity job will be another DRM of key =
> neighborhood ID, values = Vector of similar neighborhoods (by integer ID)
> with a strength of similarity. Sort the vector by strength and you’ll have
> an ordered list of similar neighborhoods for each neighborhood.
>
> On Jun 30, 2014, at 12:48 PM, Edith Au wrote:
>
> Hi,
>
> I am a newbie and am looking for some guidance to implement my
> recommender. Any help would be greatly appreciated. I have a small
> data set of location information with the following fields:
> neighborhood, amenities, and counts. For example:
>
> Downtown Gym 15
> Downtown Cafe 50
> …
> Midtown Gym 30
> Midtown Cafe 100
> Midtown Bookstore 10
> ...
> Financial Dist
> …
>
> so on and so forth. I want to recommend a neighborhood for a user to
> reside in based on the amenities (and some other metrics) in his/her
> current neighborhood. My understanding is that model-based
> recommendation would be a good fit for the job. If I am on the right
> track, is there an experimental/beta recommender I can try?
>
> If there is no such recommender yet, can I still use Mahout for my
> project? For example, can I implement my own Similarity which only
> computes the similarity between one user's preference and a set of
> neighborhoods? If I understand Mahout correctly, User/Item Similarity
> would do N x (N-1) pairs of comparisons as opposed to 1 x N comparisons.
> In my example, User/Item Similarity would compare between Downtown,
> Midtown, Fin Dist -- which would be a waste of computation resources
> since the comparisons are not needed.
>
> Thanks in advance for your help.
>
> Edith
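
The two data-prep steps Pat describes (sequential integer IDs for rows, and normalizing counts so larger neighborhoods do not dominate) could be sketched roughly as follows. This is plain Java, not Mahout's API; the class name `NeighborhoodPrep`, its method names, and the population figure are hypothetical and used only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java sketch of the data prep: sequential int IDs and per-capita counts. */
final class NeighborhoodPrep {

    /** Assigns sequential int IDs (0, 1, 2, ...) to names in first-seen order. */
    static Map<String, Integer> assignIds(Iterable<String> names) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String name : names) {
            // The current size is the next unused ID; repeated names keep their first ID.
            ids.putIfAbsent(name, ids.size());
        }
        return ids;
    }

    /** Divides each raw count by the population, e.g. cafes per resident. */
    static double[] perCapita(double[] counts, double population) {
        if (population <= 0) {
            throw new IllegalArgumentException("population must be positive");
        }
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            out[i] = counts[i] / population;
        }
        return out;
    }

    public static void main(String[] args) {
        // Neighborhood names as they appear in the (neighborhood, amenity, count) rows.
        Map<String, Integer> ids = assignIds(Arrays.asList("Downtown", "Downtown", "Midtown"));
        System.out.println(ids);

        // Gym, Cafe, Bookstore counts for Downtown; the population of 5000 is made up.
        System.out.println(Arrays.toString(perCapita(new double[] {15, 50, 0}, 5000)));
    }
}
```

Dividing by physical size instead of population would work the same way; which ratio is appropriate depends on the data available.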
