Thanks for the suggestion, Pat. With the search engine approach, I would
imagine I could index something like

neighborhoodID<tab>bookstore,30 cafe,50 pet-store,10 pet-park,3

and then do a "LIKE" query to pick up the right docIDs and parse each doc for
its weights. But once I have a collection of neighborhoodIDs and weights, I
still need some way to compare similarity (between the user's existing
neighborhood and the search results). So am I back to Mahout (or some other
math package I can use to find the strength)?

Thanks for the book recommendation. I finished "Mahout in Action" last week,
but I am sure the book is pretty out of date by now.

On Thu, Jul 10, 2014 at 8:01 PM, Pat Ferrel <[email protected]> wrote:

> You need hadoop installed on the machine you run on but don’t need HDFS or
> a cluster. This is called local mode where you set MAHOUT_LOCAL=true and
> use the local file system.
>
> If you want to customize the query at runtime I suggest a search engine.
> Using rowsimilarity you can only train in batch and can only pre-calculate
> recs. If you index the neighborhoods by feature you can construct a query
> at runtime and get fast results. So you can say that the user wants pets
> (even though their current property doesn’t allow them). This would be
> something new, not related to their neighborhood. It is easily added to a
> search query. Weights are not so easy using a search engine but may not
> matter. Imagine indexing
>
> neighborhoodID<tab>bookstore cafe pet-store pet-park
>
> neighborhoodID is the docID, the rest is the document (space delimited set
> of tokens). Index this with Solr or something else. Then at runtime do a
> fulltext query with “bookstore pet-store pet-park” or however you want to
> build the query. This is actually the way we are thinking about the next
> gen of recommender: using a search engine even for collaborative filtering
> data.
>
> Look for a book by Ted Dunning called “Practical Machine Learning”, which
> talks about this approach.
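
The comparison step asked about at the top of this message (scoring search
results against the user's current neighborhood) does not necessarily need
Mahout. Below is a minimal sketch, assuming cosine similarity is an acceptable
measure of "strength" and that documents use the "amenity,weight" token format
shown above. The function names (parse_doc, cosine, rank) and the ignore
parameter are hypothetical and are not part of Mahout or Solr.

```python
import math


def parse_doc(line):
    """Parse 'neighborhoodID<tab>bookstore,30 cafe,50' into (id, {amenity: weight})."""
    doc_id, body = line.rstrip("\n").split("\t", 1)
    weights = {}
    for token in body.split():
        amenity, weight = token.rsplit(",", 1)
        weights[amenity] = float(weight)
    return doc_id, weights


def cosine(a, b, ignore=frozenset()):
    """Cosine similarity of two sparse weight dicts, skipping ignored amenities."""
    keys = (set(a) | set(b)) - set(ignore)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(a.get(k, 0.0) ** 2 for k in keys))
    norm_b = math.sqrt(sum(b.get(k, 0.0) ** 2 for k in keys))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no overlap information to compare
    return dot / (norm_a * norm_b)


def rank(current, candidates, ignore=frozenset()):
    """Score each (doc_id, weights) candidate against `current`, best first."""
    scored = [(doc_id, cosine(current, w, ignore)) for doc_id, w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

This is a 1 x N comparison: only the user's neighborhood is compared with each
search hit. Passing something like ignore={"school", "day-care"} is one
possible way to keep amenities the user does not care about from swaying the
score.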
>
> On Jul 10, 2014, at 5:07 PM, Edith Au <[email protected]> wrote:
>
> I was under the impression that I can only run RowSimilarityJob with
> Hadoop. I will take a look at that. Thx!
>
> I see your point about retraining the data over and over. But there are
> a couple of other requirements I left out of my original post.
>
> 1. There are amenities in a user's existing neighborhood that she does not
> care about. For example, if she does not have any children, schools and
> day-care centers in her existing neighborhood (or suggested neighborhoods)
> should not sway the final score.
>
> 2. My next feature is to inject customization. For example, she may want
> to move to a pet-friendly area because her current neighborhood does not
> have enough facilities (e.g. vets, pet stores, pet-friendly parks) for her
> pet.
>
> If I pre-calculate the row matrix of similar neighborhoods, I am not sure
> how I can implement the customization (by adding or removing amenity
> requirements at runtime). Any thoughts on that?
>
> Thanks for the reminder on mahout FastID. It could easily be a newbie
> mistake to use a regular int or long for mapping.
>
> Thanks again for your help. Much appreciated!
>
> On Thu, Jul 10, 2014 at 1:40 PM, Pat Ferrel <[email protected]> wrote:
>
> > Doing things this way you are using the neighborhoodID as a proxy for
> > userID/rowID in the recommender. I don’t see the benefit of the in-memory
> > version here since all output can easily be pre-calculated. Then it will
> > only be a lookup at runtime. You can use “rowsimilarity” on a single
> > machine without setting up a cluster, just use the local filesystem. This
> > is the way I’d do it.
> >
> > You definitely don’t want to “only load the columns (amenities) correlated
> > to a selected user” with an in-memory recommender. This loading of data
> > will trigger a retraining of the recommender before you can ask it for
> > similar neighborhoods and that will take more time than you want. This
> > potentially will happen as each new user visits your app.
> >
> > However if you train on all amenities for all neighborhoods then the
> > in-memory recommender should work and would train only once. Your data
> > would look like: (neighborhoodID, cafesID, numberOfCafes) and so on for
> > every non-zero cell in the table. And remember that ALL IDs must be
> > Mahout IDs; you can’t use your own IDs. Mahout IDs correspond to matrix
> > coordinates, they are ordinal Ints. Think of them as the row and column
> > number of the table.
> >
> > On Jul 10, 2014, at 10:45 AM, Edith Au <[email protected]> wrote:
> >
> > Thank you so much for the suggestions. It took me some time to figure
> > things out but I believe I have a pretty good grip on what needs to be
> > done now. My dataset is small enough to fit on a single machine so I am
> > going to use an in-memory implementation rather than hadoop. As suggested
> > by both Pat and Manuel, I have a table (in the file system) with
> > neighborhoods as rows and amenities as columns. At runtime, I will only
> > load the columns (amenities) correlated to a selected user and do a
> > UserSimilarity operation between each neighborhood and the one the user
> > resides in. After that, I can pick up the NearestNUserNeighborhoods for
> > results.
> >
> > I gather UserSimilarity is the in-memory equivalent of RowSimilarity
> > (Hadoop)? It would be great if someone could confirm it!
> >
> > Thanks again Pat and Manuel!
> > Edith
> >
> > On Wed, Jul 2, 2014 at 4:06 PM, Pat Ferrel <[email protected]> wrote:
> >
> >> If you are looking to recommend a similar neighborhood based on the
> >> characteristics of some other neighborhood (the user’s current one), you
> >> wouldn’t use collaborative filtering. This is a metadata recommender
> >> based on similarity of neighborhoods, not a collection of user
> >> preferences.
> >>
> >> The easiest and fastest would be to use a search engine but I’ll leave
> >> that for now since it doesn’t account for feature weights as well.
> >>
> >> create a table like this:
> >> Neighborhood  Gym  Cafe  Bookstore
> >> Downtown       15    50          0
> >> Midtown        30   100         10
> >> …
> >>
> >> You will need to convert the row IDs into sequential ints, which Mahout
> >> uses for IDs. Then read them into a sequenceFile creating a Distributed
> >> Row Matrix, which has Key - Value pairs. Keys = the integer neighborhood
> >> IDs, the Value is a Vector (a sort of list) of column integer IDs with
> >> the counts.
> >>
> >> Then run rowsimilarity on the DRM. This is the CLI but there is also a
> >> Driver you can call from your code.
> >>
> >> There are some data prep issues you will have since larger neighborhoods
> >> will have higher counts. An easy thing to do would be to normalize the
> >> counts by something like population or physical size so you get cafes
> >> per resident or per sq mile or some other ratio.
> >>
> >> The result of the rowsimilarity job will be another DRM of key =
> >> neighborhood ID, values = Vector of similar neighborhoods (by integer
> >> ID) with a strength of similarity. Sort the vector by strength and
> >> you’ll have an ordered list of similar neighborhoods for each
> >> neighborhood.
> >>
> >> On Jun 30, 2014, at 12:48 PM, Edith Au <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I am a newbie and am looking for some guidance to implement my
> >> recommender. Any help would be greatly appreciated. I have a small
> >> data set of location information with the following fields:
> >> neighborhood, amenities, and counts. For example:
> >>
> >> Downtown Gym 15
> >> Downtown Cafe 50
> >> …
> >> Midtown Gym 30
> >> Midtown Cafe 100
> >> Midtown Bookstore 10
> >> ...
> >> Financial Dist
> >> …
> >>
> >> so on and so forth. I want to recommend a neighborhood for a user to
> >> reside in based on the amenities (and some other metrics) in his/her
> >> current neighborhood. My understanding is that model-based
> >> recommendation would be a good fit for the job. If I am on the right
> >> track, is there an experimental/beta recommender I can try?
> >>
> >> If there is no such recommender yet, can I still use Mahout for my
> >> project? For example, can I implement my own Similarity which only
> >> computes the similarity between one user's preferences and a set of
> >> neighborhoods? If I understand Mahout correctly, User/Item Similarity
> >> would do N x (N-1) pairwise comparisons as opposed to 1 x N comparisons.
> >> In my example, User/Item Similarity would compare Downtown, Midtown, and
> >> Fin Dist with one another, which would be a waste of computation
> >> resources since those comparisons are not needed.
> >>
> >> Thanks in advance for your help.
> >>
> >> Edith
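
The data preparation discussed in this thread (sequential integer IDs, one
entry per non-zero cell, counts normalized by neighborhood size) can be
sketched in plain Python. This is only an illustration of the layout, not
Mahout's API: the function names are hypothetical, the population figures are
made up for the example, and writing the result to a sequenceFile is left out.

```python
def index_ids(names):
    """Assign sequential integer IDs (0, 1, 2, ...) in first-seen order."""
    ids = {}
    for name in names:
        if name not in ids:
            ids[name] = len(ids)
    return ids


def to_triples(rows, population):
    """Turn (neighborhood, amenity, count) rows into integer-keyed triples.

    Returns (row_ids, col_ids, triples), where each triple is
    (rowID, columnID, count per resident) and zero counts are skipped.
    """
    row_ids = index_ids(n for n, _, _ in rows)
    col_ids = index_ids(a for _, a, _ in rows)
    triples = []
    for neighborhood, amenity, count in rows:
        if count == 0:
            continue  # keep the matrix sparse: only non-zero cells
        ratio = count / population[neighborhood]
        triples.append((row_ids[neighborhood], col_ids[amenity], ratio))
    return row_ids, col_ids, triples


# Rows taken from the example in the thread; populations are assumed values.
rows = [
    ("Downtown", "Gym", 15),
    ("Downtown", "Cafe", 50),
    ("Midtown", "Gym", 30),
    ("Midtown", "Cafe", 100),
    ("Midtown", "Bookstore", 10),
]
population = {"Downtown": 1000, "Midtown": 2000}
row_ids, col_ids, triples = to_triples(rows, population)
```

Keeping row_ids and col_ids around matters because the similarity output is
keyed by integer ID and has to be mapped back to neighborhood names.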
