On 24.05.2011 11:39, Sean Owen wrote:
On Tue, May 24, 2011 at 10:17 AM, Uwe Reimann <[email protected]> wrote:
Since the user provides new preferences at a high rate, I expect to change
the neighborhood of an individual user rapidly. Using CachingUserSimilarity
or CachingUserNeighborhood probably won't work here. Using a
ClusteringRecommender seems to be an option here, in order to search against
some clusters instead of against many users. The clusters would be recalculated
periodically in the background.
(You can have the cache clear just entries for the current user.)
Neighborhoods ought to be stable-ish. I would not expect that one new
data point would significantly change who your most similar users are.
That probably depends on how many data points were available before. I
suspect, e.g., the 5th data point has a greater impact than the 105th. Is
there a lower limit (above 1) on the number of data points a user must have
before recommendations make sense?
So you can probably get away with periodically recomputing these,
perhaps frequently, but not necessarily at every update.
I could trigger the recalculation if the knowledge about the current user
has changed by, say, 25%. That way the recomputation rate would decrease
over time.
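A minimal sketch of that trigger (the class and method names are illustrative, not Mahout API): recompute once the user's preference count has drifted past the threshold since the last recomputation. Because the threshold is relative, the absolute number of new preferences needed grows with the profile, so the recomputation rate drops naturally.

```java
// Hypothetical helper for deciding when to rebuild a user's neighborhood.
class NeighborhoodRefreshPolicy {

  private final double threshold; // e.g. 0.25 for the ~25% change suggested above
  private int prefCountAtLastRecompute;

  NeighborhoodRefreshPolicy(double threshold, int initialPrefCount) {
    this.threshold = threshold;
    this.prefCountAtLastRecompute = Math.max(1, initialPrefCount);
  }

  /** True when the user's preference count has changed by at least the threshold. */
  boolean shouldRecompute(int currentPrefCount) {
    double change = Math.abs(currentPrefCount - prefCountAtLastRecompute)
        / (double) prefCountAtLastRecompute;
    if (change >= threshold) {
      prefCountAtLastRecompute = currentPrefCount; // reset the baseline
      return true;
    }
    return false;
  }
}
```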
You do need to use the latest preferences in recommendation, of
course, but that's separate from calculating a neighborhood.
Dislikes should be considered during similarity search. I'd like to express
those as negative preference values. PearsonCorrelationSimilarity should be
ok with that, right?
Yes.
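For illustration, a self-contained sketch of the standard centered Pearson formula in plain Java: dislikes expressed as negative values enter the calculation like any other preference, so two users who dislike the same items come out as similar, and opposite tastes come out negatively correlated. (Mahout's PearsonCorrelationSimilarity computes this over the items both users have rated.)

```java
class Pearson {

  /** Pearson correlation over co-rated items; dislikes enter as negative values. */
  static double correlation(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0;
    for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; }
    double mx = sx / n, my = sy / n; // per-user means, for centering

    double num = 0, dx = 0, dy = 0;
    for (int i = 0; i < n; i++) {
      num += (x[i] - mx) * (y[i] - my);
      dx += (x[i] - mx) * (x[i] - mx);
      dy += (y[i] - my) * (y[i] - my);
    }
    return num / Math.sqrt(dx * dy); // NaN when either user has zero variance
  }
}
```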
Since I expect to have very low overlap in items between (especially new)
users, I'd like to take the item's category into account during similarity
search. User u1, who likes item i1 of category c1, should get item i2 of
category c1 recommended if user u2 likes it. Both users would then have a
preference value for category c1 in common. This should be possible by
simply providing the calculated preference values for the category items.
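One way to sketch that idea (the offset and method names are arbitrary, illustrative choices): map each category to a pseudo-item ID and derive the user's category preference as the average of their item preferences in that category, so two users with no items in common can still overlap on categories.

```java
import java.util.HashMap;
import java.util.Map;

class CategoryPrefs {

  // Keeps hypothetical category IDs from colliding with real item IDs.
  static final long CATEGORY_ID_OFFSET = 1_000_000_000L;

  /**
   * itemPrefs: itemID -> preference value; itemCategory: itemID -> categoryID.
   * Returns the user's prefs plus one averaged pseudo-pref per category.
   */
  static Map<Long, Double> withCategoryPrefs(Map<Long, Double> itemPrefs,
                                             Map<Long, Long> itemCategory) {
    Map<Long, Double> out = new HashMap<>(itemPrefs);
    Map<Long, double[]> sums = new HashMap<>(); // categoryID -> {sum, count}
    for (Map.Entry<Long, Double> e : itemPrefs.entrySet()) {
      Long cat = itemCategory.get(e.getKey());
      if (cat == null) continue;
      double[] s = sums.computeIfAbsent(cat, k -> new double[2]);
      s[0] += e.getValue();
      s[1] += 1;
    }
    for (Map.Entry<Long, double[]> e : sums.entrySet()) {
      out.put(CATEGORY_ID_OFFSET + e.getKey(), e.getValue()[0] / e.getValue()[1]);
    }
    return out;
  }
}
```

A user who likes one item and dislikes another in the same category ends up with a neutral (0.0) preference for that category's pseudo-item.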
You are describing more of an item-based recommender and indeed I
think that could be better here since it avoids cold-start problems
better. (I prefer it as well.) You might look at
GenericItemBasedRecommender and ItemSimilarity instead.
I did some testing of the different recommenders on a real data set from
a bookmarking site. GenericBooleanPrefItemBasedRecommender did not work
very well for me. It seemed to recommend the top links. Using
GenericUserBasedRecommender worked much better (after some tweaking); it
recommended links that actually fit my interests. I might need to do some
more testing here.
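The core of the item-based approach mentioned above can be sketched in a few lines, independent of the Mahout classes (the helper names are illustrative): a candidate item is scored by the similarity-weighted average of the user's existing preferences, which is the idea behind GenericItemBasedRecommender.

```java
import java.util.Map;

class ItemBasedScorer {

  /**
   * userPrefs: itemID -> preference value of the current user.
   * simToCandidate: itemID -> similarity between that item and the candidate.
   * Returns the similarity-weighted average preference, or NaN if no overlap.
   */
  static double score(Map<Long, Double> userPrefs, Map<Long, Double> simToCandidate) {
    double num = 0, den = 0;
    for (Map.Entry<Long, Double> e : userPrefs.entrySet()) {
      Double sim = simToCandidate.get(e.getKey());
      if (sim == null) continue; // candidate has no similarity to this item
      num += sim * e.getValue();
      den += Math.abs(sim);
    }
    return den == 0 ? Double.NaN : num / den;
  }
}
```

Note how a dislike (negative preference) of an item that is similar to the candidate pulls the candidate's score down, which is why dislikes are worth keeping in the model.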
Your thinking about using Lucene almost surely also applies to
item-item similarity.
I think I need to provide different DataModels to the different stages of
recommendation calculation: 1) one which includes likes and dislike for
items and categories for similarity search, 2) one which includes just the
liked items to pick the recommendations from and 3) one which includes all
items of a user (liked, disliked and skipped ones) for filtering out the
user's items using an IDRescorer.
I think one DataModel is fine. You want to include all data in
similarity calculations (1). It is also good to have all items
available in recommendation (2); you don't want to exclude an item
just because someone didn't like it. And in (3) you do not need to
filter out items the user has rated; that's done already.
(1) would include categories, which should not themselves be recommended;
that's why (2) is used to pick the recommendations from. (2) would contain
the liked items of every user, which includes items that are disliked by
other users. (3) is for filtering out items that the user has not rated
but has been presented with before.
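That filtering step could be sketched as follows, following the shape of Mahout's IDRescorer interface (redeclared here so the example stands alone): the rescorer leaves scores unchanged and filters out anything the user has already been presented with, whether rated or skipped.

```java
import java.util.Set;

// Redeclared locally to keep the sketch self-contained; Mahout defines
// the same two methods on org.apache.mahout.cf.taste.recommender.IDRescorer.
interface IDRescorer {
  double rescore(long id, double originalScore);
  boolean isFiltered(long id);
}

class SeenItemRescorer implements IDRescorer {

  private final Set<Long> seenByUser; // items shown before, including skipped ones

  SeenItemRescorer(Set<Long> seenByUser) {
    this.seenByUser = seenByUser;
  }

  @Override
  public double rescore(long id, double originalScore) {
    return originalScore; // scores are untouched; this rescorer only filters
  }

  @Override
  public boolean isFiltered(long id) {
    return seenByUser.contains(id); // drop anything already presented
  }
}
```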