Thanks for the answer, Ted.

On 21.11.2011, at 16:20, Ted Dunning wrote:
> Your product is subject to seasonality constraints (which teas are likely
> right now) and repeat buying. I would separate out the recommendation of
> repeat buys from the recommendation of new items.

Actually I want to generate an email with diverse recommendations. Something like:

Your personal top sellers: ... 3 items ...
Special Winter Sales: ... 3 items ...
This might be interesting for you: ... 6 items ...
This is new in our store: ... 3 items ...

> You may also find that item-item links on your web site are helpful. These
> are easy to get using this system.

Yes, the website is actually already using some very basic item-to-item
recommendations. So I am more interested in the newsletter part, especially
because there I can track which items are really attractive and which aren't.
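For the "This might be interesting for you" block I am currently thinking
along these lines. This is only a sketch, not the final newsletter code: it
uses the item-based recommender with log-likelihood similarity from the
benchmark quoted below, "tea-preferences.csv" is just a stand-in for the real
data export, and loadCustomerIds() is a placeholder for reading the customer
IDs from the shop database.

import java.io.File;
import java.util.Arrays;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class NewsletterRecommendationBlock {

  public static void main(String[] args) throws Exception {
    // userID,itemID,value export of the order data ("tea-preferences.csv" is only an example)
    DataModel model = new FileDataModel(new File("tea-preferences.csv"));

    // item-based recommender with log-likelihood similarity, as in the benchmark quoted below
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));

    for (long userID : loadCustomerIds()) {
      // six items for the "This might be interesting for you" block
      List<RecommendedItem> block = recommender.recommend(userID, 6);
      for (RecommendedItem item : block) {
        // recommendedBecause() returns the already bought items that caused this
        // recommendation, which is handy for a "because you bought ..." line
        System.out.println(userID + " -> " + item.getItemID() + " because of "
            + recommender.recommendedBecause(userID, item.getItemID(), 2));
      }
    }
  }

  // placeholder: in the real newsletter job the IDs come from the shop database
  private static List<Long> loadCustomerIds() {
    return Arrays.asList(1L, 2L, 3L);
  }
}

The personal top sellers, winter sales and new-in-store blocks would probably
come straight from the shop database rather than from Mahout.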
/Manuel

> On Mon, Nov 21, 2011 at 11:46 AM, Manuel Blechschmidt <
> [email protected]> wrote:
>
>> Hello Sean,
>>
>> On 21.11.2011, at 12:16, Sean Owen wrote:
>>
>>> Yes, because you have fewer items, an item-item-similarity-based algorithm
>>> probably runs much faster.
>>
>> Thanks for your blazing fast feedback.
>>
>>> I would not necessarily use the raw number of kg as a preference. It's not
>>> really true that someone who buys 10 kg of an item likes it 10x more than
>>> one he buys 1 kg of. Maybe the second spice is much more valuable? I would
>>> at least try taking the logarithm of the weight, but I think this is very
>>> noisy as a proxy for "preference". It creates illogical leaps -- because
>>> one user bought 85 kg of X, and Y is "similar" to X, this would conclude
>>> that you're somewhat likely to buy 85 kg of Y too. I would probably not
>>> use weight at all this way.
>>
>> Thanks for these suggestions. I will consider integrating a logarithmic
>> weight into the recommender. At the moment I am more concerned with getting
>> the user feedback component working. From some manual tests I can already
>> tell that the recommendations for some users make sense.
>>
>> Based on my own profile I can tell that when I buy more of a certain
>> product, I also like the product more.
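Coming back to the logarithmic weight from the quoted exchange above, I would
probably build the DataModel roughly like this. This is only a sketch I have
not run yet; the orderedGrams map stands for however the order quantities get
aggregated per user and item, and log1p is just a first guess for the scaling.

import java.util.Map;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class LogWeightModelBuilder {

  // orderedGrams: userID -> (itemID -> total ordered amount in grams, 50 g default
  // already applied); loading and aggregating the order lines is not shown here
  public static DataModel build(Map<Long, Map<Long, Double>> orderedGrams) {
    FastByIDMap<PreferenceArray> userData = new FastByIDMap<PreferenceArray>();
    for (Map.Entry<Long, Map<Long, Double>> user : orderedGrams.entrySet()) {
      Map<Long, Double> items = user.getValue();
      PreferenceArray prefs = new GenericUserPreferenceArray(items.size());
      int i = 0;
      for (Map.Entry<Long, Double> item : items.entrySet()) {
        prefs.setUserID(i, user.getKey());
        prefs.setItemID(i, item.getKey());
        // log1p of the grams instead of the raw value, so that an 85 kg order
        // does not dominate the similarities completely
        prefs.setValue(i, (float) Math.log1p(item.getValue()));
        i++;
      }
      userData.put(user.getKey(), prefs);
    }
    return new GenericDataModel(userData);
  }
}

Whether that actually improves the error compared to the raw grams I can then
check with the same evaluator as in the original mail below.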
>> I am also thinking about some seasonal tweaking. Tea is a very seasonal
>> product: during winter and Christmas other flavors are sold than in summer.
>> http://diuf.unifr.ch/main/is/sites/diuf.unifr.ch.main.is/files/documents/publications/WS07-08-011.pdf
>>
>>> It is therefore not surprising that log-likelihood works well, since it
>>> actually ignores this value.
>>>
>>> (You mentioned RMSE but your evaluation metric is
>>> average-absolute-difference -- L1, not L2).
>>
>> You are right, RMSE (root mean squared error) is wrong. I think it is MAE
>> (mean absolute error).
>>
>>> This is quite a small data set so you should have no performance issues.
>>> Your evaluations, which run over all users in the data set, are taking
>>> mere seconds. I am sure you could get away with much less
>>> memory/processing if you like.
>>
>> This is by far good enough. The more important part is the newsletter
>> sending. I have to generate about 10,000 emails, and that causes more
>> headaches than the recommender.
>>
>> /Manuel
>>
>>> On Mon, Nov 21, 2011 at 11:06 AM, Manuel Blechschmidt <
>>> [email protected]> wrote:
>>>
>>>> Hello Mahout team, hello users,
>>>>
>>>> a friend and I are currently evaluating recommendation techniques for
>>>> personalizing a newsletter for a company selling tea, spices and some
>>>> other products. Mahout is such a great product; it saves me hours of
>>>> time and a lot of money. Because I want to give something back, I am
>>>> writing this small case study to the mailing list.
>>>>
>>>> I am running an offline test of which recommender is the most accurate
>>>> one. Further, I am interested in runtime behavior such as memory
>>>> consumption and execution time.
>>>>
>>>> The data contains implicit feedback. The preference of a user is the
>>>> amount in grams that he bought of a certain product (453 g ~ 1 pound).
>>>> If a certain product does not have this data, the value is replaced
>>>> with 50. So basically I want Mahout to predict how much of a certain
>>>> product a user will buy next. This is also helpful for demand planning.
>>>> I am currently not using any time data because I did not find a
>>>> recommender which uses such data.
>>>>
>>>> Users: 12858
>>>> Items: 5467
>>>> Preferences: 121304
>>>> MaxPreference: 85850.0 (meaning that someone ordered 85 kg of a certain
>>>> tea or spice)
>>>> MinPreference: 50.0
>>>>
>>>> Here are the benchmarks for accuracy in RMSE. They change by about 15%
>>>> during every run of the evaluation:
>>>>
>>>> Evaluation of randomBased (baseline): 43045.380570443434
>>>> (RandomRecommender(model)) (Time: ~0.3 s) (Memory: 16 MB)
>>>> Evaluation of ItemBased with Pearson correlation: 315.5804958647985
>>>> (GenericItemBasedRecommender(model, PearsonCorrelationSimilarity(model)))
>>>> (Time: ~1 s) (Memory: 35 MB)
>>>> Evaluation of ItemBased with uncentered cosine: 198.25393235323375
>>>> (GenericItemBasedRecommender(model, UncenteredCosineSimilarity(model)))
>>>> (Time: ~1 s) (Memory: 32 MB)
>>>> Evaluation of ItemBased with log-likelihood: 176.45243607278724
>>>> (GenericItemBasedRecommender(model, LogLikelihoodSimilarity(model)))
>>>> (Time: ~5 s) (Memory: 42 MB)
>>>> Evaluation of UserBased 3 with Pearson correlation: 1378.1188069379868
>>>> (GenericUserBasedRecommender(model, NearestNUserNeighborhood(3,
>>>> PearsonCorrelationSimilarity(model), model),
>>>> PearsonCorrelationSimilarity(model))) (Time: ~52 s) (Memory: 57 MB)
>>>> Evaluation of UserBased 20 with Pearson correlation: 1144.1905989614288
>>>> (GenericUserBasedRecommender(model, NearestNUserNeighborhood(20,
>>>> PearsonCorrelationSimilarity(model), model),
>>>> PearsonCorrelationSimilarity(model))) (Time: ~51 s) (Memory: 57 MB)
>>>> Evaluation of SlopeOne: 464.8989330869532 (SlopeOneRecommender(model))
>>>> (Time: ~4 s) (Memory: 604 MB)
>>>> Evaluation of SVD-based: 326.1050823499026 (ALSWRFactorizer(model, 100,
>>>> 0.3, 5)) (Time: ) (Memory: 691 MB)
>>>>
>>>> These were measured with the following method:
>>>>
>>>> RecommenderEvaluator evaluator =
>>>>     new AverageAbsoluteDifferenceRecommenderEvaluator();
>>>> double evaluation = evaluator.evaluate(randomBased, null, myModel, 0.9, 1.0);
>>>>
>>>> Memory usage was about 50 MB in the item-based cases. Slope One and the
>>>> SVD-based recommender seem to use the most memory (615 MB and 691 MB).
>>>>
>>>> The performance differs a lot. The fastest recommenders were the
>>>> item-based ones; they took about 1 to 5 seconds
>>>> (PearsonCorrelationSimilarity and UncenteredCosineSimilarity ~1 s,
>>>> LogLikelihoodSimilarity ~5 s). The user-based ones were a lot slower.
>>>>
>>>> The conclusion is that in my case the item-based approach is the
>>>> fastest and most accurate one and has the lowest memory consumption.
>>>> Further, I can use the recommendedBecause function.
>>>>
>>>> Here is the spec of the computer:
>>>> 2.3 GHz Intel Core i5 (4 cores), 1024 MB for the Java virtual machine.
>>>>
>>>> In the next step, probably within the next two months, I have to design
>>>> a newsletter and send it to the customers. Then I can benchmark the
>>>> user acceptance rate of the recommendations.
>>>>
>>>> Any suggestions for enhancements are appreciated. If anybody is
>>>> interested in the dataset or the evaluation code, send me a private
>>>> email. I might be able to convince the company to give out the dataset
>>>> if the person is doing some interesting research.
>>>>
>>>> /Manuel
>>>> --
>>>> Manuel Blechschmidt
>>>> Dortustr. 57
>>>> 14467 Potsdam
>>>> Mobile: 0173/6322621
>>>> Twitter: http://twitter.com/Manuel_B
>>
>> --
>> Manuel Blechschmidt
>> Dortustr. 57
>> 14467 Potsdam
>> Mobile: 0173/6322621
>> Twitter: http://twitter.com/Manuel_B

--
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobile: 0173/6322621
Twitter: http://twitter.com/Manuel_B
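PS: For anyone who wants to reproduce the numbers above, a self-contained
sketch of the evaluation for the item-based log-likelihood setup looks roughly
like this; "tea-preferences.csv" is again only an example file name, and the
timing and memory measurements are left out.

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class EvaluationSketch {

  public static void main(String[] args) throws Exception {
    // same userID,itemID,value export as above ("tea-preferences.csv" is only an example)
    DataModel model = new FileDataModel(new File("tea-preferences.csv"));

    // builds a fresh recommender for every training split the evaluator hands over
    RecommenderBuilder itemBasedLogLikelihood = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel trainingData) throws TasteException {
        return new GenericItemBasedRecommender(trainingData,
            new LogLikelihoodSimilarity(trainingData));
      }
    };

    // average absolute difference (MAE), 90% training data, evaluated for all users
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    double score = evaluator.evaluate(itemBasedLogLikelihood, null, model, 0.9, 1.0);
    System.out.println("MAE: " + score);
  }
}

The other rows of the benchmark would then just use a different RecommenderBuilder.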
