Hi, I've used LLR with the properties you suggested. Right now I have one problem: a water heating device (http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg) is recommended for the iPhone, and it has one of the highest scores.

The good things:
1. iPhone cases (https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg) are recommended for the iPhone. That's good.
2. Other smartphones are recommended for the iPhone. That's good.
3. Other iPhones are recommended for the iPhone. That's good.
4. The 16GB model is recommended for the 32GB model, etc.

What could be the reason for recommending the "water heating device" for the iPhone? The iPhone is one of the most popular items; there should be a lot of people viewing the iPhone together with the "water heating device", shouldn't there?
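To sanity-check a suspicious pair by hand, here is a minimal standalone sketch of the log-likelihood ratio score for one item pair, using the usual 2x2 contingency-table formulation (this is my own Python, not Mahout's code, and the counts in the example call are invented purely for illustration):

import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # unnormalized entropy used in the LLR formulation
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: users who viewed both items
    # k12: users who viewed the iPhone but not the water heater
    # k21: users who viewed the water heater but not the iPhone
    # k22: users who viewed neither
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + col_entropy < mat_entropy:
        return 0.0  # guard against tiny negative values from rounding
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# invented counts, just to show the call
print(llr(30, 200000, 500, 1800000))

A high LLR score means the pair co-occurs significantly differently from what the items' individual popularities would predict, so popularity alone should not push an unrelated item to the top; if it does, double-checking the ID mapping (as Pat suggests below) is probably the first thing to try.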
2014-08-18 20:15 GMT+04:00 Pat Ferrel <[email protected]>:

> Oh, and as to using different algorithms, this is an "ensemble" method. In the paper they are talking about using widely differing algorithms like ALS + Cooccurrence + ... This technique was used to win the Netflix prize, but in practice the improvements may be too small to warrant running multiple pipelines. In any case it isn't the first improvement you may want to try. For instance, your UI will have a drastic effect on how well your recs do, and there are other much easier techniques that we can talk about once you get the basics working.
>
> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <[email protected]> wrote:
>
> When beginning to use a recommender from Mahout I always suggest you start from the defaults. These often give the best results; then tune afterwards to improve.
>
> Your intuition is correct that multiple actions can be used to improve results, but get the basics working first. The easiest way to use multiple actions is spark-itemsimilarity, so since you are using mapreduce for now, just use one action.
>
> I would not try to combine the results from two similarity measures; there is no benefit since LLR is better than any of them, at least I've never seen it lose. Below is my experience with trying many of the similarity metrics on exactly the same data. I did cross-validation with precision (MAP, mean average precision). LLR wins in other cases I've tried too. So LLR is the only method presently used in the Spark version of itemsimilarity.
>
> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
>
> If you still get weird results, double check your ID mapping. Run a small bit of data through and spot check the mapping by hand.
>
> At some point you will want to create a cross-validation test. This is good as a sort of integration sanity check when making changes to the recommender. You run cross-validation using standard test data to see if the score changes drastically between releases. Big changes may indicate a bug. At the beginning it will help you tune, as in the case above where it helped decide on LLR.
>
> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <[email protected]> wrote:
>
> Thank you very much. I'll do what you are saying in bullets 1...5 and try again.
>
> I also tried:
> 1. calc the data using SIMILARITY_COSINE
> 2. calc the same data using SIMILARITY_COOCCURRENCE
> 3. join #1 and #2 where cooccurrence >= $threshold
>
> where threshold is some empirical integer value; I've used "2". The idea is to filter out item pairs which never occur together... Please see this link:
> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
>
> If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this approach still make sense, or is it a useless waste of time?
>
> "What do you mean the similar items are terrible? How are you measuring that?" I only have eyeball testing. I did automate preparation -> calculation -> HBase upload -> web-app serving; I didn't automate testing.
>
> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <[email protected]>:
>
> > The things that stand out:
> >
> > 1) Remove your maxSimilaritiesPerItem option! maxSimilaritiesPerItem = 50000 will _kill_ performance and give no gain; leave this setting at the default of 500.
> > 2) Use only one action. What do you want the user to do? Do you want them to read a page? Then train on item page views. If those pages lead to a purchase then you want to recommend purchases, so train on user purchases.
> > 3) Remove your minPrefsPerUser option. This should never be 0 or it will leave users in the training data that have no data and may contribute to longer runs with no gain.
> > 4) This is a pretty small Hadoop cluster for the size of your data, but I bet changing #1 will noticeably reduce the runtime.
> > 5) Change --similarityClassname to SIMILARITY_LOGLIKELIHOOD.
> > 6) Remove your --booleanData option since LLR ignores weights.
> >
> > Remember that this is not the same as personalized recommendations. This method alone will show the same "similar items" for all users.
> >
> > Sorry, but both your "recommendation" types sound like the same thing. Using both item page views _and_ clicks on recommended items will both lead to an item page view, so you have two actions that lead to the same thing, right? Just train on an item page view (unless you really want the user to make a purchase).
> >
> > What do you mean the similar items are terrible? How are you measuring that? Are you doing cross-validation measuring precision, or A/B testing? What looks bad to you may be good; the eyeball test is not always reliable. If they are coming up completely crazy or random then you may have a bug in your ID translation logic.
> >
> > It sounds like you have enough data to produce good results.
> >
> > On Aug 17, 2014, at 11:14 AM, Serega Sheypak <[email protected]> wrote:
> >
> > 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too much, but enough for the start.
> > 2. I run it as an Oozie action:
> > <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >     <java>
> >         <job-tracker>${jobTracker}</job-tracker>
> >         <name-node>${nameNode}</name-node>
> >         <prepare>
> >             <delete path="${mahoutOutputDir}/primary" />
> >             <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >         </prepare>
> >         <configuration>
> >             <property>
> >                 <name>mapred.queue.name</name>
> >                 <value>default</value>
> >             </property>
> >         </configuration>
> >
> >         <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >         <arg>--input</arg>
> >         <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id, item_id, pref [pref can be 3 or 5: 3 is VIEW item, 5 is CLICK on a recommendation, an attempt to increase recommender quality...] -->
> >
> >         <arg>--output</arg>
> >         <arg>${mahoutOutputDir}/primary</arg>
> >
> >         <arg>--similarityClassname</arg>
> >         <arg>SIMILARITY_COSINE</arg>
> >
> >         <arg>--maxSimilaritiesPerItem</arg>
> >         <arg>50000</arg>
> >
> >         <arg>--minPrefsPerUser</arg>
> >         <arg>0</arg>
> >
> >         <arg>--booleanData</arg>
> >         <arg>false</arg>
> >
> >         <arg>--tempDir</arg>
> >         <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >     </java>
> >     <ok to="to-narrow-table"/>
> >     <error to="kill"/>
> > </action>
> >
> > 3) RANK does it; here is the script:
> >
> > --user, item, pref previously prepared by Hive
> > user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as (user_id:chararray, item_id:long, pref:double);
> >
> > --get distinct users from the whole input
> > distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >
> > --get distinct items from the whole input
> > distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >
> > --rank users 1...N
> > rankUsers_ = RANK distUserId;
> > rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >
> > --rank items 1...M
> > rankItems_ = RANK distItemId;
> > rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >
> > --join and remap natural user_id, item_id to RANKs: 1..N, 1..M
> > joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING 'skewed';
> > joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by item_id using 'replicated';
> >
> > projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id as user_id,
> >     rankItems::rank_id as item_id,
> >     joinedUsers::user_item_pref::pref as pref;
> >
> > --store the mappings for later remapping from RANK back to natural values
> > STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using PigStorage('\t');
> > STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using PigStorage('\t');
> > STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs' using PigStorage('\t');
> >
> > 4) I've seen this idea in a different discussion, that different weights for different actions are not good. Sorry, I don't understand what you suggest. I have two kinds of actions: user viewed an item, and user clicked on a recommended item (the recommended item was produced by my item similarity system).
> > I want to produce two kinds of recommendations:
> > 1. current item + recommend other items which other users visit in conjunction with the current item
> > 2. similar items: recommend items similar to the currently viewed item.
> > What can I try? Is LLR (http://en.wikipedia.org/wiki/Log-likelihood_ratio) the same as SIMILARITY_LOGLIKELIHOOD?
> >
> > Right now I get awful recommendations and I can't understand what I can try next :((((((((((((
> >
> > 2014-08-17 19:02 GMT+04:00 Pat Ferrel <[email protected]>:
> >
> >> 1) How many cores in the cluster? The whole idea behind mapreduce is that you buy more CPUs and get a nearly linear decrease in runtime.
> >> 2) What is your Mahout command line with options, or how are you invoking Mahout? I have seen the Mahout mapreduce recommender take this long, so we should check what you are doing with downsampling.
> >> 3) Do you really need to RANK your IDs? That's a full sort. When using Pig I usually get DISTINCT ones and assign an incrementing integer as the corresponding Mahout ID.
> >> 4) Your #2, assigning different weights to different actions, usually does not work. I've done this before, compared offline metrics, and seen precision go down. I'd get this working using only your primary action first. What are you trying to get the user to do? View something, buy something? Use that action as the primary preference and start out with a weight of 1 using LLR. With LLR the weights are not used anyway, so your data may not produce good results with mixed actions.
> >>
> >> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> >> 1) output from #2 can be directly ingested and will create output.
> >> 2) multiple actions can be used with cross-cooccurrence, not by guessing at weights.
> >> 3) output has your application-specific IDs preserved.
> >> 4) it's about 10x faster than mapreduce and will do away with your ID translation steps.
> >>
> >> One caveat is that your cluster machines will need lots of memory. I have 8-16g on mine.
> >>
> >> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]> wrote:
> >>
> >> 1. I collect preferences for items using a 60-day sliding window (today - 60 days).
> >> 2. I prepare triples of user_id, item_id, discrete_pref_value (3 for an item view, 5 for clicking the recommendation block; the idea is to give more value to recommendations which attract visitor attention). I get ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000 distinct users.
> >> 3. I use the Apache Pig RANK function to rank all distinct user_ids.
> >> 4. I do the same for item_id.
> >> 5. I join the input dataset with the ranked datasets and provide input to Mahout with dense integer user_id, item_id.
> >> 6. I take the Mahout output and join the integer item_id back to get the natural key value.
> >>
> >> Steps #1-2 take ~40 min.
> >> Steps #3-5 take ~1 hour.
> >> The Mahout calculation takes ~3 hours.
> >>
> >> 2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:
> >>
> >>> This really doesn't sound right. It should be possible to process almost a thousand times that much data every night without that much problem.
> >>>
> >>> How are you preparing the input data?
> >>>
> >>> How are you converting to Mahout IDs?
> >>>
> >>> Even using python, you should be able to do the conversion in just a few minutes without any parallelism whatsoever.
> >>>
> >>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]> wrote:
> >>>
> >>>> Hi, we are trying to calculate ItemSimilarity.
> >>>> Right now we have 2*10^7 input lines. I provide the input data as raw text each day to recalculate item similarities. We get +100..1000 new items each day.
> >>>> 1. It takes too much time to prepare the input data.
> >>>> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
> >>>>
> >>>> Is there any possibility to provide data to Mahout mapreduce ItemSimilarity using some binary format with compression?
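On the ID conversion step Ted and Pat mention: a minimal single-machine sketch in Python that replaces the Pig RANK full sort with distinct IDs and incrementing integers might look like the following. The file names and the comma-separated input format are assumptions borrowed from the Pig script above, not an existing tool:

import csv

def dense_id(mapping, natural):
    # assign an incrementing integer to each distinct natural ID (0, 1, 2, ...)
    if natural not in mapping:
        mapping[natural] = len(mapping)
    return mapping[natural]

user_map, item_map = {}, {}

# user_id,item_id,pref -- the same comma-separated triples the Pig script loads
with open("user_item_pref.csv") as f, open("projPrefs.tsv", "w") as out:
    for user_id, item_id, pref in csv.reader(f):
        out.write("%d\t%d\t%s\n" % (dense_id(user_map, user_id),
                                    dense_id(item_map, item_id), pref))

# keep the mappings so Mahout output can be joined back to natural keys
with open("rankUsers.tsv", "w") as out:
    for natural, dense in user_map.items():
        out.write("%d\t%s\n" % (dense, natural))
with open("rankItems.tsv", "w") as out:
    for natural, dense in item_map.items():
        out.write("%d\t%s\n" % (dense, natural))

Only the two dictionaries (roughly 3,000,000 keys for the data sizes mentioned above) have to fit in memory, and there is no full sort involved.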

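And on Pat's suggestion to build a cross-validation test: a bare-bones precision check over a held-out period could start from something like this sketch (not Mahout's evaluator; the train/test split, the value of k, and the dict-of-lists input shapes are all assumptions):

def average_precision_at_k(recommended, held_out, k=10):
    # recommended: ranked list of item IDs produced for one user
    # held_out: set of item IDs the user actually interacted with in the test period
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in held_out:
            hits += 1
            score += hits / rank
    return score / min(len(held_out), k) if held_out else 0.0

def mean_average_precision(recs_by_user, test_by_user, k=10):
    # recs_by_user: {user_id: [item_id, ...]}, test_by_user: {user_id: {item_id, ...}}
    users = [u for u in test_by_user if test_by_user[u]]
    if not users:
        return 0.0
    return sum(average_precision_at_k(recs_by_user.get(u, []), test_by_user[u], k)
               for u in users) / len(users)

Tracking this number between releases is the kind of sanity check Pat describes: big swings usually point to a bug rather than a real quality change.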