Hi, what is "emon"?
1. I do create "look-with" recommendations. Really it's just the "raw" output from ItemSimilarityJob with booleanData=true and LLR as the similarity function (your suggestion).
2. I do create "similar" recommendations. I do apply a category filter before serving these recommendations.
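For illustration, one possible shape of such a category filter as a batch post-processing step over the raw item-item output, sketched in Pig; the item_category input, the three-column output layout, and all path parameters are assumptions rather than part of the pipeline described in this thread:

-- raw item-item pairs from ItemSimilarityJob: itemA, itemB, similarity score
simPairs = LOAD '$similarItems' USING PigStorage('\t') AS (item_a:long, item_b:long, score:double);

-- hypothetical item -> category mapping, loaded twice so both sides of a pair can be joined
catA = LOAD '$itemCategory' USING PigStorage('\t') AS (item_id:long, category:chararray);
catB = LOAD '$itemCategory' USING PigStorage('\t') AS (item_id:long, category:chararray);

-- attach a category to each side of the pair
joinedA  = JOIN simPairs BY item_a, catA BY item_id;
joinedAB = JOIN joinedA BY simPairs::item_b, catB BY item_id;

-- keep only pairs whose categories match, i.e. drop the water heater recommended for an iPhone
sameCategory = FILTER joinedAB BY joinedA::catA::category == catB::category;

filtered = FOREACH sameCategory GENERATE joinedA::simPairs::item_a AS item_a,
                                         joinedA::simPairs::item_b AS item_b,
                                         joinedA::simPairs::score  AS score;
STORE filtered INTO '$similarItemsFiltered' USING PigStorage('\t');

For the "look-with" case, where accessories from other categories are wanted, the same join could filter against an allowed list of category pairs instead of requiring an exact match.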
"look-with", means other users watched iPhone case and other accessory with iphone. I do have accessory for iPhone here, but also water heating device... similar - means show only other smarphones as recommendations to iPhone. Right now the problem is in water heating device in 'look-with' (category filter not applied). How can I put away such kind of recommendations and why Do I get them? 2014-08-19 18:01 GMT+04:00 Pat Ferrel <[email protected]>: > That sounds much better. > > Do you have metadata like product category? Electronics vs. home > appliance? One easy thing to do if you have categories in your catalog is > filter by the same category as the item being viewed. > > BTW it sounds like you have an emon > On Aug 19, 2014, at 12:53 AM, Serega Sheypak <[email protected]> > wrote: > > Hi, I 've used LLR with properties you've suggested. > Right now I have a trouble. > A trouble: > Water heat device ( > http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg > ) > is recommedned for iPhone. And it has one of the highest score. > good things: > iPhone cases ( > > https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg > ) > are recommedned for iPhone, It's good > Other smartphones are recommended to iPhone, it's good > Other iPhones are recommedned to iPhone. It's good. 16GB recommended to > 32GB, e.t.c. > > What could be a reason for recommending "Water heat device " to iPhone? > iPhone is one of the most popular item. There should be a lot of people > viewing iPhone with "Water heat device "? > > > > 2014-08-18 20:15 GMT+04:00 Pat Ferrel <[email protected]>: > > > Oh, and as to using different algorithms, this is an “ensemble” method. > In > > the paper they are talking about using widely differing algorithms like > ALS > > + Cooccurrence + … This technique was used to win the Netflix prize but > in > > practice the improvements may be to small to warrant running multiple > > pipelines. In any case it isn’t the first improvement you may want to > try. > > For instance your UI will have a drastic effect on how well you recs do, > > and there are other much easier techniques that we can talk about once > you > > get the basics working. > > > > > > On Aug 18, 2014, at 9:04 AM, Pat Ferrel <[email protected]> wrote: > > > > When beginning to use a recommender from Mahout I always suggest you > start > > from the defaults. These often give the best results—then tune afterwards > > to improve. > > > > Your intuition is correct that multiple actions can be used to improve > > results but get the basics working first. The easiest way to use multiple > > actions is to use spark-itemsimilarity so since you are using mapreduce > for > > now, just use one action. > > > > I would not try to combine the results from two similarity measures there > > is no benefit since LLR is better than any of them, at least I’ve never > > seen it loose. Below is my experience with trying many of the similarity > > metrics on exactly the same data. I did cross-validation with precision > > (MAP, mean average precision). LLR wins in other cases I’ve tried too. So > > LLR is the only method presently used in the Spark version of > > itemsimilarity. > > > > <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg> > > > > If you still get weird results double check your ID mapping. Run a small > > bit of data through and spot check the mapping by hand. > > > > At some point you will want to create a cross-validation test. 
> > Such a test is good as a sort of integration sanity check when making changes to the recommender. You run cross-validation using standard test data to see if the score changes drastically between releases. Big changes may indicate a bug. At the beginning it will help you tune, as in the case above where it helped decide on LLR.
> >
> > On Aug 18, 2014, at 1:43 AM, Serega Sheypak <[email protected]> wrote:
> >
> > Thank you very much. I'll do what you are saying in bullets 1...5 and try again.
> >
> > I also tried:
> > 1. calculate the data using COSINE_SIMILARITY
> > 2. calculate the same data using COOCCURRENCE_SIMILARITY
> > 3. join #1 and #2 where COOCCURRENCE >= $threshold
> >
> > where the threshold is some empirical integer value; I've used "2". The idea is to filter out item pairs which never-ever met together... Please see this link:
> > http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
> >
> > If I replace COSINE_SIMILARITY with LLR and booleanData=true, does this approach still make sense, or is it a useless waste of time?
> >
> > "What do you mean the similar items are terrible? How are you measuring that?" I have eyeball testing only. I did automate preparation -> calculation -> HBase upload -> web-app serving, but I didn't automate testing.
> >
> > 2014-08-18 5:16 GMT+04:00 Pat Ferrel <[email protected]>:
> >
> >> The things that stand out:
> >>
> >> 1) Remove your maxSimilaritiesPerItem option! 50000 maxSimilaritiesPerItem will _kill_ performance and give no gain; leave this setting at the default of 500.
> >> 2) Use only one action. What do you want the user to do? Do you want them to read a page? Then train on item page views. If those pages lead to a purchase then you want to recommend purchases, so train on user purchases.
> >> 3) Remove your minPrefsPerUser option; this should never be 0 or it will leave users in the training data that have no data and may contribute to longer runs with no gain.
> >> 4) This is a pretty small Hadoop cluster for the size of your data, but I bet changing #1 will noticeably reduce the runtime.
> >> 5) Change --similarityClassname to SIMILARITY_LOGLIKELIHOOD.
> >> 6) Remove your --booleanData option since LLR ignores weights.
> >>
> >> Remember that this is not the same as personalized recommendations. This method alone will show the same "similar items" for all users.
> >>
> >> Sorry, but both your "recommendation" types sound like the same thing. Using both item page views _and_ clicks on recommended items will both lead to an item page view, so you have two actions that lead to the same thing, right? Just train on the item page view (unless you really want the user to make a purchase).
> >>
> >> What do you mean the similar items are terrible? How are you measuring that? Are you doing cross-validation measuring precision, or A/B testing? What looks bad to you may be good; the eyeball test is not always reliable. If they are coming up completely crazy or random then you may have a bug in your ID translation logic.
> >>
> >> It sounds like you have enough data to produce good results.
> >>
> >> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <[email protected]> wrote:
> >>
> >> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too much, but enough for the start.
> >> 2. I run it as an Oozie action:
> >> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >>     <java>
> >>         <job-tracker>${jobTracker}</job-tracker>
> >>         <name-node>${nameNode}</name-node>
> >>         <prepare>
> >>             <delete path="${mahoutOutputDir}/primary" />
> >>             <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >>         </prepare>
> >>         <configuration>
> >>             <property>
> >>                 <name>mapred.queue.name</name>
> >>                 <value>default</value>
> >>             </property>
> >>         </configuration>
> >>
> >>         <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >>
> >>         <arg>--input</arg>
> >>         <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id, item_id, pref [pref can be 3 or 5: 3 is VIEW item, 5 is CLICK on recommendation, an attempt to increase the quality of the recommender...] -->
> >>
> >>         <arg>--output</arg>
> >>         <arg>${mahoutOutputDir}/primary</arg>
> >>
> >>         <arg>--similarityClassname</arg>
> >>         <arg>SIMILARITY_COSINE</arg>
> >>
> >>         <arg>--maxSimilaritiesPerItem</arg>
> >>         <arg>50000</arg>
> >>
> >>         <arg>--minPrefsPerUser</arg>
> >>         <arg>0</arg>
> >>
> >>         <arg>--booleanData</arg>
> >>         <arg>false</arg>
> >>
> >>         <arg>--tempDir</arg>
> >>         <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >>     </java>
> >>     <ok to="to-narrow-table"/>
> >>     <error to="kill"/>
> >> </action>
> >>
> >> 3) RANK does it; here is the script:
> >>
> >> -- user, item, pref previously prepared by Hive
> >> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as (user_id:chararray, item_id:long, pref:double);
> >>
> >> -- get distinct users from the whole input
> >> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >>
> >> -- get distinct items from the whole input
> >> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >>
> >> -- rank users 1...N
> >> rankUsers_ = RANK distUserId;
> >> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >>
> >> -- rank items 1...M
> >> rankItems_ = RANK distItemId;
> >> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >>
> >> -- join and remap the natural user_id, item_id to the ranks 1..N, 1..M
> >> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING 'skewed';
> >> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by item_id using 'replicated';
> >>
> >> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id as user_id,
> >>                                          rankItems::rank_id as item_id,
> >>                                          joinedUsers::user_item_pref::pref as pref;
> >>
> >> -- store the mappings for later remapping from ranks back to the natural values
> >> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using PigStorage('\t');
> >> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using PigStorage('\t');
> >> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs' using PigStorage('\t');
> >>
> >> 4) I've seen this idea in a different discussion, that different weights for different actions are not good. Sorry, I don't understand what you suggest.
> >> I have two kinds of actions: user viewed an item, and user clicked on a recommended item (a recommendation produced by my item similarity system).
> >> I want to produce two kinds of recommendations:
> >> 1. current item + recommend other items which other users visit in conjunction with the current item
> >> 2. similar items: recommend items similar to the currently viewed item.
> >> What can I try?
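One concrete way to follow the "use only one action" advice given above in the thread is to keep only the view action and drop the explicit weights before the ID remapping step. A minimal Pig sketch, treating pref == 3 as the view action per the comment in the Oozie configuration above; the output path is an assumption:

-- the same triples the script above loads: user_id, item_id, pref (3 = view, 5 = click on recommendation)
user_item_pref = LOAD '$user_item_pref' USING PigStorage(',') AS (user_id:chararray, item_id:long, pref:double);

-- keep only the primary action (item views)
views = FILTER user_item_pref BY pref == 3.0;

-- boolean data: at most one row per (user, item), with a constant preference of 1
booleanPrefs = DISTINCT (FOREACH views GENERATE user_id, item_id, 1.0 AS pref);

-- hypothetical output path; this would replace the weighted triples fed into the ID remapping above
STORE booleanPrefs INTO '$booleanViewPrefs' USING PigStorage(',');

Since LLR ignores the weights anyway, as noted above, the constant third column is only there to keep the input layout unchanged.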
> >> Is LLR (http://en.wikipedia.org/wiki/Log-likelihood_ratio) the same as SIMILARITY_LOGLIKELIHOOD?
> >>
> >> Right now I do get awful recommendations and I can't understand what I can try next :((((((((((((
> >>
> >> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <[email protected]>:
> >>
> >>> 1) How many cores in the cluster? The whole idea behind mapreduce is that if you buy more CPUs you get a nearly linear decrease in runtime.
> >>> 2) What is your mahout command line with options, or how are you invoking Mahout? I have seen the Mahout mapreduce recommender take this long, so we should check what you are doing with downsampling.
> >>> 3) Do you really need to RANK your ids? That's a full sort. When using Pig I usually get the DISTINCT ones and assign an incrementing integer as the corresponding Mahout ID.
> >>> 4) Your #2, assigning different weights to different actions, usually does not work. I've done this before, compared offline metrics, and seen precision go down. I'd get this working using only your primary actions first. What are you trying to get the user to do? View something, buy something? Use that action as the primary preference and start out with a weight of 1 using LLR. With LLR the weights are not used anyway, so your data may not produce good results with mixed actions.
> >>>
> >>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> >>> 1) output from #2 can be directly ingested and will create output.
> >>> 2) multiple actions can be used with cross-cooccurrence, not by guessing at weights.
> >>> 3) output has your application-specific IDs preserved.
> >>> 4) it's about 10x faster than mapreduce and will do away with your ID translation steps.
> >>>
> >>> One caveat is that your cluster machines will need lots of memory. I have 8-16g on mine.
> >>>
> >>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]> wrote:
> >>>
> >>> 1. I do collect preferences for items using a 60-day sliding window: today - 60 days.
> >>> 2. I do prepare triples of user_id, item_id, discrete_pref_value (3 for an item view, 5 for clicking the recommendation block; the idea is to give more value to recommendations which attract visitor attention). I get ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000 distinct users.
> >>> 3. I do use the Apache Pig RANK function to rank all distinct user_id.
> >>> 4. I do the same for item_id.
> >>> 5. I do join the input dataset with the ranked datasets and provide input to Mahout with dense integer user_id, item_id.
> >>> 6. I do get the Mahout output and join the integer item_id back to get the natural key value.
> >>>
> >>> Step #1-2 takes ~40 min.
> >>> Steps #3-5 take ~1 hour.
> >>> The Mahout calculation takes ~3 hours.
> >>>
> >>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:
> >>>
> >>>> This really doesn't sound right. It should be possible to process almost a thousand times that much data every night without that much problem.
> >>>>
> >>>> How are you preparing the input data?
> >>>>
> >>>> How are you converting to Mahout id's?
> >>>>
> >>>> Even using python, you should be able to do the conversion in just a few minutes without any parallelism whatsoever.
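For step 6 of the pipeline described above (joining Mahout's integer item ids back to the natural keys), a minimal Pig sketch that reuses the $rankItems mapping stored by the earlier script; the three-column text layout assumed here for the ItemSimilarityJob output and the output path are assumptions:

-- item-item pairs produced by ItemSimilarityJob, still keyed by the dense rank ids
simPairs = LOAD '$mahoutOutputDir/primary' USING PigStorage('\t') AS (item_a:long, item_b:long, score:double);

-- the rank_id -> natural item_id mapping stored earlier, loaded twice for the two sides of a pair
rankItemsA = LOAD '$rankItems' USING PigStorage('\t') AS (rank_id:long, item_id:long);
rankItemsB = LOAD '$rankItems' USING PigStorage('\t') AS (rank_id:long, item_id:long);

-- map both sides back to natural item ids (the mapping is small enough for replicated joins)
joinedA  = JOIN simPairs BY item_a, rankItemsA BY rank_id USING 'replicated';
joinedAB = JOIN joinedA BY simPairs::item_b, rankItemsB BY rank_id USING 'replicated';

naturalPairs = FOREACH joinedAB GENERATE joinedA::rankItemsA::item_id AS item_a,
                                         rankItemsB::item_id AS item_b,
                                         joinedA::simPairs::score AS score;
STORE naturalPairs INTO '$similarItemsNaturalIds' USING PigStorage('\t');

The user-side mapping stored in '$rankUsers' is not needed in this sketch because the item-similarity output contains only item ids.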
> >>>>
> >>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]> wrote:
> >>>>
> >>>>> Hi, we are trying to calculate ItemSimilarity.
> >>>>> Right now we have 2*10^7 input lines. I do provide the input data as raw text each day to recalculate item similarities. We do get +100..1000 new items each day.
> >>>>> 1. It takes too much time to prepare the input data.
> >>>>> 2. It takes too much time to convert user_id, item_id to Mahout ids.
> >>>>>
> >>>>> Is there any possibility to provide data to the Mahout mapreduce ItemSimilarity using some binary format with compression?
