Yes, ecommerce.

>> #2 data includes #1 data, right?

Yes. #1 is the "raw" output of ItemSimilarity recommendations; #2 is #1 with the category filter applied.
I can't drop #1 "look-with" since #2 (with category filter) doesn't have accessories. The category filter would remove accessory recommendations for an iPhone and leave only other iPhones. The idea is to provide different "dimensions" of data: look-with; similar (look-with with the category filter applied); recommendations based on sales input, which would be named "also-bought"; etc.

"cross-cooccurrence" - what does it mean? Run ItemSimilarity with "views", then with "sales", and provide a merged result where item->item pairs exist in both outputs?

2014-08-19 20:37 GMT+04:00 Pat Ferrel <[email protected]>:

> emon is a typo
>
> I still don’t understand the difference between these “recommendations”:
> 1) "look-with recommendations" = recommended items clicked?
> 2) similar = items viewed by others?
> The recommendations clicked will lead to viewing an item, so #2 data
> includes #1 data, right? I would drop #1 and use only #2 data. Besides, if
> you only recommend items that have been recommended you will decrease
> sales because you will never show other items. Over time the recommended
> items will become out of date since you never mix in new items. You may
> always recommend an iPhone 5 even after it has been discontinued.
>
> If you know the category of an item--filter the recs by that category or
> related categories. You are already doing this in #2 below, so if you drop
> #1 there is no problem, correct? Users will not see the water heater with
> the iPhone.
>
> Question) Why do you get a water heater with an iPhone? Unless there is a
> bug somewhere, the data says that similar people looked at both. Item view
> data is not very predictive, and in any case you will get this type of
> thing if it exists in user behavior. There may even be a correlation
> between the need for an iPhone and a water heater that you don’t know
> about, or it may just be a coincidence. But for now let’s say it’s an
> anomaly in the data and just filter those out by category.
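[The merge being asked about above (keep only item->item pairs that appear in both the "views" output and the "sales" output) could be sketched as below. This is only an illustration of the question, not Mahout's actual cross-cooccurrence computation; the tab-separated pair file format is an assumption.]

```python
# Sketch (assumed format): each ItemSimilarity output file holds lines of
# "itemA<TAB>itemB<TAB>score"; we read each into a (itemA, itemB) -> score dict.

def load_pairs(path):
    """Parse 'itemA<TAB>itemB<TAB>score' lines into a pair -> score dict."""
    pairs = {}
    with open(path) as f:
        for line in f:
            a, b, score = line.rstrip("\n").split("\t")
            pairs[(a, b)] = float(score)
    return pairs

def merge_where_in_both(views, sales):
    """Keep only pairs present in BOTH outputs; keep the sales score,
    since sales are the stronger action."""
    return {pair: sales[pair] for pair in set(views) & set(sales)}
```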
>
> What I was beginning to say is that it sounds like you have an ECOM site.
> If so, do you have purchase data? Purchase data is usually much, much
> better than item view data. People tend to look at a lot of things, but
> when they purchase something it means a much higher preference than merely
> looking at something.
>
> The first rule of making a good recommender is: find the best action, one
> that shows a user preference in the strongest possible way. For ecommerce
> that usually means a purchase. Then once you have that working you can add
> more actions, but only with cross-cooccurrence; adding by weighting will
> not work with this type of recommender, it will only pollute your strong
> data with weaker actions.
>
> On Aug 19, 2014, at 8:18 AM, Serega Sheypak <[email protected]> wrote:
>
> Hi, what is "emon"?
> 1. I do create "look-with recommendations". It's really just the "raw"
> output from ItemSimilarityJob with booleanData=true and LLR as the
> similarity function (your suggestion).
> 2. I do create "similar" recommendations. I apply the category filter
> before serving recommendations.
>
> "look-with" means other users watched an iPhone case and other accessories
> with an iPhone. I do have accessories for iPhone here, but also a water
> heating device...
> similar - means show only other smartphones as recommendations for an
> iPhone.
>
> Right now the problem is the water heating device in 'look-with' (category
> filter not applied). How can I filter out such recommendations, and why do
> I get them?
>
> 2014-08-19 18:01 GMT+04:00 Pat Ferrel <[email protected]>:
>
> > That sounds much better.
> >
> > Do you have metadata like product category? Electronics vs. home
> > appliance? One easy thing to do if you have categories in your catalog
> > is filter by the same category as the item being viewed.
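[The category filter Pat suggests can be as simple as a lookup against the catalog. A minimal sketch with hypothetical data shapes (catalog as an item -> category dict, recs as (item, score) pairs, highest score first); names are illustrative, not from the actual system.]

```python
# Hypothetical sketch of the category filter for "similar" recommendations:
# keep only recommended items in the same category as the item being viewed.

def filter_by_category(viewed_item, recs, catalog):
    """catalog: item_id -> category; recs: list of (item_id, score)."""
    target = catalog.get(viewed_item)
    return [(item, score) for item, score in recs
            if catalog.get(item) == target]

# Example: a water heater recommended for an iPhone is dropped.
catalog = {"iphone5": "smartphone", "iphone4": "smartphone",
           "heater1": "appliance"}
recs = [("heater1", 0.93), ("iphone4", 0.88)]
print(filter_by_category("iphone5", recs, catalog))  # [('iphone4', 0.88)]
```

Filtering by "related categories", as Pat mentions, would just widen the membership test from equality to a set of allowed categories.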
> >
> > BTW it sounds like you have an emon
> >
> > On Aug 19, 2014, at 12:53 AM, Serega Sheypak <[email protected]> wrote:
> >
> > Hi, I've used LLR with the properties you've suggested.
> > Right now I have a problem:
> > A water heating device (
> > http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
> > ) is recommended for iPhone, and it has one of the highest scores.
> > Good things:
> > iPhone cases (
> > https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
> > ) are recommended for iPhone. It's good.
> > Other smartphones are recommended for iPhone. It's good.
> > Other iPhones are recommended for iPhone. It's good. 16GB recommended
> > for 32GB, etc.
> >
> > What could be a reason for recommending the water heating device for
> > iPhone? iPhone is one of the most popular items. There would have to be
> > a lot of people viewing iPhone together with the water heating device?
> >
> > 2014-08-18 20:15 GMT+04:00 Pat Ferrel <[email protected]>:
> >
> >> Oh, and as to using different algorithms, this is an “ensemble” method.
> >> In the paper they are talking about using widely differing algorithms
> >> like ALS + Cooccurrence + … This technique was used to win the Netflix
> >> prize, but in practice the improvements may be too small to warrant
> >> running multiple pipelines. In any case it isn’t the first improvement
> >> you may want to try. For instance your UI will have a drastic effect on
> >> how well your recs do, and there are other much easier techniques that
> >> we can talk about once you get the basics working.
> >>
> >> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <[email protected]> wrote:
> >>
> >> When beginning to use a recommender from Mahout I always suggest you
> >> start from the defaults. These often give the best results—then tune
> >> afterwards to improve.
> >>
> >> Your intuition is correct that multiple actions can be used to improve
> >> results, but get the basics working first. The easiest way to use
> >> multiple actions is spark-itemsimilarity, so since you are using
> >> mapreduce for now, just use one action.
> >>
> >> I would not try to combine the results from two similarity measures;
> >> there is no benefit since LLR is better than any of them, at least I’ve
> >> never seen it lose. Below is my experience with trying many of the
> >> similarity metrics on exactly the same data. I did cross-validation
> >> with precision (MAP, mean average precision). LLR wins in other cases
> >> I’ve tried too. So LLR is the only method presently used in the Spark
> >> version of itemsimilarity.
> >>
> >> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
> >>
> >> If you still get weird results, double check your ID mapping. Run a
> >> small bit of data through and spot check the mapping by hand.
> >>
> >> At some point you will want to create a cross-validation test. This is
> >> good as a sort of integration sanity check when making changes to the
> >> recommender. You run cross-validation using standard test data to see
> >> if the score changes drastically between releases. Big changes may
> >> indicate a bug. At the beginning it will help you tune, as in the case
> >> above where it helped decide on LLR.
> >>
> >> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <[email protected]> wrote:
> >>
> >> Thank you very much. I'll do what you are saying in bullets 1...5 and
> >> try again.
> >>
> >> I also tried:
> >> 1. calc data using SIMILARITY_COSINE
> >> 2. calc the same data using SIMILARITY_COOCCURRENCE
> >> 3. join #1 and #2 where cooccurrence >= $threshold
> >>
> >> Where threshold is some empirical integer value. I've used "2". The
> >> idea is to filter out item pairs which never-ever met together...
> >> Please see this link:
> >>
> >> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
> >>
> >> If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
> >> approach still make sense, or is it a useless waste of time?
> >>
> >> "What do you mean the similar items are terrible? How are you measuring
> >> that?" I only have eyeball testing.
> >> I did automate preparation -> calculation -> HBase upload -> web-app
> >> serving; I didn't automate testing.
> >>
> >> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <[email protected]>:
> >>
> >>> The things that stand out:
> >>>
> >>> 1) remove your maxSimilaritiesPerItem option! 50000
> >>> maxSimilaritiesPerItem will _kill_ performance and give no gain; leave
> >>> this setting at the default of 500
> >>> 2) use only one action. What do you want the user to do? Do you want
> >>> them to read a page? Then train on item page views. If those pages
> >>> lead to a purchase then you want to recommend purchases, so train on
> >>> user purchases.
> >>> 3) remove your minPrefsPerUser option; this should never be 0 or it
> >>> will leave users in the training data that have no data and may
> >>> contribute to longer runs with no gain.
> >>> 4) this is a pretty small Hadoop cluster for the size of your data,
> >>> but I bet changing #1 will noticeably reduce the runtime
> >>> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
> >>> 6) remove your --booleanData option since LLR ignores weights.
> >>>
> >>> Remember that this is not the same as personalized recommendations.
> >>> This method alone will show the same “similar items” for all users.
> >>>
> >>> Sorry, but both your “recommendation” types sound like the same thing.
> >>> Using both item page views _and_ clicks on recommended items will both
> >>> lead to an item page view, so you have two actions that lead to the
> >>> same thing, right?
> >>> Just train on an item page view (unless you really want the user to
> >>> make a purchase).
> >>>
> >>> What do you mean the similar items are terrible? How are you measuring
> >>> that? Are you doing cross-validation measuring precision, or A/B
> >>> testing? What looks bad to you may be good; the eyeball test is not
> >>> always reliable. If they are coming up completely crazy or random then
> >>> you may have a bug in your ID translation logic.
> >>>
> >>> It sounds like you have enough data to produce good results.
> >>>
> >>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <[email protected]> wrote:
> >>>
> >>> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not
> >>> too much, but enough for the start.
> >>> 2. I run it as an oozie action:
> >>>
> >>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >>>     <java>
> >>>         <job-tracker>${jobTracker}</job-tracker>
> >>>         <name-node>${nameNode}</name-node>
> >>>         <prepare>
> >>>             <delete path="${mahoutOutputDir}/primary" />
> >>>             <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >>>         </prepare>
> >>>         <configuration>
> >>>             <property>
> >>>                 <name>mapred.queue.name</name>
> >>>                 <value>default</value>
> >>>             </property>
> >>>         </configuration>
> >>>         <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >>>         <arg>--input</arg>
> >>>         <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
> >>>             item_id, pref [can be 3 or 5; 3 is VIEW item, 5 is CLICK on
> >>>             recommendation, a kind of try to increase quality of the
> >>>             recommender...] -->
> >>>         <arg>--output</arg>
> >>>         <arg>${mahoutOutputDir}/primary</arg>
> >>>         <arg>--similarityClassname</arg>
> >>>         <arg>SIMILARITY_COSINE</arg>
> >>>         <arg>--maxSimilaritiesPerItem</arg>
> >>>         <arg>50000</arg>
> >>>         <arg>--minPrefsPerUser</arg>
> >>>         <arg>0</arg>
> >>>         <arg>--booleanData</arg>
> >>>         <arg>false</arg>
> >>>         <arg>--tempDir</arg>
> >>>         <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >>>     </java>
> >>>     <ok to="to-narrow-table"/>
> >>>     <error to="kill"/>
> >>> </action>
> >>>
> >>> 3) RANK does it, here is the script:
> >>>
> >>> --user, item, pref previously prepared by hive
> >>> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> >>>     (user_id:chararray, item_id:long, pref:double);
> >>>
> >>> --get distinct users from the whole input
> >>> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >>>
> >>> --get distinct items from the whole input
> >>> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >>>
> >>> --rank users 1...N
> >>> rankUsers_ = RANK distUserId;
> >>> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >>>
> >>> --rank items 1...M
> >>> rankItems_ = RANK distItemId;
> >>> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >>>
> >>> --join and remap natural user_id, item_id to RANKs: 1..N, 1..M
> >>> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING 'skewed';
> >>> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by item_id using 'replicated';
> >>>
> >>> projPrefs = FOREACH joinedItems GENERATE
> >>>     joinedUsers::rankUsers::rank_id as user_id,
> >>>     rankItems::rank_id as item_id,
> >>>     joinedUsers::user_item_pref::pref as pref;
> >>>
> >>> --store mappings for later remapping from RANK back to natural values
> >>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using PigStorage('\t');
> >>> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using PigStorage('\t');
> >>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs' using PigStorage('\t');
> >>>
> >>> 4) I've seen this idea in different discussions, that different
> >>> weights for different actions are not good.
> >>> Sorry, I don't understand what you suggest.
> >>> I have two kinds of actions: user viewed item, user clicked on a
> >>> recommended item (recommended item produced by my item similarity
> >>> system).
> >>> I want to produce two kinds of recommendations:
> >>> 1. current item + recommend other items which other users visit in
> >>> conjunction with the current item
> >>> 2. similar item: recommend items similar to the currently viewed item.
> >>> What can I try?
> >>> LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = SIMILARITY_LOGLIKELIHOOD?
> >>>
> >>> Right now I get awful recommendations and I can't understand what I
> >>> can try next :((((((((((((
> >>>
> >>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <[email protected]>:
> >>>
> >>>> 1) how many cores in the cluster? The whole idea behind mapreduce is:
> >>>> you buy more cpus, you get a nearly linear decrease in runtime.
> >>>> 2) what is your mahout command line with options, or how are you
> >>>> invoking mahout? I have seen the Mahout mapreduce recommender take
> >>>> this long, so we should check what you are doing with downsampling.
> >>>> 3) do you really need to RANK your ids? That's a full sort. When
> >>>> using pig I usually get the DISTINCT ones and assign an incrementing
> >>>> integer as the corresponding Mahout ID.
> >>>> 4) your #2, assigning different weights to different actions, usually
> >>>> does not work. I've done this before, compared offline metrics, and
> >>>> seen precision go down. I'd get this working using only your primary
> >>>> actions first. What are you trying to get the user to do? View
> >>>> something, buy something? Use that action as the primary preference
> >>>> and start out with a weight of 1 using LLR. With LLR the weights are
> >>>> not used anyway, so your data may not produce good results with mixed
> >>>> actions.
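[For reference, the LLR score discussed throughout the thread (SIMILARITY_LOGLIKELIHOOD in Mahout) scores a 2x2 cooccurrence table: independent items score near 0, strongly associated items score high, and raw counts/weights drop out, which is why Pat says weights are ignored. A sketch following the entropy formulation of Ted Dunning's test; this is an illustration, not Mahout's actual code.]

```python
import math

def xlogx(x):
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 cooccurrence table:
    k11 = users who interacted with both items,
    k12, k21 = users who interacted with only one of them,
    k22 = users who interacted with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

A perfectly independent table (e.g. all four cells equal) scores 0, which is what makes LLR good at filtering out pairs that merely co-occur because both items are popular.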
> >>>>
> >>>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> >>>> 1) output from 2 can be directly ingested and will create output.
> >>>> 2) multiple actions can be used with cross-cooccurrence, not by
> >>>> guessing at weights.
> >>>> 3) output has your application-specific IDs preserved.
> >>>> 4) it's about 10x faster than mapreduce and will do away with your ID
> >>>> translation steps.
> >>>>
> >>>> One caveat is that your cluster machines will need lots of memory. I
> >>>> have 8-16g on mine.
> >>>>
> >>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]> wrote:
> >>>>
> >>>> 1. I collect preferences for items using a 60-day sliding window
> >>>> (today - 60 days).
> >>>> 2. I prepare triples of user_id, item_id, discrete_pref_value (3 for
> >>>> item view, 5 for clicking the recommendation block; the idea is to
> >>>> give more value to recommendations which attract visitor attention).
> >>>> I get ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000
> >>>> distinct users.
> >>>> 3. I use the apache pig RANK function to rank all distinct user_id
> >>>> 4. I do the same for item_id
> >>>> 5. I join the input dataset with the ranked datasets and provide
> >>>> input to mahout with dense integer user_id, item_id
> >>>> 6. I take the mahout output and join the integer item_id back to get
> >>>> the natural key value.
> >>>>
> >>>> step #1-2 takes ~40min
> >>>> step #3-5 takes ~1 hour
> >>>> mahout calc takes ~3 hours
> >>>>
> >>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:
> >>>>
> >>>>> This really doesn't sound right. It should be possible to process
> >>>>> almost a thousand times that much data every night without that much
> >>>>> problem.
> >>>>>
> >>>>> How are you preparing the input data?
> >>>>>
> >>>>> How are you converting to Mahout id's?
> >>>>>
> >>>>> Even using python, you should be able to do the conversion in just a
> >>>>> few minutes without any parallelism whatsoever.
> >>>>>
> >>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]> wrote:
> >>>>>
> >>>>>> Hi, we are trying to calculate ItemSimilarity.
> >>>>>> Right now we have 2*10^7 input lines. I provide input data as raw
> >>>>>> text each day to recalculate item similarities. We get +100..1000
> >>>>>> new items each day.
> >>>>>> 1. It takes too much time to prepare the input data.
> >>>>>> 2. It takes too much time to convert user_id, item_id to mahout ids.
> >>>>>>
> >>>>>> Is there any possibility to provide data to mahout mapreduce
> >>>>>> ItemSimilarity using some binary format with compression?
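[The single-pass conversion Ted alludes to (no full RANK sort needed) can be done with an in-memory dict, which is cheap at this scale (~2,000,000 users, ~1,000,000 items). A minimal sketch, assuming comma-separated user_id,item_id,pref input lines; the function name and shapes are illustrative.]

```python
# Sketch: map arbitrary user/item IDs to dense integers 0..N-1 in one pass.
# Mahout's mapreduce ItemSimilarityJob needs numeric IDs; the returned
# dicts double as the mappings needed to translate the output back.

def to_dense_ids(lines):
    users, items, triples = {}, {}, []
    for line in lines:
        user, item, pref = line.rstrip("\n").split(",")
        uid = users.setdefault(user, len(users))   # assign IDs in first-seen order
        iid = items.setdefault(item, len(items))
        triples.append((uid, iid, float(pref)))
    return triples, users, items
```

Since IDs are assigned in first-seen order, new items arriving each day simply extend the mapping rather than forcing a re-sort.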
