The things that stand out:

1) Remove your maxSimilaritiesPerItem option! A value of 50000 for maxSimilaritiesPerItem will _kill_ performance and give no gain; leave this setting at the default of 500.
2) Use only one action. What do you want the user to do? Do you want them to read a page? Then train on item page views. If those pages lead to a purchase then you want to recommend purchases, so train on user purchases.
3) Remove your minPrefsPerUser option. It should never be 0, or it will leave users in the training data that have no data, which may contribute to longer runs with no gain.
4) This is a pretty small Hadoop cluster for the size of your data, but I bet changing #1 will noticeably reduce the runtime.
5) Change --similarityClassname to SIMILARITY_LOGLIKELIHOOD.
6) Remove your --booleanData option, since LLR ignores weights.
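To make the "LLR ignores weights" point concrete, here is a rough Python sketch of the standard log-likelihood ratio on a 2x2 co-occurrence table. This is not Mahout's code; the function names (x_log_x, entropy, llr, item_llr) and the triple layout are assumptions made for illustration only. The score is computed purely from counts of who touched which item, so a pref of 3 or 5 never enters the calculation.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # k11: users with both items, k12: only item A,
    # k21: only item B, k22: users with neither item
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding
    return max(0.0, 2.0 * (row + col - mat))


def item_llr(prefs, item_a, item_b):
    # prefs: iterable of (user, item, weight) triples (hypothetical layout).
    # The weight is deliberately unused: only presence/absence is counted.
    users, a_users, b_users = set(), set(), set()
    for user, item, _weight in prefs:
        users.add(user)
        if item == item_a:
            a_users.add(user)
        if item == item_b:
            b_users.add(user)
    k11 = len(a_users & b_users)
    k12 = len(a_users - b_users)
    k21 = len(b_users - a_users)
    k22 = len(users) - k11 - k12 - k21
    return llr(k11, k12, k21, k22)
```

Feeding the same (user, item) pairs with every weight set to 3 or every weight set to 5 gives an identical score, which is why mixing actions by weight buys nothing under LLR.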
Remember that this is not the same as personalized recommendations. This method alone will show the same “similar items” for all users.

Sorry, but both your “recommendation” types sound like the same thing. Item page views _and_ clicks on recommended items both lead to an item page view, so you have two actions that lead to the same thing, right? Just train on item page views (unless you really want the user to make a purchase).

What do you mean the similar items are terrible? How are you measuring that? Are you doing cross-validation measuring precision, or A/B testing? What looks bad to you may be good; the eyeball test is not always reliable. If they are coming up completely crazy or random then you may have a bug in your ID translation logic.

It sounds like you have enough data to produce good results.

On Aug 17, 2014, at 11:14 AM, Serega Sheypak wrote:

1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too much, but enough for a start.
2. I run it as an Oozie action.
<action name="run-mahout-primary-similarity-ItemSimilarityJob">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${mahoutOutputDir}/primary" />
            <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
        </prepare>
        <configuration>
            <property>
                <name>mapred.queue.name</name>
                <value>default</value>
            </property>
        </configuration>
        <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
        <arg>--input</arg>
        <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id, item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on recommendation, a kind of try to increase quality of recommender...]-->
        <arg>--output</arg>
        <arg>${mahoutOutputDir}/primary</arg>
        <arg>--similarityClassname</arg>
        <arg>SIMILARITY_COSINE</arg>
        <arg>--maxSimilaritiesPerItem</arg>
        <arg>50000</arg>
        <arg>--minPrefsPerUser</arg>
        <arg>0</arg>
        <arg>--booleanData</arg>
        <arg>false</arg>
        <arg>--tempDir</arg>
        <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
    </java>
    <ok to="to-narrow-table"/>
    <error to="kill"/>
</action>

3) RANK does it, here is a script:

--user, item, pref previously prepared by hive
user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as (user_id:chararray, item_id:long, pref:double);

--get distinct user from the whole input
distUserId = distinct(FOREACH user_item_pref GENERATE user_id);

--get distinct item from the whole input
distItemId = distinct(FOREACH user_item_pref GENERATE item_id);

--rank user 1....N
rankUsers_ = RANK distUserId;
rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;

--rank items 1....M
rankItems_ = RANK distItemId;
rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;

--join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING 'skewed';
joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by item_id using 'replicated';
projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id as user_id,
                                         rankItems::rank_id as item_id,
                                         joinedUsers::user_item_pref::pref as pref;

--store mapping for later remapping from RANK back to natural values
STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using PigStorage('\t');
STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using PigStorage('\t');
STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs' using PigStorage('\t');

4) I've seen this idea in a different discussion, that different weights for different actions are not good. Sorry, I don't understand what you suggest. I have two kinds of actions: user viewed item, and user clicked on recommended item (recommended item produced by my item similarity system).
I want to produce two kinds of recommendations:
1. current item + recommend other items which other users visit in conjunction with the current item
2. similar item: recommend items similar to the currently viewed item.
What can I try?
LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = LOG_LIKEHOOD?

Right now I get awful recommendations and I can't understand what I can try next :((((((((((((

2014-08-17 19:02 GMT+04:00 Pat Ferrel:

> 1) how many cores in the cluster? The whole idea behind mapreduce is that
> if you buy more CPUs you get a nearly linear decrease in runtime.
> 2) what is your mahout command line with options, or how are you invoking
> mahout? I have seen the Mahout mapreduce recommender take this long so we
> should check what you are doing with downsampling.
> 3) do you really need to RANK your ids, that’s a full sort? When using pig
> I usually get DISTINCT ones and assign an incrementing integer as the
> corresponding Mahout ID.
> 4) your #2, assigning different weights to different actions, usually does
> not work. I’ve done this before and compared offline metrics and seen
> precision go down. I’d get this working using only your primary actions
> first.
> What are you trying to get the user to do? View something, buy
> something? Use that action as the primary preference and start out with a
> weight of 1 using LLR. With LLR the weights are not used anyway so your
> data may not produce good results with mixed actions.
>
> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> 1) output from 2 can be directly ingested and will create output.
> 2) multiple actions can be used with cross-cooccurrence, not by guessing
> at weights.
> 3) output has your application specific IDs preserved.
> 4) it's about 10x faster than mapreduce and will do away with your ID
> translation steps
>
> One caveat is that your cluster machines will need lots of memory. I have
> 8-16g on mine.
>
> On Aug 17, 2014, at 1:26 AM, Serega Sheypak wrote:
>
> 1. I collect preferences for items using a 60-day sliding window: today -
> 60 days.
> 2. I prepare triples user_id, item_id, discrete_pref_value (3 for item
> view, 5 for clicking the recommendation block. The idea is to give more value
> to recommendations which attract visitor attention). I get ~ 20.000.000
> lines with ~1.000.000 distinct items and ~2.000.000 distinct users
> 3. I use the apache pig RANK function to rank all distinct user_id
> 4. I do the same for item_id
> 5. I join the input dataset with the ranked datasets and provide input to mahout
> with dense integer user_id, item_id
> 6. I get the mahout output and join integer item_id back to get the natural key
> value.
>
> step #1-2 takes ~ 40min
> step #3-5 takes ~1 hour
> mahout calc takes ~3hours
>
> 2014-08-17 10:45 GMT+04:00 Ted Dunning:
>
>> This really doesn't sound right. It should be possible to process almost a
>> thousand times that much data every night without that much problem.
>>
>> How are you preparing the input data?
>>
>> How are you converting to Mahout id's?
>>
>> Even using python, you should be able to do the conversion in just a few
>> minutes without any parallelism whatsoever.
>>
>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak wrote:
>>
>>> Hi, We are trying to calculate ItemSimilarity.
>>> Right now we have 2*10^7 input lines. I provide input data as raw text
>>> each day to recalculate item similarities. We get +100..1000 new items
>>> each day.
>>> 1. It takes too much time to prepare input data.
>>> 2. It takes too much time to convert user_id, item_id to mahout ids
>>>
>>> Is there any possibility to provide data to mahout mapreduce
>>> ItemSimilarity using some binary format with compression?
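
For what it's worth, the single-machine ID conversion mentioned in the quoted messages (distinct values plus an incrementing integer, no RANK and no full sort) could look roughly like the sketch below. It is only an illustration under assumptions: the function names (assign_dense_ids, translate, invert) are made up here, the triples are assumed to fit in memory, and the IDs start at 0 where the Pig RANK script starts at 1.

```python
def assign_dense_ids(values):
    # Map each distinct value to 0..N-1 in first-seen order.
    # A dict lookup per value replaces the full sort that RANK implies.
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def translate(prefs):
    # prefs: iterable of (user_id, item_id, pref) triples with natural keys
    prefs = list(prefs)
    user_ids = assign_dense_ids(user for user, _, _ in prefs)
    item_ids = assign_dense_ids(item for _, item, _ in prefs)
    dense = [(user_ids[user], item_ids[item], pref) for user, item, pref in prefs]
    return dense, user_ids, item_ids


def invert(mapping):
    # Reverse lookup, used to map the job output back to natural keys
    return {dense_id: natural for natural, dense_id in mapping.items()}
```

The two mappings play the same role as the '$rankUsers' and '$rankItems' outputs of the Pig script: keep them around and apply invert() to the item mapping when joining the similarity output back to natural item keys.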
