I think I did explain this below. Your user IDs must be in the range 0 to (number of rows - 1), and likewise your item IDs must be in the range 0 to (number of columns - 1). You get there by taking your application-specific IDs and mapping them to sequential non-negative integers. You need to maintain a mapping to and from Mahout IDs somewhere in your own code.
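A minimal sketch of such a two-way mapping, written in Python only for brevity (in a Java job a Guava HashBiMap would play the same role). The names IdDictionary and translate are made up for illustration and are not part of Mahout; the two sample rows are the ones used in the example in this message, and the comma-separated layout is assumed from that example.

```python
class IdDictionary:
    """Two-way mapping between application IDs and sequential integer IDs."""

    def __init__(self):
        self._to_mahout = {}  # application ID -> sequential ID
        self._to_app = []     # sequential ID -> application ID (list index)

    def encode(self, app_id):
        # Hand out the next unused integer the first time an ID is seen.
        if app_id not in self._to_mahout:
            self._to_mahout[app_id] = len(self._to_app)
            self._to_app.append(app_id)
        return self._to_mahout[app_id]

    def decode(self, mahout_id):
        # Translate a sequential ID from the job output back to the original.
        return self._to_app[mahout_id]

    def __len__(self):
        return len(self._to_app)


def translate(lines, users, items):
    """Rewrite 'user,item,pref' lines so both IDs are sequential integers."""
    encoded = []
    for line in lines:
        user, item, pref = (field.strip() for field in line.split(","))
        encoded.append(f"{users.encode(user)},{items.encode(item)},{pref}")
    return encoded


# Users and items get separate dictionaries, so both start from 0.
users, items = IdDictionary(), IdDictionary()
raw = ["-92, abc, 1.0", "75000x, jkl, 2.0"]
encoded = translate(raw, users, items)
print(encoded)
print(users.decode(0), items.decode(1))
```

Both dictionaries have to be persisted alongside the job input, because the job output can only be translated back with the same mapping that produced the input.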
For example, imagine input of the form:

-92, abc, 1.0
75000x, jkl, 2.0

Your first user ID is -92; give it Mahout ID = 0. Your next user ID is 75000x; give it Mahout ID = 1.
Your first item ID is abc; give it Mahout ID = 0. Your next item ID is jkl; give it Mahout ID = 1.
Keep doing this each time you see a new unique ID in your input, and so on. A Map will do this for you.

Then the input to Mahout would be:

0,0,1.0
1,1,2.0

The output will have Mahout IDs too, so you need to map recommendations for Mahout user ID 0 back to your user ID of -92, and do the same for all item IDs.

On Jul 25, 2014, at 11:55 AM, Serega Sheypak wrote:

I'm preparing data using Apache Hive:
user_id:long, item_it:long, preference[1.0, 2.0]
I don't understand "For most Mahout jobs you have to prepare your data to have Mahout IDs". What are "Mahout IDs"?
I tried to follow the Mahout site docs, but I didn't find anything there about Mahout IDs. Please explain.

2014-07-25 22:39 GMT+04:00 Pat Ferrel:

> Sorry, I haven't read this thread carefully, but it looks like you may be
> using the wrong IDs.
>
> For most Mahout jobs you have to prepare your data to have Mahout IDs. You
> do this by looking at each datum, and as you see a new unique
> application-specific user or item ID you give it a Mahout ID, starting
> from 0. So a Mahout ID can be thought of as a row or column number in a
> matrix. The Mahout IDs for rows will be 0 through (# of rows - 1), and the
> same for columns.
>
> This always requires that you translate into Mahout IDs, then after the job
> is run translate back into your application IDs. You need a bi-directional
> dictionary of some type. I use a HashBiMap from Guava.
>
> Also, I'd avoid the threshold for now. If you get that wrong it will mess
> things up badly, and it is very hard to tune. It's there for completeness,
> but I never use it.
>
>
> On Jul 25, 2014, at 12:55 AM, Serega Sheypak wrote:
>
> Hi, nothing helps...
> I do use Mahout 0.9 compiled for CDH 4.7
> I do provide only positive values
> I do use itemsimilarityJob and do get 2000 similarities for 1400 unique
> items
> Input data is:
> 16*10^6 preferences
> 4*10^6 users
> 0.6*10^ items
> I do use Pearson correlation and preference values are: 1.0 and 2.0
>
>
> 2014-07-22 9:32 GMT+04:00 Serega Sheypak:
>
>> Ok, I have recompiled Mahout 0.9 for CDH 4.7. I'll try this evening.
>> Right now I don't see how it can help me. As far as I know, the stuff I
>> try to use is pretty old and stable.
>> Looks like I apply it in the wrong way.
>>
>> There is an option for recommenditembased named "--threshold". I do
>> provide data for recommenditembased with preference values in range
>> [1.1..2.0].
>> I set --threshold to 1.2
>> Is --threshold absolute, taken from [1.1 .. 2+], or is it relative,
>> taken from [0.0 .. 0.99999]?
>>
>>
>> 2014-07-22 3:54 GMT+04:00 Ted Dunning:
>>
>>> That version is no longer supported. You should upgrade to 0.9
>>>
>>>
>>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak wrote:
>>>
>>>> 0.7-cdh4.7.0
>>>> Anyway, recommenditembased does produce these catalogs:
>>>>
>>>> /recommenditembased/temp/maxValues.bin
>>>> /recommenditembased/temp/norms.bin
>>>> /recommenditembased/temp/numNonZeroEntries.bin
>>>> /recommenditembased/temp/pairwiseSimilarity
>>>> /recommenditembased/temp/partialMultiply
>>>> /recommenditembased/temp/prePartialMultiply1
>>>> /recommenditembased/temp/prePartialMultiply2
>>>> /recommenditembased/temp/preparePreferenceMatrix
>>>> /recommenditembased/temp/similarityMatrix
>>>> /recommenditembased/temp/weights
>>>>
>>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
>>>> I need.
>>>> Right now I try to read it using
>>>>
>>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>> '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>>> ) as (intId: int, vector:tuple(cardinality:int,
>>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>>>
>>>>
>>>> Looks like the vector is empty... Or i do something wrong.
>>>>
>>>>
>>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning:
>>>>
>>>>> Which version of Mahout?
>>>>>
>>>>>
>>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak wrote:
>>>>>
>>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>>>>>> processing Job-Specific
>>>>>>
>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>> --input \
>>>>>> hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>>>> --output \
>>>>>> hdfs://nameservice1/recommenditembased/output \
>>>>>> --similarityClassname \
>>>>>> SIMILARITY_LOGLIKELIHOOD \
>>>>>> --numRecommendations \
>>>>>> 500 \
>>>>>> --booleanData \
>>>>>> false \
>>>>>> --maxPrefsPerUser \
>>>>>> 1000 \
>>>>>> --maxSimilaritiesPerItem \
>>>>>> 1000 \
>>>>>> --minPrefsPerUser \
>>>>>> 5 \
>>>>>> --maxPrefsPerUserInItemSimilarity \
>>>>>> 30 \
>>>>>> --threshold \
>>>>>> 1.1 \
>>>>>> --tempDir \
>>>>>> hdfs://nameservice1/recommenditembased/temp \
>>>>>> --outputPathForSimilarityMatrix \
>>>>>> hdfs://nameservice1/recommenditembased/sim_matrix
>>>>>>
>>>>>>
>>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>>>>>
>>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang:
>>>>>>
>>>>>>> Serega,
>>>>>>>
>>>>>>> See the last line on how to pass the outputPathForSimilarityMatrix
>>>>>>> option to the recommenditembased command:
>>>>>>>
>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>> --input visited_items_with_inverted_items \
>>>>>>> --output result \
>>>>>>> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>> --usersFile inverted_items \
>>>>>>> --numRecommendations 500 \
>>>>>>> --booleanData false \
>>>>>>> --maxPrefsPerUser 100 \
>>>>>>> --maxSimilaritiesPerItem 500 \
>>>>>>> --minPrefsPerUser 0 \
>>>>>>> --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>> --threshold 0.91 \
>>>>>>> --tempDir temp \
>>>>>>> --outputPathForSimilarityMatrix similarityMatri \
>>>>>>>
>>>>>>>
>>>>>>> Peng Zhang
>>>>>>>
>>>>>>>
>>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak wrote:
>>>>>>>
>>>>>>>> I've inspected the code, our approach wouldn't work with
>>>>>>>> booleanData=false.
>>>>>>>> We do calculate item similarity in the wrong way...(((
>>>>>>>> Thank you
>>>>>>>> 1. We provide a "fake" user_id and provide --usersFile in order to get
>>>>>>>> recommendations for the "fake" user_id, where user_id is a negative
>>>>>>>> item_id. It worked when we did provide user_id->item_id pairs without
>>>>>>>> preference.
>>>>>>>> 2. Our target is to get item similarities. We tried
>>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
>>>>>>>> it returns bad results compared to RecommenderJob with our "fake"
>>>>>>>> user_id (inverted item_id)
>>>>>>>>
>>>>>>>> 1. I'll try the option you provided.
>>>>>>>> 2. I will remove input with fake user_id and usersFile with these fake
>>>>>>>> ids
>>>>>>>>
>>>>>>>> 3.
>>>>>>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>>>>> I don't understand how to pass the --outputPathForSimilarityMatrix
>>>>>>>> option to RecommenderJob
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang:
>>>>>>>>
>>>>>>>>> Serega,
>>>>>>>>>
>>>>>>>>> I have two comments:
>>>>>>>>> 1. Don't use negative user IDs. Since Mahout uses the user ID as well
>>>>>>>>> as the item ID as the row/column index, you'd better use 0, 1, 2, etc.
>>>>>>>>> as IDs
>>>>>>>>> 2. If you want to get the item similarity information, you can use
>>>>>>>>> --outputPathForSimilarityMatrix in the command
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Peng Zhang
>>>>>>>>> M: +86 186-1658-7856
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak wrote:
>>>>>>>>>
>>>>>>>>>> All bad things happen here:
>>>>>>>>>>
>>>>>>>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>>>>>>>>>> User: oozie
>>>>>>>>>> Process User: oozie
>>>>>>>>>> Group: oozie
>>>>>>>>>> Mapper Class: PartialMultiplyMapper
>>>>>>>>>> Reducer Class: AggregateAndRecommendReducer
>>>>>>>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>>>>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>>>>>>>>>
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Map input records=3312879
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Map output records=3313251
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce input records=3313251
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce output records=0
>>>>>>>>>>
>>>>>>>>>> Why does Mahout return 0 rows? It works when booleanData=true
>>>>>>>>>> (preferences are ignored...?)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak:
>>>>>>>>>>
>>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>>>>>>>> users_file:
>>>>>>>>>>> --inverted_item_id
>>>>>>>>>>> -1
>>>>>>>>>>> -2
>>>>>>>>>>> -3
>>>>>>>>>>> -4
>>>>>>>>>>>
>>>>>>>>>>> users_items_prefs
>>>>>>>>>>> --inverted item_id
>>>>>>>>>>> -1 1 1.0
>>>>>>>>>>> -2 2 1.0
>>>>>>>>>>> -3 3 1.0
>>>>>>>>>>> -4 4 1.0
>>>>>>>>>>> --user_id item_id pref_value
>>>>>>>>>>> 11 1 1.6
>>>>>>>>>>> 11 2 1.6
>>>>>>>>>>> 123 3 2.0
>>>>>>>>>>> 123 4 2.0
>>>>>>>>>>> 333 1 2.0
>>>>>>>>>>> 333 2 1.6
>>>>>>>>>>> --e.t.c.
>>>>>>>>>>>
>>>>>>>>>>> if I set --booleanData true
>>>>>>>>>>> then mahout returns the result.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman:
>>>>>>>>>>>
>>>>>>>>>>>> I'm confused about how you're constructing the user file, and why
>>>>>>>>>>>> there are negated item ids here.
>>>>>>>>>>>>
>>>>>>>>>>>> Can you post some more details please, including Mahout version
>>>>>>>>>>>> and some sample data sets?
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, I'm trying to create item similarity.
>>>>>>>>>>>>> I gather items which users visit during shopping and then create a
>>>>>>>>>>>>> file:
>>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9],
>>>>>>>>>>>>> depends on user action type and data source)
>>>>>>>>>>>>> UNION
>>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>>>>>>>>>>
>>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
>>>>>>>>>>>>>
>>>>>>>>>>>>> The idea is to get item similarity. If any user visits an item
>>>>>>>>>>>>> named "A", I want to show him items "B", "c", "xxx" using
>>>>>>>>>>>>> preferences of other users.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are my settings:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>>>>>> --input visited_items_with_inverted_items \
>>>>>>>>>>>>> --output result \
>>>>>>>>>>>>> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>>>>>>>> --usersFile inverted_items \
>>>>>>>>>>>>> --numRecommendations 500 \
>>>>>>>>>>>>> --booleanData false \
>>>>>>>>>>>>> --maxPrefsPerUser 100 \
>>>>>>>>>>>>> --maxSimilaritiesPerItem 500 \
>>>>>>>>>>>>> --minPrefsPerUser 0 \
>>>>>>>>>>>>> --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>>>>>> --threshold 0.91 \
>>>>>>>>>>>>> --tempDir temp \
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some counters... I don't get what they mean....
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: USERS=7528530
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_NEGLECTED=1,798,738
>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_USED=12,429,693
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: ROWS=3312879
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: COOCCURRENCES=35882374
>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map input records=3312879
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map output records=17570268
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce input records=5221907
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce output records=3312879
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map input records=7528530
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map output records=3313251
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce input records=3313251
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce output records=3313251
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map input records=6626130
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map output records=6626130
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce input records=6626130
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce output records=3312879
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map input records=3312879
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map output records=3313251
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce input records=3313251
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce output records=0
>>>>>>>>>>>>> --------
>>>>>>>>>>>>>
>>>>>>>>>>>>> why 0???
