I'm preparing the data with Apache Hive: user_id:long, item_id:long, preference in [1.0, 2.0].
I don't understand "For most Mahout jobs you have to prepare your data to have Mahout IDs". What are "Mahout IDs"? I tried to follow the docs on the Mahout site, but I didn't find anything there about Mahout IDs. Please explain.
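
For reference, here is a minimal sketch of the ID translation described in the quoted reply: each new application-specific user or item ID gets the next integer starting from 0, and a two-way dictionary translates back after the job has run. The class and function names (IdDictionary, index_of, app_id_of, encode) are hypothetical and not part of Mahout. A plain dict plus a list stands in for the Guava HashBiMap mentioned in the reply, and the sample rows follow the shape of the users_items_prefs example further down the thread.

```python
# Hypothetical sketch: none of these names come from Mahout itself.
class IdDictionary:
    """Two-way map between application IDs and dense 0-based indexes."""

    def __init__(self):
        self._to_index = {}   # application ID -> dense index ("Mahout ID")
        self._to_app_id = []  # dense index -> application ID (position is the index)

    def index_of(self, app_id):
        """Return the index for app_id, assigning the next free one if it is new."""
        index = self._to_index.get(app_id)
        if index is None:
            index = len(self._to_app_id)
            self._to_index[app_id] = index
            self._to_app_id.append(app_id)
        return index

    def app_id_of(self, index):
        """Translate a dense index back into the application ID."""
        return self._to_app_id[index]

    def __len__(self):
        return len(self._to_app_id)


def encode(rows, users, items):
    """Translate (user_id, item_id, pref) rows into dense-index rows."""
    return [(users.index_of(u), items.index_of(i), p) for u, i, p in rows]


# Sample rows: (user_id, item_id, pref_value).
rows = [(11, 1, 1.6), (11, 2, 1.6), (123, 3, 2.0),
        (123, 4, 2.0), (333, 1, 2.0), (333, 2, 1.6)]

users, items = IdDictionary(), IdDictionary()
encoded = encode(rows, users, items)  # this is what would be fed to the job

# After the job, the same dictionaries translate results back.
decoded = [(users.app_id_of(u), items.app_id_of(i), p) for u, i, p in encoded]
```

Users and items get separate dictionaries because, per the reply, they correspond to the rows and columns of the matrix respectively. In a Hive pipeline the same mapping would have to be kept as a lookup table, so that the job output can be joined back to the original IDs.
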

2014-07-25 22:39 GMT+04:00 Pat Ferrel <[email protected]>:

> Sorry I haven’t read this thread carefully but it looks like you may be
> using the wrong IDs.
>
> For most Mahout jobs you have to prepare your data to have Mahout IDs. You
> do this by looking at each datum and as you see a new unique application
> specific user or item ID you give it a Mahout ID starting from 0. So Mahout
> ID can be thought of as row and column numbers in a matrix. The Mahout IDs
> for rows will be 0 thru # of rows-1, same for columns.
>
> This always requires that you translate into Mahout IDs then after the job
> is run translate back into your application IDs. You need a bi-directional
> dictionary of some type. I use a HashBiMap from Guava.
>
> Also I’d avoid the threshold for now. If you get that wrong it will mess
> things up badly and is very hard to tune. It’s there for completeness but I
> never use it.
>
>
> On Jul 25, 2014, at 12:55 AM, Serega Sheypak <[email protected]>
> wrote:
>
> Hi, nothing helps...
> I do use mahout 0.9 compiled for CDH 4.7
> I do provide only positive values
> I do use itemsimilarityJob and do get 2000 similarities for 1400 unique
> items
> Input data is:
> 16*10^6 preferences
> 4*10^6 users
> 0.6*10^ items
> I do use Pearson correlation and preference values are: 1.0 and 2.0
>
>
> 2014-07-22 9:32 GMT+04:00 Serega Sheypak <[email protected]>:
>
> > Ok, I have recompiled mahout 0.9 for CDH 4.7. I'll try this evening.
> > Right now I don't see how it can help me. As far as I know the stuff I try
> > to use is pretty old and stable.
> > Looks like I do apply it in a wrong way.
> >
> > There is an option for recommenditembased named "--threshold". I do
> > provide data for recommenditembased with preference values in range
> > [1.1..2.0].
> > I set --threshold to 1.2
> > --threshold is absolute and can be from [1.1 .. 2+] or it's relative and
> > can be [0.0 .. 0.99999]?
> >
> >
> > 2014-07-22 3:54 GMT+04:00 Ted Dunning <[email protected]>:
> >
> > That version is no longer supported. You should upgrade to 0.9
> >>
> >>
> >> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <[email protected]>
> >> wrote:
> >>
> >>> 0.7-cdh4.7.0
> >>> Anyway, recommenditembased does produce these catalogs:
> >>>
> >>> /recommenditembased/temp/maxValues.bin
> >>> /recommenditembased/temp/norms.bin
> >>> /recommenditembased/temp/numNonZeroEntries.bin
> >>> /recommenditembased/temp/pairwiseSimilarity
> >>> /recommenditembased/temp/partialMultiply
> >>> /recommenditembased/temp/prePartialMultiply1
> >>> /recommenditembased/temp/prePartialMultiply2
> >>> /recommenditembased/temp/preparePreferenceMatrix
> >>> /recommenditembased/temp/similarityMatrix
> >>> /recommenditembased/temp/weights
> >>>
> >>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
> >>> need. Right now I try to read it using
> >>>
> >>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
> >>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
> >>> '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
> >>> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >>> ) as (intId: int, vector:tuple(cardinality:int,
> >>> entries:bag{t:tuple(some_id:long, some_value:double)}));
> >>>
> >>>
> >>> Looks like the vector is empty... Or i do something wrong.
> >>>
> >>>
> >>>
> >>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <[email protected]>:
> >>>
> >>>> Which version of Mahout?
> >>>>
> >>>>
> >>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing
> >>>>> Job-Specific
> >>>>>
> >>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> >>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> >>>>> sudo -u oozie mahout recommenditembased \
> >>>>> --input \
> >>>>> hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
> >>>>> --output \
> >>>>> hdfs://nameservice1/recommenditembased/output \
> >>>>> --similarityClassname \
> >>>>> SIMILARITY_LOGLIKELIHOOD \
> >>>>> --numRecommendations \
> >>>>> 500 \
> >>>>> --booleanData \
> >>>>> false \
> >>>>> --maxPrefsPerUser \
> >>>>> 1000 \
> >>>>> --maxSimilaritiesPerItem \
> >>>>> 1000 \
> >>>>> --minPrefsPerUser \
> >>>>> 5 \
> >>>>> --maxPrefsPerUserInItemSimilarity \
> >>>>> 30 \
> >>>>> --threshold \
> >>>>> 1.1 \
> >>>>> --tempDir \
> >>>>> hdfs://nameservice1/recommenditembased/temp \
> >>>>> --outputPathForSimilarityMatrix \
> >>>>> hdfs://nameservice1/recommenditembased/sim_matrix
> >>>>>
> >>>>>
> >>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
> >>>>>
> >>>>>
> >>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <[email protected]>:
> >>>>>
> >>>>>> Serega,
> >>>>>>
> >>>>>> See the last line on how to pass outputPathForSimilarityMatrix options to
> >>>>>> the recommenditembased command:
> >>>>>>
> >>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>> --input visited_items_with_inverted_items \
> >>>>>> --output result \
> >>>>>> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>> --usersFile inverted_items \
> >>>>>> --numRecommendations 500 \
> >>>>>> --booleanData false \
> >>>>>> --maxPrefsPerUser 100 \
> >>>>>> --maxSimilaritiesPerItem 500 \
> >>>>>> --minPrefsPerUser 0\
> >>>>>> --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>> --threshold 0.91 \
> >>>>>> --tempDir temp \
> >>>>>> --outputPathForSimilarityMatrix similarityMatri \
> >>>>>>
> >>>>>>
> >>>>>> Peng Zhang
> >>>>>> [email protected]
> >>>>>>
> >>>>>>
> >>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I've inspected the code, our approach wouldn't work with booleanData=false.
> >>>>>>> We do calculate item similarity in the wrong way...(((
> >>>>>>> Thank you
> >>>>>>> 1. We provide "fake" user_id and provide --usersFile in order to get
> >>>>>>> recommendations for "fake" user_id, where user_id is a negative item_id. It
> >>>>>>> worked when we did provide user_id->item_id pairs without preference.
> >>>>>>> 2. Our target is to get item similarities. We tried
> >>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
> >>>>>>> returns bad result comparing to RecommenderJob with our "fake" user_id
> >>>>>>> (inverted item_id)
> >>>>>>>
> >>>>>>> 1. I'll try the option you provided.
> >>>>>>> 2. I will remove input with fake user_id and usersFile with these fake ids
> >>>>>>>
> >>>>>>> 3.
> >>>>>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> >>>>>>> I don't understand how to pass --outputPathForSimilarityMatrix option to
> >>>>>>> RecommenderJob
> >>>>>>>
> >>>>>>>
> >>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <[email protected]>:
> >>>>>>>
> >>>>>>>> Serega,
> >>>>>>>>
> >>>>>>>> I have two comments:
> >>>>>>>> 1. Don’t use negative user ids. Since Mahout uses user id as well as item
> >>>>>>>> id as the row/column index, you’d better use 0, 1, 2, etc as ids
> >>>>>>>> 2. If you want to get the item similarity information, you can use
> >>>>>>>> --outputPathForSimilarityMatrix in the command
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Peng Zhang
> >>>>>>>> M: +86 186-1658-7856
> >>>>>>>> [email protected]
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> All bad things happen here:
> >>>>>>>>>
> >>>>>>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
> >>>>>>>>> User: oozie
> >>>>>>>>> Process User: oozie
> >>>>>>>>> Group: oozie
> >>>>>>>>> Mapper Class: PartialMultiplyMapper
> >>>>>>>>> Reducer Class: AggregateAndRecommendReducer
> >>>>>>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
> >>>>>>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
> >>>>>>>>>
> >>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Map input records=3312879
> >>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Map output records=3313251
> >>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce input records=3313251
> >>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce output records=0
> >>>>>>>>>
> >>>>>>>>> Why does Mahout return 0 rows? It works when booleanData=true (preferences
> >>>>>>>>> are ignored...?)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <[email protected]>:
> >>>>>>>>>
> >>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> >>>>>>>>>> users_file:
> >>>>>>>>>> --inverted_item_id
> >>>>>>>>>> -1
> >>>>>>>>>> -2
> >>>>>>>>>> -3
> >>>>>>>>>> -4
> >>>>>>>>>>
> >>>>>>>>>> users_items_prefs
> >>>>>>>>>> --inverted item_id
> >>>>>>>>>> -1 1 1.0
> >>>>>>>>>> -2 2 1.0
> >>>>>>>>>> -3 3 1.0
> >>>>>>>>>> -4 4 1.0
> >>>>>>>>>> --user_id item_id pref_value
> >>>>>>>>>> 11 1 1.6
> >>>>>>>>>> 11 2 1.6
> >>>>>>>>>> 123 3 2.0
> >>>>>>>>>> 123 4 2.0
> >>>>>>>>>> 333 1 2.0
> >>>>>>>>>> 333 2 1.6
> >>>>>>>>>> --e.t.c.
> >>>>>>>>>>
> >>>>>>>>>> if I set --booleanData true
> >>>>>>>>>> then mahout returns the result.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <[email protected]>:
> >>>>>>>>>>
> >>>>>>>>>>> I'm confused about how you're constructing the user file, and why there
> >>>>>>>>>>> are negated item ids here.
> >>>>>>>>>>>
> >>>>>>>>>>> Can you post some more details please, including Mahout version and some
> >>>>>>>>>>> sample data sets?
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <[email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi, I'm trying to create item similarity.
> >>>>>>>>>>>> I gather items which users visit during shopping and then create a file:
> >>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
> >>>>>>>>>>>> user action type and data source)
> >>>>>>>>>>>> UNION
> >>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
> >>>>>>>>>>>>
> >>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
> >>>>>>>>>>>>
> >>>>>>>>>>>> The idea is to get item similarity. If any user visits item named "A", I want
> >>>>>>>>>>>> to show him items "B", "c", "xxx" using preferences of other users.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here are my settings:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>>>>>> --input visited_items_with_inverted_items \
> >>>>>>>>>>>> --output result \
> >>>>>>>>>>>> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>>>>>> --usersFile inverted_items \
> >>>>>>>>>>>> --numRecommendations 500 \
> >>>>>>>>>>>> --booleanData false \
> >>>>>>>>>>>> --maxPrefsPerUser 100 \
> >>>>>>>>>>>> --maxSimilaritiesPerItem 500 \
> >>>>>>>>>>>> --minPrefsPerUser 0\
> >>>>>>>>>>>> --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>>>>>>> --threshold 0.91 \
> >>>>>>>>>>>> --tempDir temp \
> >>>>>>>>>>>>
> >>>>>>>>>>>> Some counters... I don't get what they mean....
> >>>>>>>>>>>>
> >>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> >>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: USERS=7528530
> >>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> >>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_NEGLECTED=1,798,738
> >>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_USED=12,429,693
> >>>>>>>>>>>>
> >>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: ROWS=3312879
> >>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: COOCCURRENCES=35882374
> >>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
> >>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map input records=3312879
> >>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map output records=17570268
> >>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce input records=5221907
> >>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>>>>>>
> >>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
> >>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
> >>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map input records=7528530
> >>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map output records=3313251
> >>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce input records=3313251
> >>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce output records=3313251
> >>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map input records=6626130
> >>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map output records=6626130
> >>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce input records=6626130
> >>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>>>>>>
> >>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map input records=3312879
> >>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map output records=3313251
> >>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce input records=3313251
> >>>>>>>>>>>>
> >>>>>>>>>>>> --------
> >>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce output records=0
> >>>>>>>>>>>> --------
> >>>>>>>>>>>>
> >>>>>>>>>>>> why 0???
