Thank you for your input.
2014-07-21 12:00 GMT+04:00 Peng Zhang <[email protected]>:

> My personal comments:
> 1. Data cleansing. One beautiful characteristic of Mahout’s CF
> recommendation is the simplicity of the input data, often just three
> columns (user, item, preference). If any value is missing, just don’t put
> the record in the input file. Therefore I don’t see any need for data
> cleaning, provided the application has recorded user-item-preference
> correctly and you have translated user-id and item-id properly.
> 2. Loglikelihood often performs better than PearsonCorrelation in
> Mahout’s Collaborative Filtering. The former is focused on discrete
> values and the latter on continuous values. Refer to Ted’s popular post
> "Surprise and Coincidence" about the former.
>
> Peng Zhang
> [email protected]
>
> On Jul 21, 2014, at 3:37 PM, Serega Sheypak <[email protected]>
> wrote:
>
> > Thanks! I'll report this evening.
> >
> > Are there any articles about data preparation for mahout item
> > recommendation? There are many books but most of them are copy-paste of
> > javadoc and guides from the mahout site.
> > I'm -1 at math, my challenges are:
> >
> > 1. approaches for data cleaning, do I have to apply dead-simple
> > statistical rules?
> > "The empirical rule also states that approximately 95 percent of the data
> > values will fall within two standard deviations from the mean."
> > So if my user visits are described by a normal distribution, does it
> > make sense? The idea is to put away all noise.
> >
> > 2. similarityClassname - don't have any intuition here... I see that
> > people use SIMILARITY_LOGLIKELIHOOD and PEARSON
> >
> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <[email protected]>:
> >
> >> Serega,
> >>
> >> See the last line on how to pass the outputPathForSimilarityMatrix
> >> option to the recommenditembased command:
> >>
> >> sudo -u oozie mahout recommenditembased \
> >> --input visited_items_with_inverted_items \
> >> --output result \
> >> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >> --usersFile inverted_items \
> >> --numRecommendations 500 \
> >> --booleanData false \
> >> --maxPrefsPerUser 100 \
> >> --maxSimilaritiesPerItem 500 \
> >> --minPrefsPerUser 0 \
> >> --maxPrefsPerUserInItemSimilarity 30 \
> >> --threshold 0.91 \
> >> --tempDir temp \
> >> --outputPathForSimilarityMatrix similarityMatri
> >>
> >> Peng Zhang
> >> [email protected]
> >>
> >> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <[email protected]>
> >> wrote:
> >>
> >>> I've inspected the code, our approach wouldn't work with
> >>> booleanData=false.
> >>> We do calculate item similarity in the wrong way...(((
> >>> Thank you
> >>> 1. We provide a "fake" user_id and provide --usersFile in order to get
> >>> recommendations for the "fake" user_id, where user_id is a negative
> >>> item_id. It worked when we did provide user_id->item_id pairs without
> >>> preference.
> >>> 2. Our target is to get item similarities. We tried
> >>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
> >>> it returns a bad result compared to RecommenderJob with our "fake"
> >>> user_id (inverted item_id)
> >>>
> >>> 1. I'll try the option you provided.
> >>> 2. I will remove input with fake user_id and usersFile with these fake
> >>> ids
> >>> 3.
> >>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> >>> I don't understand how to pass the --outputPathForSimilarityMatrix
> >>> option to RecommenderJob
> >>>
> >>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <[email protected]>:
> >>>
> >>>> Serega,
> >>>>
> >>>> I have two comments:
> >>>> 1. Don’t use negative user ids. Since Mahout uses user id as well as
> >>>> item id as the row/column index, you’d better use 0, 1, 2, etc. as ids
> >>>> 2. If you want to get the item similarity information, you can use
> >>>> --outputPathForSimilarityMatrix in the command
> >>>>
> >>>> Regards,
> >>>> Peng Zhang
> >>>> M: +86 186-1658-7856
> >>>> [email protected]
> >>>>
> >>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> All bad things happen here:
> >>>>>
> >>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
> >>>>> User: oozie
> >>>>> Process User: oozie
> >>>>> Group: oozie
> >>>>> Mapper Class: PartialMultiplyMapper
> >>>>> Reducer Class: AggregateAndRecommendReducer
> >>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
> >>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
> >>>>>
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Map input records=3312879
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Map output records=3313251
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce input records=3313251
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce output records=0
> >>>>>
> >>>>> Why does mahout return 0 rows? It works when booleanData=true
> >>>>> (preferences are ignored...?)
> >>>>>
> >>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <[email protected]>:
> >>>>>
> >>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> >>>>>> users_file:
> >>>>>> --inverted_item_id
> >>>>>> -1
> >>>>>> -2
> >>>>>> -3
> >>>>>> -4
> >>>>>>
> >>>>>> users_items_prefs
> >>>>>> --inverted item_id
> >>>>>> -1 1 1.0
> >>>>>> -2 2 1.0
> >>>>>> -3 3 1.0
> >>>>>> -4 4 1.0
> >>>>>> --user_id item_id pref_value
> >>>>>> 11 1 1.6
> >>>>>> 11 2 1.6
> >>>>>> 123 3 2.0
> >>>>>> 123 4 2.0
> >>>>>> 333 1 2.0
> >>>>>> 333 2 1.6
> >>>>>> --etc.
> >>>>>>
> >>>>>> if I set --booleanData true
> >>>>>> then mahout returns the result.
> >>>>>>
> >>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <[email protected]>:
> >>>>>>
> >>>>>>> I'm confused about how you're constructing the user file, and why
> >>>>>>> there are negated item ids here.
> >>>>>>>
> >>>>>>> Can you post some more details please, including Mahout version and
> >>>>>>> some sample data sets?
> >>>>>>>
> >>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi, I'm trying to create item similarity.
> >>>>>>>> I gather items which users visit during shopping and then create a
> >>>>>>>> file:
> >>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9],
> >>>>>>>> depending on user action type and data source)
> >>>>>>>> UNION
> >>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
> >>>>>>>>
> >>>>>>>> and I do provide a userFile, where user_id = -item_id
> >>>>>>>>
> >>>>>>>> The idea is to get item similarity. If any user visits an item named
> >>>>>>>> "A", I want to show him items "B", "c", "xxx" using preferences of
> >>>>>>>> other users.
> >>>>>>>>
> >>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> >>>>>>>>
> >>>>>>>> Here are my settings:
> >>>>>>>>
> >>>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>> --input visited_items_with_inverted_items \
> >>>>>>>> --output result \
> >>>>>>>> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>> --usersFile inverted_items \
> >>>>>>>> --numRecommendations 500 \
> >>>>>>>> --booleanData false \
> >>>>>>>> --maxPrefsPerUser 100 \
> >>>>>>>> --maxSimilaritiesPerItem 500 \
> >>>>>>>> --minPrefsPerUser 0 \
> >>>>>>>> --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>>> --threshold 0.91 \
> >>>>>>>> --tempDir temp
> >>>>>>>>
> >>>>>>>> Some counters... I don't get what they mean....
> >>>>>>>>
> >>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> >>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: USERS=7528530
> >>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> >>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_NEGLECTED=1,798,738
> >>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_USED=12,429,693
> >>>>>>>>
> >>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: ROWS=3312879
> >>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: COOCCURRENCES=35882374
> >>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
> >>>>>>>>
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map input records=3312879
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map output records=17570268
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce input records=5221907
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>>
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>>
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map input records=7528530
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map output records=3313251
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce input records=3313251
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce output records=3313251
> >>>>>>>>
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map input records=6626130
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map output records=6626130
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce input records=6626130
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce output records=3312879
> >>>>>>>>
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map input records=3312879
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map output records=3313251
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce input records=3313251
> >>>>>>>>
> >>>>>>>> --------
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce output records=0
> >>>>>>>> --------
> >>>>>>>>
> >>>>>>>> why 0???
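
For the Loglikelihood vs. Pearson question raised in the thread, a small sketch of a log-likelihood ratio score may help build intuition. It uses the entropy form of the test described in the "Surprise and Coincidence" post that Peng mentions. The function names (x_log_x, entropy, llr) and the sample counts are illustrative assumptions, not Mahout's API, and this is not a claim about how any particular Mahout version computes its result. The point of the sketch is that the score is computed from co-occurrence counts only: k11 = users who visited both items, k12 and k21 = users who visited only one of them, k22 = users who visited neither. No preference weight such as 1.0 / 1.6 / 1.9 enters the calculation.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of the counts: N * H(counts / N)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 co-occurrence table.
    # Equals 2 * N * (mutual information between rows and columns).
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    # Hypothetical counts: items seen together no more often than chance
    print(llr(10, 10, 10, 10))    # close to 0
    # Hypothetical counts: items seen together far more often than chance
    print(llr(100, 5, 5, 1000))   # clearly positive
```

A score near zero means the two items co-occur about as often as chance would predict; a large score means the co-occurrence is surprising. Pearson correlation, by contrast, is computed from the preference values themselves.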
