I've parsed it with Java, and the matrix is empty. Why?
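
One thing worth checking, offered as a guess rather than a confirmed diagnosis: the command quoted further down passes --threshold 1.1 together with SIMILARITY_LOGLIKELIHOOD. As far as I recall, Mahout's log-likelihood similarity maps the raw ratio to 1 - 1/(1 + LLR), which stays below 1, so a threshold above 1 would drop every item pair and leave the similarity matrix empty. The Python sketch below is not Mahout code. The function names are made up for illustration, and both the mapping and the "keep if >= threshold" rule are assumptions to verify against the source of the installed version.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table.

    k11: users with both items, k12: only item A,
    k21: only item B, k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(k11, k12, k21, k22):
    """ASSUMPTION: the mapping Mahout applies to the raw ratio."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


def passes_threshold(similarity, threshold):
    """ASSUMPTION: a pair is kept when similarity >= threshold."""
    return similarity >= threshold


if __name__ == "__main__":
    # A perfectly correlated pair of items (hypothetical counts).
    sim = llr_similarity(1000, 0, 0, 1000)
    print("similarity:", sim)
    print("kept at 0.91:", passes_threshold(sim, 0.91))
    print("kept at 1.1:", passes_threshold(sim, 1.1))
```

If that assumption holds, rerunning with a threshold below 1 (or without --threshold at all) should show whether the similarity matrix is still empty.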

2014-07-21 22:41 GMT+04:00 Serega Sheypak <[email protected]>:

> 0.7-cdh4.7.0
> Anyway, recommenditembased does produce these catalogs:
>
> /recommenditembased/temp/maxValues.bin
> /recommenditembased/temp/norms.bin
> /recommenditembased/temp/numNonZeroEntries.bin
> /recommenditembased/temp/pairwiseSimilarity
> /recommenditembased/temp/partialMultiply
> /recommenditembased/temp/prePartialMultiply1
> /recommenditembased/temp/prePartialMultiply2
> /recommenditembased/temp/preparePreferenceMatrix
> /recommenditembased/temp/similarityMatrix
> /recommenditembased/temp/weights
>
> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
> I need. Right now I try to read it using
>
> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
> com.twitter.elephantbird.pig.load.SequenceFileLoader(
> '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> ) as (intId: int, vector:tuple(cardinality:int,
> entries:bag{t:tuple(some_id:long, some_value:double)}));
>
>
> Looks like the vector is empty... Or I do something wrong.
>
>
>
> 2014-07-21 22:09 GMT+04:00 Ted Dunning <[email protected]>:
>
> > Which version of Mahout?
>>
>>
>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <[email protected]>
>> wrote:
>>
>> > Hi, I've tried it and got: Unexpected --outputPathForSimilarityMatrix while
>> > processing Job-Specific
>> >
>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>> > sudo -u oozie mahout recommenditembased \
>> > --input \
>> > hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>> > --output \
>> > hdfs://nameservice1/recommenditembased/output \
>> > --similarityClassname \
>> > SIMILARITY_LOGLIKELIHOOD \
>> > --numRecommendations \
>> > 500 \
>> > --booleanData \
>> > false \
>> > --maxPrefsPerUser \
>> > 1000 \
>> > --maxSimilaritiesPerItem \
>> > 1000 \
>> > --minPrefsPerUser \
>> > 5 \
>> > --maxPrefsPerUserInItemSimilarity \
>> > 30 \
>> > --threshold \
>> > 1.1 \
>> > --tempDir \
>> > hdfs://nameservice1/recommenditembased/temp \
>> > --outputPathForSimilarityMatrix \
>> > hdfs://nameservice1/recommenditembased/sim_matrix
>> >
>> >
>> > I'm on Cloudera CDH 4.7; it looks like this option is not supported.
>> >
>> >
>> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <[email protected]>:
>> >
>> > > Serega,
>> > >
>> > > See the last line on how to pass the outputPathForSimilarityMatrix option to
>> > > the recommenditembased command:
>> > >
>> > > sudo -u oozie mahout recommenditembased \
>> > > --input visited_items_with_inverted_items \
>> > > --output result \
>> > > --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>> > > --usersFile inverted_items \
>> > > --numRecommendations 500 \
>> > > --booleanData false \
>> > > --maxPrefsPerUser 100 \
>> > > --maxSimilaritiesPerItem 500 \
>> > > --minPrefsPerUser 0 \
>> > > --maxPrefsPerUserInItemSimilarity 30 \
>> > > --threshold 0.91 \
>> > > --tempDir temp \
>> > > --outputPathForSimilarityMatrix similarityMatri
>> > >
>> > >
>> > > Peng Zhang
>> > > [email protected]
>> > >
>> > >
>> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <[email protected]>
>> > > wrote:
>> > >
>> > > > I've inspected the code, our approach wouldn't work with booleanData=false.
>> > > > We do calculate item similarity in the wrong way...(((
>> > > > Thank you
>> > > > 1. We provide "fake" user_id and provide --usersFile in order to get
>> > > > recommendations for "fake" user_id, where user_id is a negative item_id. It
>> > > > worked when we did provide user_id->item_id pairs without preference.
>> > > > 2. Our target is to get item similarities. We tried
>> > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
>> > > > returns bad results compared to RecommenderJob with our "fake" user_id
>> > > > (inverted item_id)
>> > > >
>> > > > 1. I'll try the option you provided.
>> > > > 2. I will remove input with fake user_id and usersFile with these fake ids
>> > > >
>> > > > 3.
>> > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>> > > > I don't understand how to pass the --outputPathForSimilarityMatrix option to
>> > > > RecommenderJob
>> > > >
>> > > >
>> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <[email protected]>:
>> > > >
>> > > >> Serega,
>> > > >>
>> > > >> I have two comments:
>> > > >> 1. Don’t use negative user ids. Since Mahout uses user id as well as item
>> > > >> id as the row/column index, you’d better use 0, 1, 2, etc as ids
>> > > >> 2. If you want to get the item similarity information, you can use
>> > > >> --outputPathForSimilarityMatrix in the command
>> > > >>
>> > > >> Regards,
>> > > >> Peng Zhang
>> > > >> M: +86 186-1658-7856
>> > > >> [email protected]
>> > > >>
>> > > >>
>> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <[email protected]>
>> > > >> wrote:
>> > > >>
>> > > >>> All bad things happen here:
>> > > >>>
>> > > >>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>> > > >>> User: oozie
>> > > >>> Process User: oozie
>> > > >>> Group: oozie
>> > > >>> Mapper Class: PartialMultiplyMapper
>> > > >>> Reducer Class: AggregateAndRecommendReducer
>> > > >>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>> > > >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>> > > >>>
>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map input records=3312879
>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map output records=3313251
>> > > >>>
>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce input records=3313251
>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce output records=0
>> > > >>>
>> > > >>> Why does Mahout return 0 rows? It works when booleanData=true (preferences
>> > > >>> are ignored...?)
>> > > >>>
>> > > >>>
>> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <[email protected]>:
>> > > >>>
>> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>> > > >>>> users_file:
>> > > >>>> --inverted_item_id
>> > > >>>> -1
>> > > >>>> -2
>> > > >>>> -3
>> > > >>>> -4
>> > > >>>>
>> > > >>>> users_items_prefs
>> > > >>>> --inverted item_id
>> > > >>>> -1 1 1.0
>> > > >>>> -2 2 1.0
>> > > >>>> -3 3 1.0
>> > > >>>> -4 4 1.0
>> > > >>>> --user_id item_id pref_value
>> > > >>>> 11 1 1.6
>> > > >>>> 11 2 1.6
>> > > >>>> 123 3 2.0
>> > > >>>> 123 4 2.0
>> > > >>>> 333 1 2.0
>> > > >>>> 333 2 1.6
>> > > >>>> --etc.
>> > > >>>>
>> > > >>>> if I set --booleanData true
>> > > >>>> then Mahout returns the result.
>> > > >>>>
>> > > >>>>
>> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <[email protected]>:
>> > > >>>>
>> > > >>>>> I'm confused about how you're constructing the user file, and why there
>> > > >>>>> are negated item ids here.
>> > > >>>>>
>> > > >>>>> Can you post some more details please, including Mahout version and some
>> > > >>>>> sample data sets?
>> > > >>>>>
>> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <[email protected]>
>> > > >>>>>> wrote:
>> > > >>>>>>
>> > > >>>>>> Hi, I'm trying to create item similarity.
>> > > >>>>>> I gather items which users visit during shopping and then create a file:
>> > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
>> > > >>>>>> user action type and data source)
>> > > >>>>>> UNION
>> > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
>> > > >>>>>>
>> > > >>>>>> and I do provide a userFile, where user_id = -item_id
>> > > >>>>>>
>> > > >>>>>> The idea is to get item similarity. If any user visits item named "A", I want
>> > > >>>>>> to show him items "B", "c", "xxx" using preferences of other users.
>> > > >>>>>>
>> > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>> > > >>>>>>
>> > > >>>>>> Here are my settings:
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>> sudo -u oozie mahout recommenditembased \
>> > > >>>>>> --input visited_items_with_inverted_items \
>> > > >>>>>> --output result \
>> > > >>>>>> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>> > > >>>>>> --usersFile inverted_items \
>> > > >>>>>> --numRecommendations 500 \
>> > > >>>>>> --booleanData false \
>> > > >>>>>> --maxPrefsPerUser 100 \
>> > > >>>>>> --maxSimilaritiesPerItem 500 \
>> > > >>>>>> --minPrefsPerUser 0 \
>> > > >>>>>> --maxPrefsPerUserInItemSimilarity 30 \
>> > > >>>>>> --threshold 0.91 \
>> > > >>>>>> --tempDir temp
>> > > >>>>>>
>> > > >>>>>> Some counters... I don't get what they mean....
>> > > >>>>>>
>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: USERS=7528530
>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_NEGLECTED=1,798,738
>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_USED=12,429,693
>> > > >>>>>>
>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: ROWS=3312879
>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: COOCCURRENCES=35882374
>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map input records=3312879
>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map output records=17570268
>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce input records=5221907
>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce output records=3312879
>> > > >>>>>>
>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map input records=7528530
>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map output records=3313251
>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce input records=3313251
>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce output records=3313251
>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map input records=6626130
>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map output records=6626130
>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce input records=6626130
>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce output records=3312879
>> > > >>>>>>
>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map input records=3312879
>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map output records=3313251
>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce input records=3313251
>> > > >>>>>>
>> > > >>>>>> --------
>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce output records=0
>> > > >>>>>> --------
>> > > >>>>>>
>> > > >>>>>> why 0???
