The code snippet:
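import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.SequenceFile
import org.apache.mahout.math.VectorWritable

@Test //(enabled = false)
void testReadAll() {
    (0..5).each {
        // Double quotes are required for Groovy to interpolate $it.
        def pathToFile = new Path("matrixSim/part-r-0000${it}")
        println pathToFile
        def reader = new SequenceFile.Reader(new Configuration(),
                SequenceFile.Reader.file(pathToFile))
        try {
            IntWritable key = new IntWritable()
            VectorWritable value = new VectorWritable()
            // Each record is a (row index, vector) pair.
            while (reader.next(key, value)) {
                def itr = value.get().iterateNonZero()
                while (itr.hasNext()) {
                    println itr.next()
                }
            }
        } finally {
            // Close the reader even if reading fails.
            reader.close()
        }
    }
}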
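
One thing that may be worth checking for the empty matrix is the --threshold 1.1 in the command quoted below. The sketch that follows assumes (from memory of the Mahout source, not verified against the CDH 4.7 build) that SIMILARITY_LOGLIKELIHOOD reports 1 - 1/(1 + LLR), which stays below 1.0, and that pairs are kept only when the similarity is at least the threshold. The class and method names are hypothetical, made up for this illustration.

```java
// Hypothetical helper, not part of Mahout. It only illustrates the assumed
// mapping from a raw log-likelihood ratio (LLR) to the reported similarity.
final class LlrThresholdSketch {

    // Assumption: similarity = 1 - 1 / (1 + LLR), so it lies in [0, 1) for LLR >= 0.
    static double llrToSimilarity(double llr) {
        return 1.0 - 1.0 / (1.0 + llr);
    }

    // Assumption: a pair is kept only if its similarity is >= the threshold.
    static boolean survivesThreshold(double llr, double threshold) {
        return llrToSimilarity(llr) >= threshold;
    }

    public static void main(String[] args) {
        double[] sampleLlrs = {0.0, 1.0, 10.0, 1000.0, 1e9};
        double[] thresholds = {0.91, 1.1};
        for (double threshold : thresholds) {
            int kept = 0;
            for (double llr : sampleLlrs) {
                if (survivesThreshold(llr, threshold)) {
                    kept++;
                }
            }
            System.out.println("threshold=" + threshold
                    + " kept=" + kept + " of " + sampleLlrs.length);
        }
    }
}
```

If that assumption holds, no pair can reach 1.1 however strong the co-occurrence is, so a threshold below 1.0 (such as the 0.91 used in the earlier run) would be the thing to try first.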
2014-07-21 23:46 GMT+04:00 Serega Sheypak <[email protected]>:
> I've parsed it via Java; the matrix is empty. Why?
>
>
> 2014-07-21 22:41 GMT+04:00 Serega Sheypak <[email protected]>:
>
>> 0.7-cdh4.7.0
>> Anyway, recommenditembased does produce these catalogs:
>>
>> /recommenditembased/temp/maxValues.bin
>> /recommenditembased/temp/norms.bin
>> /recommenditembased/temp/numNonZeroEntries.bin
>> /recommenditembased/temp/pairwiseSimilarity
>> /recommenditembased/temp/partialMultiply
>> /recommenditembased/temp/prePartialMultiply1
>> /recommenditembased/temp/prePartialMultiply2
>> /recommenditembased/temp/preparePreferenceMatrix
>> /recommenditembased/temp/similarityMatrix
>> /recommenditembased/temp/weights
>>
>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
>> I need. Right now I am trying to read it using
>>
>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>> '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> ) as (intId: int, vector:tuple(cardinality:int,
>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>
>>
>> Looks like the vector is empty... Or I am doing something wrong.
>>
>>
>>
>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <[email protected]>:
>>
>>> Which version of Mahout?
>>>
>>>
>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>>> [email protected]>
>>> wrote:
>>>
>>> > Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>>> > processing Job-Specific
>>> >
>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>> > sudo -u oozie mahout recommenditembased \
>>> > --input hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>> > --output hdfs://nameservice1/recommenditembased/output \
>>> > --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>> > --numRecommendations 500 \
>>> > --booleanData false \
>>> > --maxPrefsPerUser 1000 \
>>> > --maxSimilaritiesPerItem 1000 \
>>> > --minPrefsPerUser 5 \
>>> > --maxPrefsPerUserInItemSimilarity 30 \
>>> > --threshold 1.1 \
>>> > --tempDir hdfs://nameservice1/recommenditembased/temp \
>>> > --outputPathForSimilarityMatrix hdfs://nameservice1/recommenditembased/sim_matrix
>>> >
>>> >
>>> > I'm on Cloudera CDH 4.7; it looks like this feature is not supported.
>>> >
>>> >
>>> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <[email protected]>:
>>> >
>>> > > Serega,
>>> > >
>>> > > See the last line on how to pass the outputPathForSimilarityMatrix
>>> > > option to the recommenditembased command:
>>> > >
>>> > > sudo -u oozie mahout recommenditembased \
>>> > > --input visited_items_with_inverted_items \
>>> > > --output result \
>>> > > --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>> > > --usersFile inverted_items \
>>> > > --numRecommendations 500 \
>>> > > --booleanData false \
>>> > > --maxPrefsPerUser 100 \
>>> > > --maxSimilaritiesPerItem 500 \
>>> > > --minPrefsPerUser 0 \
>>> > > --maxPrefsPerUserInItemSimilarity 30 \
>>> > > --threshold 0.91 \
>>> > > --tempDir temp \
>>> > > --outputPathForSimilarityMatrix similarityMatri \
>>> > >
>>> > >
>>> > > Peng Zhang
>>> > > [email protected]
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>>> [email protected]>
>>> > > wrote:
>>> > >
>>> > > > I've inspected the code; our approach wouldn't work with
>>> > > > booleanData=false.
>>> > > > We do calculate item similarity in the wrong way...(((
>>> > > > Thank you
>>> > > > 1. We provide a "fake" user_id and provide --usersFile in order to get
>>> > > > recommendations for the "fake" user_id, where user_id is a negative
>>> > > > item_id. It worked when we provided user_id->item_id pairs without
>>> > > > preference.
>>> > > > 2. Our target is to get item similarities. We tried
>>> > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
>>> > > > it returns bad results compared to RecommenderJob with our "fake"
>>> > > > user_id (inverted item_id)
>>> > > >
>>> > > > 1. I'll try the option you provided.
>>> > > > 2. I will remove the input with fake user_id and the usersFile with
>>> > > > these fake ids
>>> > > > 3.
>>> > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>> > > > I don't understand how to pass the --outputPathForSimilarityMatrix
>>> > > > option to RecommenderJob
>>> > > >
>>> > > >
>>> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <[email protected]>:
>>> > > >
>>> > > >> Serega,
>>> > > >>
>>> > > >> I have two comments:
>>> > > >> 1. Don’t use negative user ids. Since Mahout uses user id as well as
>>> > > >> item id as the row/column index, you’d better use 0, 1, 2, etc. as ids
>>> > > >> 2. If you want to get the item similarity information, you can use
>>> > > >> --outputPathForSimilarityMatrix in the command
>>> > > >>
>>> > > >> Regards,
>>> > > >> Peng Zhang
>>> > > >> M: +86 186-1658-7856
>>> > > >> [email protected]
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>>> [email protected]
>>> > >
>>> > > >> wrote:
>>> > > >>
>>> > > >>> All bad things happen here:
>>> > > >>>
>>> > > >>>
>>> > > >>>
>>> > > >>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>>> > > >>> User: oozie
>>> > > >>> Process User: oozie
>>> > > >>> Group: oozie
>>> > > >>> Mapper Class: PartialMultiplyMapper
>>> > > >>> Reducer Class: AggregateAndRecommendReducer
>>> > > >>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>>> > > >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>> > > >>>
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map input records=3312879
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map output records=3313251
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce input records=3313251
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce output records=0
>>> > > >>>
>>> > > >>> Why does Mahout return 0 rows? It works when booleanData=true
>>> > > >>> (preferences are ignored...?)
>>> > > >>>
>>> > > >>>
>>> > > >>>
>>> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>>> [email protected]
>>> > >:
>>> > > >>>
>>> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>> > > >>>> users_file:
>>> > > >>>> --inverted_item_id
>>> > > >>>> -1
>>> > > >>>> -2
>>> > > >>>> -3
>>> > > >>>> -4
>>> > > >>>>
>>> > > >>>> users_items_prefs
>>> > > >>>> --inverted item_id
>>> > > >>>> -1 1 1.0
>>> > > >>>> -2 2 1.0
>>> > > >>>> -3 3 1.0
>>> > > >>>> -4 4 1.0
>>> > > >>>> --user_id item_id pref_value
>>> > > >>>> 11 1 1.6
>>> > > >>>> 11 2 1.6
>>> > > >>>> 123 3 2.0
>>> > > >>>> 123 4 2.0
>>> > > >>>> 333 1 2.0
>>> > > >>>> 333 2 1.6
>>> > > >>>> --e.t.c.
>>> > > >>>>
>>> > > >>>> if I set --booleanData true
>>> > > >>>> then mahout returns the result.
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>>> > > [email protected]
>>> > > >>> :
>>> > > >>>>
>>> > > >>>>> I'm confused about how you're constructing the user file, and why
>>> > > >>>>> there are negated item ids here.
>>> > > >>>>>
>>> > > >>>>> Can you post some more details please, including Mahout version and
>>> > > >>>>> some sample data sets?
>>> > > >>>>>
>>> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
>>> > > >> [email protected]>
>>> > > >>>>> wrote:
>>> > > >>>>>>
>>> > > >>>>>> Hi, I'm trying to create item similarity.
>>> > > >>>>>> I gather items which users visit during shopping and then create a
>>> > > >>>>>> file:
>>> > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9],
>>> > > >>>>>> depending on user action type and data source)
>>> > > >>>>>> UNION
>>> > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>> > > >>>>>>
>>> > > >>>>>> and I do provide a userFile, where user_id = -item_id
>>> > > >>>>>>
>>> > > >>>>>> The idea is to get item similarity. If any user visits an item named
>>> > > >>>>>> "A", I want to show him items "B", "c", "xxx" using preferences of
>>> > > >>>>>> other users.
>>> > > >>>>>>
>>> > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>> > > >>>>>>
>>> > > >>>>>> Here are my settings:
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>> sudo -u oozie mahout recommenditembased \
>>> > > >>>>>> --input visited_items_with_inverted_items \
>>> > > >>>>>> --output result \
>>> > > >>>>>> --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>> > > >>>>>> --usersFile inverted_items \
>>> > > >>>>>> --numRecommendations 500 \
>>> > > >>>>>> --booleanData false \
>>> > > >>>>>> --maxPrefsPerUser 100 \
>>> > > >>>>>> --maxSimilaritiesPerItem 500 \
>>> > > >>>>>> --minPrefsPerUser 0 \
>>> > > >>>>>> --maxPrefsPerUserInItemSimilarity 30 \
>>> > > >>>>>> --threshold 0.91 \
>>> > > >>>>>> --tempDir temp \
>>> > > >>>>>>
>>> > > >>>>>> Some counters... I don't understand what they mean....
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: USERS=7528530
>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_NEGLECTED=1,798,738
>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_USED=12,429,693
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: ROWS=3312879
>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: COOCCURRENCES=35882374
>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map input records=3312879
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map output records=17570268
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce input records=5221907
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce output records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map input records=7528530
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map output records=3313251
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce input records=3313251
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce output records=3313251
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map input records=6626130
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map output records=6626130
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce input records=6626130
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce output records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map input records=3312879
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map output records=3313251
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce input records=3313251
>>> > > >>>>>> --------
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce output records=0
>>> > > >>>>>> --------
>>> > > >>>>>>
>>> > > >>>>>> why 0???