Re: mapreduce ItemSimilarity input optimization

Pat Ferrel Mon, 18 Aug 2014 09:06:12 -0700

When beginning to use a recommender from Mahout, I always suggest you start from 
the defaults. These often give the best results; tune afterwards to improve.


Your intuition is correct that multiple actions can be used to improve results, 
but get the basics working first. The easiest way to use multiple actions is 
with spark-itemsimilarity, so since you are using mapreduce for now, just use 
one action.

I would not try to combine the results from two similarity measures. There is 
no benefit, since LLR is better than any of them; at least I’ve never seen it 
lose. This comes from my experience trying many of the similarity metrics on 
exactly the same data, using cross-validation with precision (MAP, mean average 
precision). LLR wins in other cases I’ve tried too, so LLR is the only method 
presently used in the Spark version of itemsimilarity.
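
For reference, here is a minimal sketch of the LLR score for a single item 
pair, using the standard G^2 formula on a 2x2 cooccurrence table. The function 
name llr and its argument names are illustrative assumptions, not Mahout's API. 
Only counts go in, which is consistent with LLR not using preference weights.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 cooccurrence table.

    k11: users who interacted with both item A and item B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    # Each cell paired with its row total and column total.
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    g2 = 0.0
    for k, r, c in cells:
        if k > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            g2 += k * math.log(k * total / (r * c))
    return 2.0 * g2
```

A table whose rows and columns are independent scores 0; the more the pair 
cooccurs beyond what chance would predict, the higher the score.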



If you still get weird results, double-check your ID mapping. Run a small bit of 
data through and spot-check the mapping by hand.
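
A spot check of this kind can be sketched in a few lines of Python. This is 
only an illustration: the sample rows and the helper names build_id_map and 
invert are hypothetical, and it assumes the data fits in memory.

```python
def build_id_map(raw_ids):
    """Assign each distinct raw ID a dense integer, in first-seen order."""
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
    return mapping

def invert(mapping):
    """Build the dense -> raw lookup and fail loudly on collisions."""
    inverse = {dense: raw for raw, dense in mapping.items()}
    assert len(inverse) == len(mapping), "two raw IDs share one dense ID"
    return inverse

# Hypothetical sample rows: (user_id, item_id, pref)
sample = [("u-a", 9001, 1.0), ("u-b", 9002, 1.0), ("u-a", 9002, 1.0)]

user_map = build_id_map(row[0] for row in sample)
item_map = build_id_map(row[1] for row in sample)

# Translate to dense IDs, then back again, and compare with the input.
encoded = [(user_map[u], item_map[i], p) for u, i, p in sample]
users_back, items_back = invert(user_map), invert(item_map)
decoded = [(users_back[u], items_back[i], p) for u, i, p in encoded]
```

If decoded differs from sample, the translation step is the first place to 
look for a bug.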

At some point you will want to create a cross-validation test. This is good as 
a sort of integration sanity check when making changes to the recommender. You 
run cross-validation using standard test data to see if the score changes 
drastically between releases. Big changes may indicate a bug. At the beginning 
it will help you tune as in the case above where it helped decide on LLR.
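
As a rough sketch of the scoring side of such a test, MAP can be computed as 
below. The function names and the cutoff k are assumptions, and the 
normalization by min(number of held-out items, k) is one common definition 
rather than the only one. It assumes each recommendation list has no duplicates.

```python
def average_precision(recommended, relevant, k=10):
    """Average precision at k for one user's ranked recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)

def mean_average_precision(recs_by_user, held_out_by_user, k=10):
    """Mean of average precision over users that have held-out items."""
    users = [u for u, items in held_out_by_user.items() if items]
    if not users:
        return 0.0
    total = sum(
        average_precision(recs_by_user.get(u, []), held_out_by_user[u], k)
        for u in users
    )
    return total / len(users)
```

Hold out part of each user's history, train on the rest, then pass the 
recommendations and the held-out items to mean_average_precision and track the 
score from release to release.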



On Aug 18, 2014, at 1:43 AM, Serega Sheypak <[email protected]> wrote:

Thank you very much. I'll do what you are saying in bullets 1...5 and try
again.

I also tried:
1. calc data using SIMILARITY_COSINE
2. calc the same data using SIMILARITY_COOCCURRENCE
3. join #1 and #2 where cooccurrence >= $threshold

Where threshold is some empirical integer value; I've used "2". The idea is
to filter out item pairs which never occurred together...
Please see this link:
http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
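
The join in step 3 amounts to the following filter. This is a minimal sketch: 
the function name and the dictionaries keyed by (item_a, item_b) pairs are 
illustrative assumptions, not the format Mahout writes.

```python
def filter_by_cooccurrence(similarities, cooccurrences, threshold=2):
    """Keep similarity scores only for item pairs that were seen together
    at least `threshold` times. Pairs missing from `cooccurrences` count as 0."""
    return {
        pair: score
        for pair, score in similarities.items()
        if cooccurrences.get(pair, 0) >= threshold
    }
```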

If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
approach still make sense, or is it a useless waste of time?

"What do you mean the similar items are terrible? How are you measuring
that? " I have only done eyeball testing.
I automated preparation -> calculation -> HBase upload -> web-app serving, but I
didn't automate testing.




2014-08-18 5:16 GMT+04:00 Pat Ferrel <[email protected]>:

> the things that stand out:
> 
> 1) remove your maxSimilaritiesPerItem option! 50000 maxSimilaritiesPerItem
> will _kill_ performance and give no gain, leave this setting at the default
> of 500
> 2) use only one action. What do you want the user to do? Do you want them
> to read a page? Then train on item page views. If those pages lead to a
> purchase then you want to recommend purchases so train on user purchases.
> 3) remove your minPrefsPerUser option, this should never be 0 or it will
> leave users in the training data that have no data and may contribute to
> longer runs with no gain.
> 4) this is a pretty small Hadoop cluster for the size of your data but I
> bet changing #1 will noticeably reduce the runtime
> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
> 6) remove your --booleanData option since LLR ignores weights.
> 
> Remember that this is not the same as personalized recommendations. This
> method alone will show the same “similar items” for all users.
> 
> Sorry but both your “recommendation” types sound like the same thing.
> Using both item page view  _and_ clicks on recommended items will both lead
> to an item page view so you have two actions that lead to the same thing,
> right? Just train on an item page view (unless you really want the user to
> make a purchase)
> 
> What do you mean the similar items are terrible? How are you measuring
> that? Are you doing cross-validation measuring precision or A/B testing?
> What looks bad to you may be good, the eyeball test is not always reliable.
> If they are coming up completely crazy or random then you may have a bug in
> your ID translation logic.
> 
> It sounds like you have enough data to produce good results.
> 
> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <[email protected]>
> wrote:
> 
> 1. 7 nodes 4 CPU per node, 48 GB ram, 2 HDD for MR and HDFS. Not too much
> but enough for the start..
> 2. I run it as oozie action.
> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
>       <java>
>           <job-tracker>${jobTracker}</job-tracker>
>           <name-node>${nameNode}</name-node>
>           <prepare>
>               <delete path="${mahoutOutputDir}/primary" />
>               <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
>           </prepare>
>           <configuration>
>               <property>
>                   <name>mapred.queue.name</name>
>                   <value>default</value>
>               </property>
> 
>           </configuration>
> 
> 
> <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
>           <arg>--input</arg>
>           <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
> item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on recommendation,
> a kind of try to increase quality of recommender...]-->
> 
>           <arg>--output</arg>
>           <arg>${mahoutOutputDir}/primary</arg>
> 
>           <arg>--similarityClassname</arg>
>           <arg>SIMILARITY_COSINE</arg>
> 
>           <arg>--maxSimilaritiesPerItem</arg>
>           <arg>50000</arg>
> 
>           <arg>--minPrefsPerUser</arg>
>           <arg>0</arg>
> 
>           <arg>--booleanData</arg>
>           <arg>false</arg>
> 
>           <arg>--tempDir</arg>
>           <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> 
>       </java>
>       <ok to="to-narrow-table"/>
>       <error to="kill"/>
>   </action>
> 
> 3) RANK does it, here is a script:
> 
> --user, item, pref previously prepared by hive
> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> (user_id:chararray, item_id:long, pref:double);
> 
> --get distinct user from the whole input
> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> 
> --get distinct item from the whole input
> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> 
> --rank user 1....N
> rankUsers_ = RANK distUserId;
> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> 
> --rank items 1....M
> rankItems_ = RANK distItemId;
> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> 
> --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
> 'skewed';
> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
> item_id using 'replicated';
> 
> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
> as user_id,
>                                        rankItems::rank_id
> as item_id,
>                                        joinedUsers::user_item_pref::pref
> as pref;
> 
> --store mapping for later remapping from RANK back to natural values
> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using
> PigStorage('\t');
> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using
> PigStorage('\t');
> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs'
> using PigStorage('\t');
> 
> 4) I've seen this idea in a different discussion, that different weights for
> different actions are not good. Sorry, I don't understand what you are
> suggesting.
> I have two kinds of actions: user viewed item, user clicked on recommended
> item (recommended item produced by my item similarity system).
> I want to produce two kinds of recommendations:
> 1. current item + recommend other items which other users visit in
> conjunction with current item
> 2. similar item: recommend items similar to current viewed item.
> What can I try?
> Is LLR (http://en.wikipedia.org/wiki/Log-likelihood_ratio) the same as LOG_LIKELIHOOD?
> 
> Right now I do get awful recommendations and I can't understand what can I
> try next :((((((((((((
> 
> 
> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <[email protected]>:
> 
>> 1) how many cores in the cluster? The whole idea behind mapreduce is you
>> buy more cpus you get nearly linear decrease in runtime.
>> 2) what is your mahout command line with options, or how are you invoking
>> mahout. I have seen the Mahout mapreduce recommender take this long so we
>> should check what you are doing with downsampling.
>> 3) do you really need to RANK your ids, that’s a full sort? When using pig
>> I usually get DISTINCT ones and assign an incrementing integer as the
>> corresponding Mahout ID
>> 4) your #2 assigning different weights to different actions usually does
>> not work. I’ve done this before and compared offline metrics and seen
>> precision go down. I’d get this working using only your primary actions
>> first. What are you trying to get the user to do? View something, buy
>> something? Use that action as the primary preference and start out with a
>> weight of 1 using LLR. With LLR the weights are not used anyway so your
>> data may not produce good results with mixed actions.
>> 
>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
>> 1) output from 2 can be directly ingested and will create output.
>> 2) multiple actions can be used with cross-cooccurrence, not by guessing
>> at weights.
>> 3) output has your application specific IDs preserved.
>> 4) it's about 10x faster than mapreduce and will do away with your ID
>> translation steps
>> 
>> One caveat is that your cluster machines will need lots of memory. I have
>> 8-16g on mine.
>> 
>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]>
>> wrote:
>> 
>> 1. I collect preferences for items using a 60-day sliding window (today -
>> 60 days).
>> 2. I prepare triples user_id, item_id, discrete_pref_value (3 for item
>> view, 5 for clicking the recommendation block. The idea is to give more value
>> to recommendations which attract visitor attention). I get ~20,000,000
>> lines with ~1,000,000 distinct items and ~2,000,000 distinct users
>> 3. I use the Apache Pig RANK function to rank all distinct user_id
>> 4. I do the same for item_id
>> 5. I join the input dataset with the ranked datasets and provide input to
>> mahout with dense integer user_id, item_id
>> 6. I take the mahout output and join integer item_id back to get the
>> natural key value.
>> 
>> step #1-2 takes ~ 40min
>> step #3-5 takes ~1 hour
>> mahout calc takes ~3hours
>> 
>> 
>> 
>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:
>> 
>>> This really doesn't sound right.  It should be possible to process almost a
>>> thousand times that much data every night without that much problem.
>>> 
>>> How are you preparing the input data?
>>> 
>>> How are you converting to Mahout id's?
>>> 
>>> Even using python, you should be able to do the conversion in just a few
>>> minutes without any parallelism whatsoever.
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]>
>>> wrote:
>>> 
>>>> Hi, We are trying to calculate ItemSimilarity.
>>>> Right now we have 2*10^7 input lines. I provide input data as raw text
>>>> each day to recalculate item similarities. We get +100..1000 new items
>>>> each day.
>>>> 1. It takes too much time to prepare input data.
>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
>>>> 
>>>> Is there any possibility to provide data to mahout mapreduce
>>>> ItemSimilarity using some binary format with compression?
>>>> 
>>> 
>> 
>> 
> 
>
