Re: mapreduce ItemSimilarity input optimization

Pat Ferrel Sun, 17 Aug 2014 18:17:38 -0700

The things that stand out:

1) remove your maxSimilaritiesPerItem option! A maxSimilaritiesPerItem of 50000 
will _kill_ performance and give no gain. Leave this setting at its default.
2) use only one action. What do you want the user to do? Do you want them to 
read a page? Then train on item page views. If those pages lead to a purchase 
then you want to recommend purchases so train on user purchases.
3) remove your minPrefsPerUser option. This should never be 0, or it will leave 
users with no data in the training set, which may contribute to longer runs 
with no gain.
4) this is a pretty small Hadoop cluster for the size of your data but I bet 
changing #1 will noticeably reduce the runtime
5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
6) remove your --booleanData option since LLR ignores weights.
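
To illustrate why LLR ignores weights, here is a minimal Python sketch of the log-likelihood ratio for a pair of items. It assumes the usual 2x2 contingency-table formulation based on entropies; the function names (x_log_x, entropy, log_likelihood_ratio, cooccurrence_counts) are made up for this example and are not Mahout API. Only the counts of users enter the score, so a pref of 3 or 5 makes no difference.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy (in nats) of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 contingency table of user counts.

    k11: users who interacted with both item A and item B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def cooccurrence_counts(prefs, item_a, item_b):
    """Build the 2x2 table from (user, item, pref) triples.

    The pref value is never read: only the presence of an interaction counts.
    """
    users = {u for u, _, _ in prefs}
    saw_a = {u for u, i, _ in prefs if i == item_a}
    saw_b = {u for u, i, _ in prefs if i == item_b}
    k11 = len(saw_a & saw_b)
    k12 = len(saw_a - saw_b)
    k21 = len(saw_b - saw_a)
    k22 = len(users) - len(saw_a | saw_b)
    return k11, k12, k21, k22


# Hypothetical toy data: prefs of 3 (view) and 5 (click) mixed together.
weighted = [("u1", "a", 3.0), ("u1", "b", 5.0), ("u2", "a", 5.0),
            ("u2", "b", 3.0), ("u3", "c", 3.0), ("u4", "c", 5.0)]
# The same interactions with every weight flattened to 1.
flat = [(u, i, 1.0) for u, i, _ in weighted]

score_weighted = log_likelihood_ratio(*cooccurrence_counts(weighted, "a", "b"))
score_flat = log_likelihood_ratio(*cooccurrence_counts(flat, "a", "b"))
```

Both scores come out identical, which is the point: with LLR the 3-versus-5 weighting adds nothing, so training on a single action loses no information.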


Remember that this is not the same as personalized recommendations. This method 
alone will show the same “similar items” for all users.

Sorry, but both your “recommendation” types sound like the same thing. Item 
page views _and_ clicks on recommended items both lead to an item page view, 
so you have two actions that lead to the same thing, right? Just train on 
item page views (unless you really want the user to make a purchase).

What do you mean the similar items are terrible? How are you measuring that? 
Are you doing cross-validation measuring precision or A/B testing? What looks 
bad to you may be good; the eyeball test is not always reliable. If they are 
coming up completely crazy or random then you may have a bug in your ID 
translation logic.
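
One way to rule out an ID translation bug is a round-trip check. The sketch below is a hedged illustration only: it assumes the input is a list of (user_id, item_id, pref) triples small enough to fit in memory, and the names build_id_maps and invert are hypothetical. It assigns dense integer IDs in first-seen order (no sort needed), then verifies that decoding the dense IDs reproduces the original triples.

```python
def build_id_maps(triples):
    """Assign dense integer IDs (0, 1, 2, ...) in first-seen order."""
    user_ids, item_ids = {}, {}
    encoded = []
    for user, item, pref in triples:
        # setdefault returns the existing ID, or stores and returns the next one.
        u = user_ids.setdefault(user, len(user_ids))
        i = item_ids.setdefault(item, len(item_ids))
        encoded.append((u, i, pref))
    return encoded, user_ids, item_ids


def invert(mapping):
    """Build the dense -> natural lookup and fail loudly on collisions."""
    inverse = {dense: natural for natural, dense in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("two natural IDs share one dense ID")
    return inverse


# Hypothetical sample input with natural (application) IDs.
triples = [("alice", 1001, 1.0), ("bob", 1002, 1.0), ("alice", 1002, 1.0)]

encoded, user_ids, item_ids = build_id_maps(triples)
users_back, items_back = invert(user_ids), invert(item_ids)

# Round trip: decoding the dense IDs must give back the original triples.
decoded = [(users_back[u], items_back[i], p) for u, i, p in encoded]
assert decoded == triples
```

If the same check fails against the real mapping files, the similar items will look random no matter which similarity measure is used.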

It sounds like you have enough data to produce good results.

On Aug 17, 2014, at 11:14 AM, Serega Sheypak wrote:

1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too much,
but enough for a start.
2. I run it as an Oozie action.
<action name="run-mahout-primary-similarity-ItemSimilarityJob">
       <java>
           <job-tracker>${jobTracker}</job-tracker>
           <name-node>${nameNode}</name-node>
           <prepare>
               <delete path="${mahoutOutputDir}/primary" />
               <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
           </prepare>
           <configuration>
               <property>
                   <name>mapred.queue.name</name>
                   <value>default</value>
               </property>

           </configuration>

           <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
           <arg>--input</arg>
           <!-- dense user_id, item_id, pref; pref is 3 or 5: 3 is VIEW item,
                5 is CLICK on recommendation, an attempt to increase the
                quality of the recommender -->
           <arg>${tempDir}/to-mahout-id/projPrefs</arg>

           <arg>--output</arg>
           <arg>${mahoutOutputDir}/primary</arg>

           <arg>--similarityClassname</arg>
           <arg>SIMILARITY_COSINE</arg>

           <arg>--maxSimilaritiesPerItem</arg>
           <arg>50000</arg>

           <arg>--minPrefsPerUser</arg>
           <arg>0</arg>

           <arg>--booleanData</arg>
           <arg>false</arg>

           <arg>--tempDir</arg>
           <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>

       </java>
       <ok to="to-narrow-table"/>
       <error to="kill"/>
   </action>

3) RANK does it, here is a script:

--user, item, pref previously prepared by hive
user_item_pref = LOAD '$user_item_pref' using PigStorage(',')
    as (user_id:chararray, item_id:long, pref:double);

--get distinct user from the whole input
distUserId = distinct(FOREACH user_item_pref GENERATE user_id);

--get distinct item from the whole input
distItemId = distinct(FOREACH user_item_pref GENERATE item_id);

--rank user 1....N
rankUsers_ = RANK distUserId;
rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;

--rank items 1....M
rankItems_ = RANK distItemId;
rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;

--join and remap natural user_id, item_id to RANKS: 1..N, 1..M
joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING 'skewed';
joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by item_id using 'replicated';

projPrefs = FOREACH joinedItems GENERATE
    joinedUsers::rankUsers::rank_id as user_id,
    rankItems::rank_id as item_id,
    joinedUsers::user_item_pref::pref as pref;

--store mapping for later remapping from RANK back to natural values
STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using PigStorage('\t');
STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using PigStorage('\t');
STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs' using PigStorage('\t');

4) I've seen this idea in a different discussion: that different weights for
different actions do not work well. Sorry, I don't understand what you are
suggesting.
I have two kinds of actions: user viewed an item, and user clicked on a
recommended item (the recommended item is produced by my item similarity
system).
I want to produce two kinds of recommendations:
1. current item + recommend other items which other users visit in
conjunction with the current item
2. similar items: recommend items similar to the currently viewed item.
What can I try?
LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = LOG_LIKELIHOOD?

Right now I get awful recommendations and I don't know what to try
next :(


2014-08-17 19:02 GMT+04:00 Pat Ferrel:

> 1) how many cores in the cluster? The whole idea behind mapreduce is that
> if you buy more CPUs you get a nearly linear decrease in runtime.
> 2) what is your mahout command line with options, or how are you invoking
> mahout. I have seen the Mahout mapreduce recommender take this long so we
> should check what you are doing with downsampling.
> 3) do you really need to RANK your ids? That’s a full sort. When using pig
> I usually get the DISTINCT ones and assign an incrementing integer as the
> corresponding Mahout ID.
> 4) your #2 assigning different weights to different actions usually does
> not work. I’ve done this before and compared offline metrics and seen
> precision go down. I’d get this working using only your primary actions
> first. What are you trying to get the user to do? View something, buy
> something? Use that action as the primary preference and start out with a
> weight of 1 using LLR. With LLR the weights are not used anyway so your
> data may not produce good results with mixed actions.
> 
> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> 1) output from 2 can be directly ingested and will create output.
> 2) multiple actions can be used with cross-cooccurrence, not by guessing
> at weights.
> 3) output has your application specific IDs preserved.
> 4) it's about 10x faster than mapreduce and will do away with your ID
> translation steps
> 
> One caveat is that your cluster machines will need lots of memory. I have
> 8-16g on mine.
> 
> On Aug 17, 2014, at 1:26 AM, Serega Sheypak wrote:
> 
> 1. I collect preferences for items using a 60-day sliding window (today -
> 60 days).
> 2. I prepare triples user_id, item_id, discrete_pref_value (3 for an item
> view, 5 for a click on the recommendation block. The idea is to give more
> value to recommendations which attract visitor attention). I get ~20,000,000
> lines with ~1,000,000 distinct items and ~2,000,000 distinct users.
> 3. I use the Apache Pig RANK function to rank all distinct user_id values.
> 4. I do the same for item_id.
> 5. I join the input dataset with the ranked datasets and provide input to
> Mahout with dense integer user_id, item_id.
> 6. I take the Mahout output and join the integer item_id back to get the
> natural key value.
> 
> step #1-2 takes ~ 40min
> step #3-5 takes ~1 hour
> mahout calc takes ~3hours
> 
> 
> 
> 2014-08-17 10:45 GMT+04:00 Ted Dunning:
> 
>> This really doesn't sound right.  It should be possible to process almost a
>> thousand times that much data every night without that much problem.
>> 
>> How are you preparing the input data?
>> 
>> How are you converting to Mahout id's?
>> 
>> Even using python, you should be able to do the conversion in just a few
>> minutes without any parallelism whatsoever.
>> 
>> 
>> 
>> 
>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak wrote:
>> 
>>> Hi, we are trying to calculate ItemSimilarity.
>>> Right now we have 2*10^7 input lines. I provide the input data as raw text
>>> each day to recalculate item similarities. We get +100..1000 new items
>>> each day.
>>> 1. It takes too much time to prepare input data.
>>> 2. It takes too much time to convert user_id, item_id to mahout ids
>>> 
>>> Is there any possibility to provide data to mahout mapreduce
>>> ItemSimilarity using some binary format with compression?
>>> 
>> 
> 
>
