1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MapReduce and HDFS. Not
much, but enough for a start.
2. I run it as an Oozie action:
<action name="run-mahout-primary-similarity-ItemSimilarityJob">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${mahoutOutputDir}/primary" />
            <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
        </prepare>
        <configuration>
            <property>
                <name>mapred.queue.name</name>
                <value>default</value>
            </property>
        </configuration>
        <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
        <arg>--input</arg>
        <arg>${tempDir}/to-mahout-id/projPrefs</arg>
        <!-- dense user_id, item_id, pref; pref is 3 (VIEW item) or 5 (CLICK
             on recommendation), an attempt to increase the quality of the
             recommender -->
        <arg>--output</arg>
        <arg>${mahoutOutputDir}/primary</arg>
        <arg>--similarityClassname</arg>
        <arg>SIMILARITY_COSINE</arg>
        <arg>--maxSimilaritiesPerItem</arg>
        <arg>50000</arg>
        <arg>--minPrefsPerUser</arg>
        <arg>0</arg>
        <arg>--booleanData</arg>
        <arg>false</arg>
        <arg>--tempDir</arg>
        <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
    </java>
    <ok to="to-narrow-table"/>
    <error to="kill"/>
</action>
3) RANK does it; here is the script:
-- user, item, pref previously prepared by Hive
user_item_pref = LOAD '$user_item_pref' USING PigStorage(',')
    AS (user_id:chararray, item_id:long, pref:double);

-- get distinct users from the whole input
distUserId = DISTINCT (FOREACH user_item_pref GENERATE user_id);
-- get distinct items from the whole input
distItemId = DISTINCT (FOREACH user_item_pref GENERATE item_id);

-- rank users 1..N
rankUsers_ = RANK distUserId;
rankUsers = FOREACH rankUsers_ GENERATE $0 AS rank_id, user_id;
-- rank items 1..M
rankItems_ = RANK distItemId;
rankItems = FOREACH rankItems_ GENERATE $0 AS rank_id, item_id;

-- join and remap natural user_id, item_id to ranks 1..N, 1..M
joinedUsers = JOIN user_item_pref BY user_id, rankUsers BY user_id USING 'skewed';
joinedItems = JOIN joinedUsers BY user_item_pref::item_id, rankItems BY item_id USING 'replicated';
projPrefs = FOREACH joinedItems GENERATE
    joinedUsers::rankUsers::rank_id AS user_id,
    rankItems::rank_id AS item_id,
    joinedUsers::user_item_pref::pref AS pref;

-- store the mappings for later remapping from ranks back to natural values
STORE (FOREACH rankUsers GENERATE rank_id, user_id) INTO '$rankUsers' USING PigStorage('\t');
STORE (FOREACH rankItems GENERATE rank_id, item_id) INTO '$rankItems' USING PigStorage('\t');
STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) INTO '$projPrefs' USING PigStorage('\t');
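
For completeness, step 6 (remapping Mahout's output back to natural keys) is
the reverse join; a minimal sketch, assuming ItemSimilarityJob's usual
tab-separated output of item_a, item_b, similarity and the '$rankItems'
mapping stored above ('$mahoutOutput' and '$similaritiesNatural' are
placeholder names):

-- sketch: remap Mahout's dense item ids back to natural item ids
similarities = LOAD '$mahoutOutput' USING PigStorage('\t')
    AS (item_a:long, item_b:long, similarity:double);
itemMapping = LOAD '$rankItems' USING PigStorage('\t')
    AS (rank_id:long, item_id:long);

-- remap the first item id
joinedA = JOIN similarities BY item_a, itemMapping BY rank_id USING 'replicated';
remapA = FOREACH joinedA GENERATE
    itemMapping::item_id AS item_a,
    similarities::item_b AS item_b,
    similarities::similarity AS similarity;

-- remap the second item id
joinedB = JOIN remapA BY item_b, itemMapping BY rank_id USING 'replicated';
remapB = FOREACH joinedB GENERATE
    remapA::item_a AS item_a,
    itemMapping::item_id AS item_b,
    remapA::similarity AS similarity;

STORE remapB INTO '$similaritiesNatural' USING PigStorage('\t');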
4) I've seen this idea in other discussions too, that different weights for
different actions are not good. Sorry, I don't understand what you suggest.
I have two kinds of actions: the user viewed an item, and the user clicked
on a recommended item (recommended items are produced by my item similarity
system).
I want to produce two kinds of recommendations:
1. for the current item, recommend other items which other users visit in
conjunction with the current item
2. similar items: recommend items similar to the currently viewed item.
What can I try?
Is LLR (http://en.wikipedia.org/wiki/Log-likelihood_ratio) the same as
SIMILARITY_LOGLIKELIHOOD?
Right now I get awful recommendations and I can't understand what to try
next :(
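
If your point 4 means "use only the primary action", I understand it as
something like this sketch against the user_item_pref relation from the
script above ('$primaryPrefs' is a placeholder name), with the job then
run with --booleanData true and SIMILARITY_LOGLIKELIHOOD:

-- sketch: keep only the primary action (pref == 3, item views) and drop the
-- weight, since with --booleanData true Mahout ignores the pref value anyway
primary_only = FILTER user_item_pref BY pref == 3.0;
primaryPrefs = FOREACH primary_only GENERATE user_id, item_id;
STORE primaryPrefs INTO '$primaryPrefs' USING PigStorage('\t');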
2014-08-17 19:02 GMT+04:00 Pat Ferrel <[email protected]>:
> 1) How many cores are in the cluster? The whole idea behind mapreduce is
> that when you buy more CPUs you get a nearly linear decrease in runtime.
> 2) What is your Mahout command line with options, or how are you invoking
> Mahout? I have seen the Mahout mapreduce recommender take this long, so we
> should check what you are doing with downsampling.
> 3) Do you really need to RANK your ids? That's a full sort. When using Pig
> I usually get the DISTINCT ones and assign an incrementing integer as the
> corresponding Mahout ID.
> 4) Your #2, assigning different weights to different actions, usually does
> not work. I've done this before, compared offline metrics, and seen
> precision go down. I'd get this working using only your primary actions
> first. What are you trying to get the user to do? View something, buy
> something? Use that action as the primary preference and start out with a
> weight of 1 using LLR. With LLR the weights are not used anyway, so your
> data may not produce good results with mixed actions.
>
> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> 1) the output from your step 2 can be directly ingested and will create
> output.
> 2) multiple actions can be used with cross-cooccurrence, not by guessing
> at weights.
> 3) the output has your application-specific IDs preserved.
> 4) it's about 10x faster than mapreduce and will do away with your ID
> translation steps.
>
> One caveat is that your cluster machines will need lots of memory. I have
> 8-16g on mine.
>
> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]>
> wrote:
>
> 1. I collect preferences for items using a 60-day sliding window (today
> minus 60 days).
> 2. I prepare triples of user_id, item_id, discrete_pref_value (3 for an
> item view, 5 for a click on the recommendation block; the idea is to give
> more value to recommendations which attract visitor attention). I get
> ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000 distinct
> users.
> 3. I use the Apache Pig RANK function to rank all distinct user_ids.
> 4. I do the same for item_ids.
> 5. I join the input dataset with the ranked datasets and provide input to
> Mahout with dense integer user_id, item_id.
> 6. I take the Mahout output and join the integer item_id back to get the
> natural key values.
>
> steps #1-2 take ~40 min
> steps #3-5 take ~1 hour
> the Mahout calculation takes ~3 hours
>
>
>
> 2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:
>
> > This really doesn't sound right. It should be possible to process almost
> > a thousand times that much data every night without that much problem.
> >
> > How are you preparing the input data?
> >
> > How are you converting to Mahout id's?
> >
> > Even using python, you should be able to do the conversion in just a few
> > minutes without any parallelism whatsoever.
> >
> >
> >
> >
> > On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]>
> > wrote:
> >
> >> Hi, we are trying to calculate ItemSimilarity.
> >> Right now we have 2*10^7 input lines. I provide input data as raw text
> >> each day to recalculate item similarities. We get +100..1000 new items
> >> each day.
> >> 1. It takes too much time to prepare the input data.
> >> 2. It takes too much time to convert user_id, item_id to Mahout ids.
> >>
> >> Is there any possibility to provide data to the Mahout mapreduce
> >> ItemSimilarity using some binary format with compression?
> >>
> >
>
>