As I recall, the distributed job only computes the diffs. You then copy the
resulting model and use it locally.

On Hadoop you generally don't read two big inputs at the same time and
compute over both of them at once. If you're on Hadoop, they're probably far
too big for that. Instead you split them up and join pieces of them at a time
to perform the computation. That joining is most of the work you see the job
doing.

You can write more MR jobs to finish the recommendation computation if you
like. You would join rows of the diff matrix with each user's pref row, and
compute one user-item score. Those are then aggregated, sorted and output
in a later job.
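
To make the join concrete, here is a rough single-machine sketch in Python of
what those extra jobs would compute. It is not Mahout code. The function name
predict_scores and the dict layouts are made up for illustration, and I am
assuming each diff row is keyed by (itemA, itemB) with itemA < itemB and holds
the average of (rating of itemB - rating of itemA) plus the count, with the
usual count-weighted slope-one formula:

```python
from collections import defaultdict


def predict_scores(user_prefs, diffs):
    """Estimate scores for items the user has not rated.

    user_prefs: {item: rating} for one user (hypothetical layout).
    diffs: {(item_a, item_b): (avg_diff, count)} with item_a < item_b,
           where avg_diff is assumed to be mean(rating_b - rating_a).
    Returns a list of (item, score) sorted by score, highest first.
    """
    totals = defaultdict(float)   # weighted sum of per-pair estimates
    weights = defaultdict(int)    # sum of counts used as weights

    for (a, b), (avg_diff, count) in diffs.items():
        # The "join": a diff row contributes only when exactly one side
        # of the pair appears in this user's pref row.
        if a in user_prefs and b not in user_prefs:
            totals[b] += (user_prefs[a] + avg_diff) * count
            weights[b] += count
        elif b in user_prefs and a not in user_prefs:
            totals[a] += (user_prefs[b] - avg_diff) * count
            weights[a] += count

    # The "aggregate and sort" step that a later job would perform.
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    prefs = {1: 4.0, 2: 3.0}
    diffs = {(1, 2): (-1.0, 3), (1, 3): (0.5, 2), (2, 3): (1.0, 1)}
    print(predict_scores(prefs, diffs))
```

In an MR version the loop body would be the per-pair output of the join, and
the division and sort would happen in the later aggregation job.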

The output has 5 cols, not 6, and the 5th is indeed the standard deviation.

I think you'd have to be more specific about the last part. You look at all
item-item pairs for one user and output what they contribute to item-item
diffs. It's slow if you have a lot of items, since the number of pairs grows
with the square of the number of items a user has rated.
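
As a rough illustration of that per-user step and the averaging that follows,
here is a small in-memory Python sketch. It is not the Mahout implementation.
The names user_diffs and average_diffs are hypothetical, and the sign
convention (rating of the higher item ID minus rating of the lower one) is an
assumption:

```python
from collections import defaultdict
from itertools import combinations


def user_diffs(prefs):
    """Yield ((item_a, item_b), rating_b - rating_a) for every item pair
    of one user, with item_a < item_b. A user with n rated items
    produces n * (n - 1) / 2 records."""
    for (a, ra), (b, rb) in combinations(sorted(prefs.items()), 2):
        yield (a, b), rb - ra


def average_diffs(all_user_prefs):
    """Combine the per-user records into {(a, b): (avg_diff, count)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in all_user_prefs:
        for pair, diff in user_diffs(prefs):
            sums[pair] += diff
            counts[pair] += 1
    return {pair: (sums[pair] / counts[pair], counts[pair]) for pair in sums}


if __name__ == "__main__":
    users = [{1: 5.0, 2: 3.0, 3: 2.0}, {1: 3.0, 2: 4.0}]
    print(average_diffs(users))
```

A user with 1,000 rated items already emits 499,500 pair records, which is
why the amount of data shuffled between map and reduce grows so quickly.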

I don't understand the question about the table. Which table, and why do
you expect more than one?

I am not sure I would recommend slope-one if you have any significant
number of items.



On Thu, Nov 15, 2012 at 6:16 AM, Steven Zheng <[email protected]> wrote:

> Hello,
>
> I have some questions about the SlopeOneRecommender and distributed
> SlopeOne.
>
> The first one is an old question someone posted before:
>
>  I'm confused: how can I read the users' profiles as well as the
> diff-matrix at the same time (they are at different locations in my HDFS) to
> predict a specific user's ratings? I've already checked the Mahout
> implementation of slope-one with Hadoop, but that one only calculates the
> diff-matrix, and no prediction part is included. Can anyone
> help me? How do I read two kinds of data in a Hadoop program at the same
> time?
>
> Well, if I run
> org.apache.mahout.cf.taste.hadoop.slopeone.SlopeOneAverageDiffsJob on
> Hadoop, it finally generates the diff file with 6 columns, like:
>
>     1 18 -0.55439756  1967 -0.55439756  3739.179461
>     1 19 -1.310974583 5941 -1.310974583 10706.22446
>     1 20 -1.184633028 1308 -1.184633028 1933.661124
>     1 21 -0.407834403 7633 -0.407834403 9899.411503
>
> The first two are the itemA and itemB pair. The third one is the diff, the
> fourth one is the count. What do the last two mean? The stdDev?
>
> If possible, could you please explain a little bit about the
> SlopeOneDiffsToAveragesReducer? The PrefsToDiffs step is easy to understand:
> it just processes each user to generate item-item difference pairs. How about
> the SlopeOneDiffsToAveragesReducer? Why is it so slow on Hadoop? Why
> is the final DiffStorage just one single table like the table above?
>
> Thanks,
>
> Steven
>
