Ok, so that was a good clarification, which led me to new questions :)

The system I need should of course return recommendations in no time.
And as Sean said, it needs some real-time component to enable a
different recommendation after the user interacts with the application.
But because I'm talking about very large scale, I guess I want to push
much of my model computation to an offline mode (refreshed every X
minutes).

So my options look like this (considering I want to build a really scalable
solution):
Use the non-distributed \ distributed code to compute some of my model in
advance (for example item-item similarity \ KNN for each user) --> I guess
that for that part, since I'm offline, the MapReduce code is ideal,
because of its scalability.
Then use non-distributed online code to calculate the final recommendations
based on the precomputed part and do some final computation (weighting the KNN
ratings for items my user hasn't experienced yet).
To be able to do so, I will probably need a machine with high
memory capacity so that all the precomputed data fits in memory.
I can even go further and prepare a cached recommender that is refreshed
whenever I really want my recommendations to be updated.
Am I right here?
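As a sanity check on that split, here is a toy Python sketch (not Mahout code; all names and numbers here are made up for illustration) of the two stages: item-item similarities precomputed offline and held in memory, and an online step that scores unseen items by similarity-weighting the user's existing ratings:

```python
# Hypothetical sketch of the offline/online split. The similarity map
# stands in for the output of a periodic offline (e.g. MapReduce) job;
# the online step only does the cheap weighted average per request.

# Precomputed offline (refreshed every X minutes), kept in memory online:
item_similarity = {
    ("A", "B"): 0.9, ("A", "C"): 0.3, ("B", "C"): 0.6,
}

def sim(i, j):
    """Look up the precomputed similarity, in either key order."""
    if i == j:
        return 1.0
    return item_similarity.get((i, j)) or item_similarity.get((j, i), 0.0)

def recommend(user_ratings, all_items, top_n=2):
    """Online step: score only items the user hasn't experienced yet."""
    scores = {}
    for candidate in all_items:
        if candidate in user_ratings:
            continue  # skip items the user already rated
        num = sum(sim(candidate, i) * r for i, r in user_ratings.items())
        den = sum(abs(sim(candidate, i)) for i in user_ratings)
        if den > 0:
            scores[candidate] = num / den  # similarity-weighted average
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recs = recommend({"A": 4.0, "B": 5.0}, ["A", "B", "C"])
print(recs)
```

In the real system the similarity map would be the offline job's output, and only the `recommend` step sits on the request path; a cached layer on top of it would cover the refresh-interval idea.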

I know the "glue" between the two parts is not quite there (as Sean said), but my
question is, how well does the current framework support this kind of
architecture? Meaning, what kinds of computations can I really prepare in
advance before continuing to the final step? Besides the co-occurrence
matrix and matrix factorization, what other computations are available to me
in a MapReduce manner? Does it mean I will need two separate machines in that
case, a Hadoop cluster for the offline computation and an online one
that uses the distributed output to produce the final recommendations (but then
I need to move data between machines, which is not so ideal...)?

Also, as I mentioned earlier, I might need to store my data in a SQL database.
If so, what drivers are currently supported? I saw only JDBC & PostgreSQL; are
there any others?
As you said in the book, using a SQL database will probably slow things down
because of the data movement through the drivers... Could you estimate how much
slower it is compared to using a file? Again, I might do the reading from the
DB offline, so I'm not too afraid of losing some speed...


-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Sunday, March 25, 2012 21:35
To: [email protected]
Subject: Re: Mahout beginner questions...

Not really.  See my previous posting.

The best way to get fast recommendations is to use an item-based
recommender.  Pre-computing recommendations for all users is not usually a
win because you wind up doing a lot of wasted work and you still don't have
anything for new users who appear between refreshes.  If you build up a
service to handle the new users, you might as well just serve all users
from that service so that you get up to date recommendations for everyone.

There IS a large off-line computation.  But that doesn't produce
recommendations for USERS.  It typically produces recommendations for
ITEMS.  Then those item-item recommendations are combined to produce
recommendations for users.
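To illustrate the pattern Ted describes, a minimal Python sketch (hypothetical data and names, not Mahout's API): the offline job emits, per item, a short list of related items with weights; online, a user's recommendations come from merging the related-item lists of the items in that user's history:

```python
# Hypothetical offline output: for each ITEM, related items and weights.
related = {
    "A": [("B", 0.9), ("C", 0.3)],
    "B": [("A", 0.9), ("C", 0.6)],
}

def user_recs(history, top_n=3):
    """Combine item-item recommendations into recommendations for a user."""
    scores = {}
    for item in history:
        for other, w in related.get(item, []):
            if other in history:
                continue  # don't recommend what the user already has
            scores[other] = scores.get(other, 0.0) + w  # accumulate weights
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(user_recs(["A", "B"]))
```

The point of the structure is that the expensive part (building `related`) is per-item and batch-refreshable, while the per-user merge is cheap enough to run at request time, even for users who appeared after the last refresh.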

On Sun, Mar 25, 2012 at 12:28 PM, Razon, Oren <[email protected]> wrote:

> Correct me if I'm wrong, but a good way to boost speed could be to use a
> caching recommender, meaning computing the recommendations in advance
> (refreshing every X min\hours) and always recommending using the most
> updated recommendations, right?!
>
> -----Original Message-----
> From: Sean Owen [mailto:[email protected]]
> Sent: Sunday, March 25, 2012 21:25
> To: [email protected]
> Subject: Re: Mahout beginner questions...
>
> It is memory. You will need a pretty large heap to put 100M data in memory
> -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
> You can go bigger if you have more memory but that size seems about the
> biggest to reasonably assume people have.
>
> Of course more data slows things down and past about 10M data points you
> need to tune things to sample data rather than try every possibility. This
> is most of what CandidateItemsStrategy has to do with. It is relatively easy
> to tune this though so speed doesn't have to be an issue.
>
> Again you can go bigger and tune it to down-sample more; somehow I still
> believe that 100M is a crude but useful rule of thumb, as the point
> beyond which it's just hard to get good speed and quality.
>
> Sean
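For what it's worth, Sean's heap figure is easy to sanity-check with back-of-the-envelope arithmetic; the per-preference cost below is an assumed figure for illustration, not Mahout's actual in-memory layout:

```python
# Rough heap estimate for holding 100M preferences in memory.
# 40 bytes/preference is an assumption (user ID + item ID + value
# + object/array overhead), not Mahout's exact representation.
prefs = 100_000_000
bytes_per_pref = 40
gib = prefs * bytes_per_pref / 2**30
print(f"~{gib:.1f} GiB of heap")
```

That lands in the same ballpark as the "probably 4GB, if not a little more" estimate above.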
>
> On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <[email protected]> wrote:
>
> > Thanks for the detailed answer Sean.
> > I want to understand more clearly the non-distributed code limitations.
> > I saw that you advise that for more than 100,000,000 ratings the
> > non-distributed engine won't do the job.
> > The question is why? Is it a memory issue (so that with a bigger machine
> > I could scale up), or is it because of the recommendation time it takes?
> >
> >
> ---------------------------------------------------------------------
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>
