Re: Recommender Engines on top of Storm

Ted Dunning Sat, 11 Jan 2014 11:59:18 -0800

Slope one is one of the few algorithms that performs so uniformly poorly
that it is being removed from Mahout.  I wouldn't recommend it for any
applications.

In general, recommendation isn't particularly well suited for on-line
operation since there is part of the computation that substantially
benefits from a large scale computation and part that really must be
interactive, but is best suited to a service oriented architecture rather
than a record streaming architecture.

There are two general branches of recommendation implementations.  One is
based on matrix factorization methods.  The other is based on cooccurrence
analysis and is focused on sparsification of the cooccurrence matrix to
produce an indicator matrix.  Lately, I strongly recommend that almost all
small or new projects use the cooccurrence based methods since they can be
deployed using search engines.  More on that in a bit.
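
As a rough illustration of the sparsification step (not code from this thread), the sketch below counts item cooccurrences across user histories and keeps only the pairs that pass a significance test. It assumes the log-likelihood ratio (LLR) test as the criterion, which the message itself does not name; the function names and the threshold value are likewise made up for the example.

```python
import math
from collections import Counter
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio score for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def indicators(histories, threshold=3.0):
    """Map each item to the set of items that cooccur with it anomalously.

    histories: dict of user -> iterable of items (hypothetical input format).
    threshold: assumed cutoff on the LLR score, chosen only for illustration.
    """
    n_users = len(histories)
    item_count = Counter()
    pair_count = Counter()
    for items in histories.values():
        items = sorted(set(items))
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    result = {}
    for (a, b), k11 in pair_count.items():
        k12 = item_count[a] - k11          # users with a but not b
        k21 = item_count[b] - k11          # users with b but not a
        k22 = n_users - k11 - k12 - k21    # users with neither
        if llr(k11, k12, k21, k22) >= threshold:
            result.setdefault(a, set()).add(b)
            result.setdefault(b, set()).add(a)
    return result
```

The output is the sparse indicator structure: most cooccurring pairs are dropped and each item keeps only a short list of related items.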

For the large scale computation, matrix factorization techniques have some
on-line approximations, but simply redoing the entire computation roughly
once a day is generally preferable, both for accuracy and for resource
cost.  Especially with multi-modal recommendations, there is little cost
for cold-starts.  The
interactive response part could easily be implemented in Storm, but there
is little benefit over an in-memory implementation based on well
established servers such as netty since there is no significant sense of
function composition in this part of the computation.

For more discrete algorithms for cooccurrence analysis such as the
indicator-matrix techniques, you could definitely accumulate the
cooccurrence statistics in an on-line fashion, but I am not sure I see
significant benefit.  One of the key performance features in such
algorithms is adaptive down-sampling.  On-line variants of that would not
easily be able to have flexibility in the choice of sampling since they
would almost inevitably be biased towards early samples.  It could be done,
but the off-line approaches are a really excellent match for map-reduce and
performance is pretty good.
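
The message does not spell out the sampling scheme, so the following is only one plausible reading of down-sampling in an off-line setting: cap the number of interactions kept per user and per item, sampling uniformly because the whole data set is visible at once. The function name, the caps, and the (user, item) tuple format are assumptions made for the example.

```python
import random
from collections import defaultdict


def downsample(interactions, max_per_user=500, max_per_item=500, seed=0):
    """Return a subset of (user, item) pairs with per-user and per-item caps.

    Off-line, every interaction is available before sampling, so each one
    has the same chance of being kept regardless of when it arrived.
    """
    rng = random.Random(seed)

    # First pass: cap the history of very active users.
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)
    kept = []
    for user, items in by_user.items():
        if len(items) > max_per_user:
            items = rng.sample(items, max_per_user)
        kept.extend((user, item) for item in items)

    # Second pass: cap the audience of very popular items.
    by_item = defaultdict(list)
    for user, item in kept:
        by_item[item].append(user)
    result = []
    for item, users in by_item.items():
        if len(users) > max_per_item:
            users = rng.sample(users, max_per_item)
        result.extend((user, item) for user in users)
    return result
```

An on-line version would have to decide what to keep before seeing how active a user or how popular an item eventually becomes, which is where the bias towards early samples comes from.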

The interactive component for the indicator matrix form of recommendation
algorithms is almost identical to what a text search engine already does.
This makes it very desirable to simply deploy the recommendation model as
a search index.  Operationally, this has massive benefits since much of the
necessary business logic is already provided by the capabilities of common
search engines such as Solr or Elasticsearch.  Moreover, the very
considerable operational history of these solutions makes it desirable to
use them without any additional coding.
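
To make the analogy with text search concrete, here is a minimal in-memory sketch, assuming each item is a "document" whose indicator items play the role of its terms and the user's recent history is the query. The scoring is a simplified inverse-document-frequency sum invented for the example; real engines such as Solr use more elaborate ranking, and nothing here is their API.

```python
import math


def recommend(index, history, top_n=3):
    """Rank items whose indicators overlap the user's history.

    index:   dict of item -> set of indicator items (the "document terms").
    history: set of items the user has interacted with (the "query terms").
    """
    n_docs = len(index)

    # Document frequency: in how many items' indicator lists each term appears.
    df = {}
    for terms in index.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1

    scores = {}
    for item, terms in index.items():
        if item in history:
            continue  # do not recommend what the user already has
        score = sum(math.log(1 + n_docs / df[t]) for t in terms if t in history)
        if score > 0:
            scores[item] = score

    # Highest score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]
```

For example, `recommend({"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}, {"A"})` scores only the items that list "A" among their indicators. A search engine does the same lookup through its inverted index, with filtering, paging, and boosting already built in.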

The research literature suggests that discrete algorithms may produce
results slightly inferior to those of matrix factorization.  Whether
this is true or not at realistic scale is not at all clear, since all of
the published research is done at relatively small scale.

My own experience is that the discrete implementations massively
out-perform the matrix factorization approaches in practical settings
simply because they are so simple to implement that they free up resources
to go about the work of finding more and better data for the engine to make
use of.  Finding better interaction data can result in improvements of
200-500% while tweaking algorithms rarely results in improvements of more
than 10% even at small scales.  Diverting development resources away from
the high value work is generally disastrously bad for performance unless
you have a really enormous team.

For other systems which are very much like recommendations such as ad
targeting, it is often worthwhile to build models specifically for each ad that
make use of content, context and user characteristics as well as
interactions of these.  For that work, on-line algorithms are very much
worthwhile.  The AdPredictor system paper [1] is a great intro to that
area.  I have also collected other related references to do with the
general field of Bayesian approaches to the multi-armed bandit [2], [3].
You can also use these bandits for adaptation of ranking, though I doubt
that parallelism is useful for this since the computation is so simple [4].
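
For readers unfamiliar with the Bayesian bandit idea, a minimal sketch follows. It assumes binary rewards (click or no click) with a uniform Beta(1, 1) prior per arm and picks arms by sampling from each posterior, usually called Thompson sampling. The class and method names are illustrative and are not taken from the linked posts.

```python
import random


class BetaBandit:
    """Thompson sampling over arms with binary rewards."""

    def __init__(self, n_arms, seed=None):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms
        self.rng = random.Random(seed)

    def choose(self):
        # Draw one sample from each arm's Beta posterior and play the best draw.
        samples = [
            self.rng.betavariate(1 + w, 1 + l)
            for w, l in zip(self.wins, self.losses)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # The on-line update is just a counter increment per observation.
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1
```

Each update only increments a counter, which illustrates why the computation is cheap enough that parallelism buys little.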

To summarize,

- yes, you can implement algorithms like slope-one in an online framework,
but I wouldn't recommend wasting your time

- yes, you can implement approximations of matrix factorization in on-line
form.

- no, that probably isn't worthwhile

- yes, you can build on-line versions of cooccurrence counters and analyzers

- no, that probably isn't worthwhile

- no, it probably isn't a great idea to use Storm for the interactive
computation of recommendations.

- yes, there are other situations where on-line update of recommendation
models is worthwhile.  Storm might play well there.

Here are the links:

[1] http://research.microsoft.com/apps/pubs/default.aspx?id=122779
[2] http://tdunning.blogspot.com/2012/02/bayesian-bandits.html
[3] http://tdunning.blogspot.com/2012/10/references-for-on-line-algorithms.html
[4] http://tdunning.blogspot.com/2013/04/learning-to-rank-in-very-bayesian-way.html

On Sat, Jan 11, 2014 at 5:19 AM, Rafik NACCACHE <[email protected]> wrote:

> Thanks Klausen,
>
> Is it not possible to use Slope One? As far as I know, it is suited to
> on-line recommendation.
>
> Any Thoughts about that ?
>
> Thanks,
>
> Regards
>
>
> 2014/1/11 Klausen Schaefersinho <[email protected]>
>
>> Hi,
>>
>> to the best of my knowledge there is no publicly available Recommender
>> Engine for Storm. You could try to integrate some Java based RS systems
>> like Taste (also used in Hadoop Mahout). However the classic (user-item and
>> item-item) algorithms do not work well in a streaming architecture as you
>> would have to update some distance matrix for every event you observe. This
>> might be very expensive as the matrices might get quite big and you have to
>> share a constant state over the entire set of worker bolts.
>>
>> Cheers,
>>
>> Klausen
>>
>>
>> On Sat, Jan 11, 2014 at 11:05 AM, Rafik NACCACHE <
>> [email protected]> wrote:
>>
>>> Hi All,
>>>
>>> This is probably not the best place to ask,
>>>
>>> But does anyone mind sharing pointers to any recommender systems
>>> implemented on top of storm ?
>>>
>>> Sure there is trident-ML, but I did not see any collaborative filtering
>>> methods...
>>>
>>> Thank you for your advice guys,
>>>
>>> Regards
>>>
>>
>>
>
