I think Pat is just saying that

time(history_lookup) (1) + time(recommendation_calculation) (2) >
time(precalc_lookup) (3)

since (1) and (3) are assumed to be served by the same class of system
(key-value store, DB) with a single key, and (2) > 0.
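
With made-up numbers (purely illustrative assumptions, not measurements),
that argument is just:

# Hypothetical latencies in ms, purely for illustration -- not measured.
history_lookup = 2.0          # (1) fetch user history by a single key
recommendation_calc = 5.0     # (2) score against the cooccurrence model
precalc_lookup = 2.0          # (3) fetch precalculated recs by a single key

# (1) and (3) hit the same class of store with a single key, so they cost
# roughly the same; as long as (2) > 0 the left side is strictly larger.
assert history_lookup + recommendation_calc > precalc_lookup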

Ted is using a lot of information that is available at recommendation time
and not fetched from somewhere else ("context of delivery", geolocation). The
remaining question is why the recent history is available without a lookup,
which can only be the case if the recommendation calculation is embedded in
a bigger request cycle where the history is loaded somewhere else, or if it's
just stored in the browser.
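
(For illustration only: if the history is kept in the browser, every
recommendation request could simply carry it along, so the server never does
a history lookup. The field names below are made up.)

# Hypothetical payload the browser sends with each recommendation request;
# all field names are made up for illustration.
request_payload = {
    "recent_history": ["item123_view", "item456_buy", "item789_view"],
    "geo": {"lat": 52.52, "lon": 13.40},      # "context of delivery"
    "search_term": "running shoes",
}
# The recommender builds its Solr query straight from this payload -- no
# history lookup on the server side (see the query sketch further down).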

If you stored the classical (Netflix/Mahout) user-item history in the
browser and used an on-disk matrix structure like Lucene for the calculation,
you would end up in the same latency range.

I think the points are more:


1. Having more inputs than the classical item interactions
(geolocation->item, search_term->item, ...) can very easily be handled with a
search index storing these precalculated "association rules" (see the sketch
right after this list)

2. Precalculation per user is heavyweight, stale, and hard to do if the
context also plays a role (e.g. the site the user is on, because you would
have to prepare the Cartesian product of recommendations for every user and
context), while the "real time" approach can handle it





On Tue, May 21, 2013 at 2:00 AM, Ted Dunning <[email protected]> wrote:

> Inline answers.
>
>
> On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <[email protected]> wrote:
>
> > ...
> > You use the user history vector as a query?
>
>
> The most recent suffix of the history vector.  How much is used varies by
> the purpose.
>
>
> > This will be a list of item IDs and strength-of-preference values (maybe
> > 1s for purchases).
>
>
> Just a list of item x action codes.  No strength needed.  If you have 5
> point ratings, then you can have 5 actions for each item.  The weighting
> for each action can be learned.
>
>
> > The cooccurrence matrix has columns treated like terms and rows treated
> > like documents though both are really items.
>
>
> Well, they are different.  The rows are fields within documents associated
> with an item.  Other fields include ID and other things.  The contents of
> the field are the codes associated with the item-action pairs for each
> non-null column.  Usually there is only one action so this reduces to a
> single column per item.
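
(If I read this right, the indexed documents would look roughly like the one
below; the field names are my guess, and the call is just Solr's standard
JSON update handler.)

# My reading of the index layout: one document per item, whose indicator
# field holds the item-action codes that significantly cooccur with it.
# Field names are guesses; the endpoint is Solr's standard JSON update handler.
import requests

doc = {
    "id": "item123",
    "title": "Example product",                          # the "other things"
    "item_indicators": ["item456_buy", "item789_view"],  # sparsified cooccurrence row
}
requests.post("http://localhost:8983/solr/items/update?commit=true", json=[doc])
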
>
>
>
>
> > Does Solr support weighted term lists as queries or do you have to throw
> > out strength-of-preference?
>
>
> I prefer to throw it out even though Solr would not require me to do so.
>  The weights that I want can be encoded in the document index in any case.
>
>
> > I ask because there are cases where the query will have non '1.0' values.
> > When the strength values are just 1 the vector is really only a list of
> > terms (item IDs).
> >
>
> I really don't know of any cases where this is really true.  There are
> actions that are categorical.  I like to separate them out or to reduce to
> a binary case.
>
>
> >
> > This technique seems like using a doc as a query but you have reduced the
> > doc to the form of a vector of weighted terms. I was unaware that Solr
> > allowed weighted term queries. This is really identical to using Solr for
> > fast doc similarity queries.
> >
>
> It is really more like an ordinary query.  Typical recommendation queries
> are short since they are only recent history.
>
>
> >
> > ...
> >
> > Seems like you'd rule out browser based storage because you need the
> > history to train your next model.
>
>
> Nothing says that we can't store data in two places according to use.
>  Browser history is good for the part of the history that becomes the
> query.  Central storage is good for the mass of history that becomes input
> for analytics.
>
> > At least it would be in addition to a server-based storage of history.
>
>
> Yes.  In addition to.
>
>
> > Another reason you wouldn't rely only on browser storage is that it will
> > be occasionally destroyed. Users span multiple devices these days too.
> >
>
> This can be dealt with using cookie resurrection techniques.  Or by letting
> the user destroy their copy of the history if they like.
>
> > The user history matrix will be quite a bit larger than the user
> > recommendation matrix, maybe an order or two larger.
>
>
> I don't think so.  And it doesn't matter since this is reduced to
> significant cooccurrence and that is typically quite small compared to a
> list of recommendations for all users.
>
> > I have 20 recs for me stored but I've purchased 100's of items, and have
> > viewed 1000's.
> >
>
> 20 recs is not sufficient.  Typically you need 300 for any given context
> and you need to recompute those very frequently.  If you use geo-specific
> recommendations, you may need thousands of recommendations to have enough
> geo-dispersion.  The search engine approach can handle all of that on the
> fly.
>
> Also, the cached recs are user x (20-300) non-zeros.  The sparsified
> item-item cooccurrence matrix is item x 50.  Moreover, search engines are
> very good at compression.  If users >> items, then item x 50 is much
> smaller, especially after high quality compression (6:1 is a common
> compression ratio).
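
(With made-up user and item counts, the size difference Ted describes comes
out roughly like this:)

# Back-of-the-envelope comparison with made-up counts; the 300 and 50
# figures are the ones mentioned above.
users, items = 10_000_000, 1_000_000
precalc_recs = users * 300        # cached recs per user (and per context)
indicator_index = items * 50      # sparsified item-item cooccurrence rows

print(precalc_recs / indicator_index)   # ~60x more entries, before compression
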
>
> >
> > Given that you have to have the entire user history vector to do the
> > query, and given that this is still a lookup from an even larger matrix
> > than the recs/user matrix, and given that you have to do the lookup before
> > the Solr query, it can't be faster than just looking up pre-calculated
> > recs.
>
>
> None of this applies.  There is an item x 50 sized search index.  There is
> a recent history that is available without a lookup.  All that is required
> is a single Solr query and that can handle multiple kinds of history and
> geo-location and user search terms all in a single step.
>
>
>
> > In other words the query to produce the query will be more problematic
> > than the query to produce the result, right?
> >
>
> Nope.  No such thing, therefore cost = 0.
>
>
> > Something here may be "orders of magnitude" faster, but it isn't the
> > total elapsed time to return recs at runtime, right?
>
>
> Actually, it is.  Round trip of less than 10ms is common.  Precalculation
> goes away.  Export of recs nearly goes away.  Currency of recommendations
> is much higher.
>
>
> > Maybe what you are saying is the time to pre-calculate the recs is 0
> > since they are calculated at runtime, but you still have to create the
> > cooccurrence matrix so you still need something like Mahout on Hadoop to
> > produce a model, and you still need to index the model with Solr, and you
> > still need to look up user history at runtime. Indexing with Solr is
> > faster than loading a db (8 hours? They are doing something wrong) but the
> > query side will be slower unless I've missed something.
> >
>
> I am pretty sure you have.  The customers are definitely not dopes.  The
> problem is that precalculated recs are much, much bigger due to geo
> constraints.
>
>
> >
> > In any case you *have* introduced a realtime rec calculation. This is
> > able to use user history that may be seconds old and not yet reflected in
> > the training data (the cooccurrence matrix) and this is very interesting!
> >
> > >>
> > >> This will scale to thousands or tens of thousands of recommendations
> > >> per second against 10's of millions of items.  The number of users
> > >> doesn't matter.
> > >>
> > >
> >
> > Yes, no doubt, but the history lookup is still an issue unless I've
> > missed something. The NoSQL queries will scale to tens of thousands of
> > recs against 10s of millions of items but perhaps with larger, more
> > complex infrastructure? Not sure how Solr scales.
> >
> > Being semi-ignorant of Solr, intuition says that it's doing something to
> > speed things up, like using only part of the data somewhere to do
> > approximations. Have there been any performance comparisons of, say,
> > precision of one approach vs the other, or do they return identical
> > results?
> >
> >
> >
>
