we have a "cross recommender" in production for about 3 month now, with the
difference that we use lucene to build indices from map reduce directly
plus we do the same thing for 30+ customers, most of them with different
input data structure (field names, values).

we had something similar before (lucene, multiple cross relations) that
also used the similarity score (llr) with a custom similarity and
payloads, but we switched to pure "tedism" after some helpful comments
here. therefore i read this thread with a lot of interest.

what i can add from my experiences:

1. i find it way easier to talk about this not in matrix multiplication
language but with contingency tables (a and b, a and not b, not a and b,
not a and not b), and i also find the classical mahout similarity jobs
hard to use. this is probably due to my basic matrix math skills, but also
because working with matrices forces id usage, while the extracted items
are often text (search term, country, page section). thinking of these as
related terms automatically gives a "document" view of the item to be
recommended (the lucene doc), where description, name and everything else
is also just a field.
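
to make that concrete, here is a minimal sketch of the contingency table
view using the LogLikelihood helper from mahout-math (the counts are made
up, the item names are placeholders):

import org.apache.mahout.math.stats.LogLikelihood;

public class ContingencyExample {
  public static void main(String[] args) {
    // the four cells of the contingency table for items a and b
    long k11 = 13;     // a and b
    long k12 = 1000;   // a and not b
    long k21 = 1000;   // not a and b
    long k22 = 100000; // not a and not b
    // the llr score works directly on these four counts, no matrices needed
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    System.out.println("llr(a, b) = " + llr);
  }
}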

2. when doing a simple table it's just cooccurrences, marginals and
totals. since the dimension of the marginals is often not too big (items,
browsers, terms), we currently accumulate the counts in memory. maybe the
RowSimilarityJob works the same way. this could be swapped for a different
implementation like an on-disk hash table or even a count-min sketch if
the number of items gets too large. the main point is that the counting of
marginals can be done on the fly while emitting all cooccurrences.
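
roughly, that in-memory counting looks like this (a simplified sketch;
the tab-separated pair keying and the per-occurrence marginals are just
how we happen to do it):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceCounter {
  // cooccurrence counts keyed as "a\tb", marginals per item, plus a total
  private final Map<String, Long> cooccurrences = new HashMap<String, Long>();
  private final Map<String, Long> marginals = new HashMap<String, Long>();
  private long total = 0;

  // called once per user; marginals and the total are counted on the fly
  // while emitting all cooccurring pairs
  public void observe(List<String> itemsOfOneUser) {
    for (String a : itemsOfOneUser) {
      increment(marginals, a);
      total++;
      for (String b : itemsOfOneUser) {
        if (!a.equals(b)) {
          increment(cooccurrences, a + '\t' + b);
        }
      }
    }
  }

  private static void increment(Map<String, Long> counts, String key) {
    Long old = counts.get(key);
    counts.put(key, old == null ? 1L : old + 1L);
  }
}

the two hash maps are the only parts that would have to change for an
on-disk or sketch-based variant.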

3. above in the thread there was a tip on approximating similarity scores
by repeating terms. payloads are a better way to do this, and with lucene
4's doc values capability there shouldn't be any mahout similarity that
isn't expressible as a lucene similarity. maybe it would also be helpful
to provide a lucene delivery system for the "classic" mahout recommender
package. it adds so many possibilities for filtering and takes away a lot
of pain points like caching etc.
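
for example, a precomputed score can go into a doc values field at index
time, where a custom similarity or function query can pick it up later.
a sketch against the lucene 4 api (the field names are hypothetical):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class IndicatorDoc {
  // one lucene doc per item: the indicator field holds the cross terms,
  // the doc values field holds a precomputed score for use at query time
  public static Document build(String id, String crossLinks, float llrScore) {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));
    doc.add(new TextField("b_a_links", crossLinks, Field.Store.NO));
    doc.add(new FloatDocValuesField("llr_score", llrScore));
    return doc;
  }
}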

4. a big question is the frequency of rebuilding. while the relations can
often stay untouched for a day, the item data may change much more often
(item churn, new items). it is therefore beneficial to separate the two
and keep the possibility to rebuild the final index without calculating
all the similarities again (for very critical things this often means
querying some external source at search time to build up a lucene filter
that restricts the index).
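
as a sketch, such an external-source filter could look like this (lucene
4.2+ api; the "id" field name and wherever availableIds comes from, e.g.
an inventory service, are assumptions):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.Query;

public class AvailabilityFilter {
  // wraps a recommendation query so only items the external source
  // currently reports as available can match; the similarity index
  // itself never needs a rebuild for this
  public static Query restrict(Query recommendationQuery, List<String> availableIds) {
    List<Term> terms = new ArrayList<Term>();
    for (String id : availableIds) {
      terms.add(new Term("id", id));
    }
    return new FilteredQuery(recommendationQuery, new TermsFilter(terms));
  }
}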

besides that, i am very happy to see the ongoing effort on this topic and
hope that i can contribute something myself someday.

Cheers,
Johannes

On Mon, Aug 5, 2013 at 10:27 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel <[email protected]> wrote:
>
> > Yeah thought of that one too but it still requires each be ordered by
> Key,
> > in which case simultaneous iteration works in one pass I think.
> >
>
> Multipass does not require ordering by key.  Solr documents can be updated
> in any order.
>
>
> > If the DRMs are always sorted by Key you can iterate through each at the
> > same time, writing only when you have both fields or know there is a
> field
> > missing from one DRM. If you get the same key you write a combined doc,
> if
> > you have different ones, write out one sided until it catches up to the
> > other.
> >
>
> Yes.  Merge will work when files are ordered and split consistently.  I
> don't think we should be making that assumption.
>
>
> > Every DRM I've examined seems to be ordered by key and I assume that is
> > not an artifact of seqdumper. I'm using SequenceFileDirIterator so the
> part
> > file splits aren't a problem.
> >
>
> But with the co- and cross- occurrence stuff, file splits could be a
> problem.
>
>
> > A m/r join is pretty simple too but I'll go with non-m/r unless there is
> a
> > problem above.
> >
>
> The simplest join is to use Solr updates.  This would require a minimal
> amount of programming, but less than writing a merge program.
>
>
> > BTW the schema for the Solr csv is:
> > id,b_b_links,b_a_links
> > item1,"itemX itemY","itemZ"
> >
> > am I missing some "normal metadata"?
> >
>
> An item description is nice.
>
