On Sun, Feb 10, 2013 at 3:39 PM, Johannes Schulte <
[email protected]> wrote:

> ...
> i am currently implementing a system of the same kind, LLR sparsified
> "term"-cooccurrence vectors in lucene (since not a day goes by where i see
> Ted praising this).
>

(turns red)


> There are not only views and purchases, but also search terms, facets and a
> lot more textual information to be included in the cooccurrence matrix (as
> "input").
> That's why i went with the feature hashing framework in mahout. This gives
> small (hd/mem) user profiles and allows for reusing the vectors for click
> prediction and/or clustering.


This is a reasonable choice.  For recommendations, you might want to use
direct encoding since it can be simpler to build a search index for
recommending.
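
For concreteness, here is the sort of thing I mean by direct encoding.  This
is only an untested sketch and the field names and item ids are invented, but
the shape is one Lucene document per item, with the indicators that survive
thresholding indexed as ordinary terms, one field per action type:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public final class DirectEncodingExample {
  /**
   * One Lucene document per item; the indicator item ids that survived LLR
   * thresholding are indexed as plain space-separated terms.  Field names
   * and ids are invented for illustration.
   */
  public static Document indicatorDoc(String itemId,
                                      String purchaseIndicators,
                                      String viewIndicators) {
    Document doc = new Document();
    doc.add(new StringField("item_id", itemId, Field.Store.YES));
    doc.add(new TextField("purchase_indicators", purchaseIndicators, Field.Store.NO));
    doc.add(new TextField("view_indicators", viewIndicators, Field.Store.NO));
    return doc;
  }
}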


> The main difference is that there are only two
> fields in lucene with a lot of terms (numbers), representing the features.
> Two fields because i think predicting views (besides purchases) might in
> some cases be better than predicting nothing.
>

OK.


> I don't think it should make a big difference in scoring because in a
> vector space model used by most engines it's just, well, a vector space,
> and i don't know if the field norms make sense after stripping values from
> the term vectors with the LLR threshold.
>

Having separate fields is going to give separate total term counts.  That
seems better to me, but I have to confess I have never rigorously tested
that.


> @Ted
> > It is handy to simply use the binary values of the sparsified versions of
> >these and let the search engine handle the weighting of different
> >components at query time.
>
> Do you really want to omit the cooccurrence counts, which would become the
> term frequencies? How would the engine then weight different inputs against
> each other?
>

I like to threshold with LLR.  That gives me a binary matrix.  Then I
directly index that.
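
In code, the thresholding step is just a 2x2 contingency table per candidate
pair.  Here is an untested sketch; it follows the entropy form that Mahout's
LogLikelihood class uses, and the threshold value is whatever you choose:

public final class LlrSparsifier {

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  private static double entropy(long... counts) {
    long sum = 0;
    double result = 0.0;
    for (long count : counts) {
      result += xLogX(count);
      sum += count;
    }
    return xLogX(sum) - result;
  }

  /**
   * k11 = co-occurrences of A and B, k12 = occurrences of A without B,
   * k21 = occurrences of B without A, k22 = everything else.
   */
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy < matrixEntropy) {
      // round-off error can push the difference slightly negative
      return 0.0;
    }
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  /** Keep an indicator (set the matrix entry to 1) only if it clears the threshold. */
  public static boolean keep(long k11, long k12, long k21, long k22, double threshold) {
    return logLikelihoodRatio(k11, k12, k21, k22) > threshold;
  }
}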

The search engine provides very nice weights at this point.  I don't feel
the need to adjust those weights because they have roughly the same form as
learned weights are likely to have and because learning those weights would
almost certainly result in over-fitting unless I go to quite a lot of
trouble.
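
At query time this amounts to throwing the user's recent history at the
indicator field as one big disjunction and letting the default similarity do
the weighting.  A rough sketch against the Lucene 4.x API, reusing the
invented field name from above:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public final class IndicatorQuery {
  /**
   * Turns a user's recent items into a disjunction over the indicator field
   * and lets Lucene's default similarity handle the weighting.
   */
  public static TopDocs recommend(IndexSearcher searcher, List<String> recentItems, int n)
      throws IOException {
    BooleanQuery query = new BooleanQuery();
    for (String itemId : recentItems) {
      query.add(new TermQuery(new Term("purchase_indicators", itemId)),
                BooleanClause.Occur.SHOULD);
    }
    return searcher.search(query, n);
  }
}

If you want purchases to count for more than views, a query-time boost on the
purchase clause is the natural place to express that rather than anything
baked into the index.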

Also, I have heard that at least one head-to-head test found that the
native Solr term weighting actually out-performed several more intricate
and explicit weighting schemes.  That can't be taken as evidence that
Solr's weighting would perform better than whatever you have in mind, but
it does provide interesting meta-evidence that even a reasonably smart dev
team is not guaranteed to beat Solr's weighting by a large margin.  When
you sit down to architect your system, you have to decide where to spend
your time, and evidence like that helps you guess how much effort it would
take to reach different levels of performance.

> And, if anyone knows a
> 1. smarter way to index the cooccurrence counts in lucene than a
> tokenstream that emits a word k times for a cooccurrence count of k
>

You can use payloads or you can boost individual terms.
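
For the payload route, here is an untested sketch against the Lucene 4.x
analysis API: emit each co-occurring item id once and carry its count as a
payload, which Lucene's payload query support can fold into the score.

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class CooccurrenceTokenStream extends TokenStream {

  private final Iterator<Map.Entry<String, Integer>> entries;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public CooccurrenceTokenStream(Map<String, Integer> counts) {
    this.entries = counts.entrySet().iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!entries.hasNext()) {
      return false;
    }
    clearAttributes();
    Map.Entry<String, Integer> e = entries.next();
    termAtt.append(e.getKey());
    // store the count as a 4-byte payload instead of repeating the term k times
    int count = e.getValue();
    payloadAtt.setPayload(new BytesRef(new byte[] {
        (byte) (count >>> 24), (byte) (count >>> 16),
        (byte) (count >>> 8), (byte) count
    }));
    return true;
  }
}

That said, whether the counts are worth keeping at all is the question above;
after LLR thresholding I usually just drop them.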


> 2. way to avoid treating the (hashed) vector column indices as terms but
> reusing them? It's a bit weird hashing to an int and then having the lucene
> term dictionary treat them as strings, mapping to another int
>

Why do we care about this?  These tokens get put onto documents that have
additional data to help them make sense, but why do we care if the tokens
look like numbers?
