On Sun, Feb 10, 2013 at 3:39 PM, Johannes Schulte <[email protected]> wrote:
> ...
> i am currently implementing a system of the same kind, LLR sparsified
> "term"-cooccurrence vectors in lucene (since not a day goes by where i see
> Ted praising this).
>

(turns red)

> There are not only views and purchases, but also search terms, facets and a
> lot more textual information to be included in the cooccurrence matrix (as
> "input").
> That's why i went with the feature hashing framework in mahout. This gives
> small (hd/mem) user profiles and allows for reusing the vectors for click
> prediction and/or clustering.

This is a reasonable choice. For recommendations, you might want to use
direct encoding instead, since it can make it simpler to build a search
index for recommending.

> The main difference is that there's only two
> fields in lucene with a lot of terms (Numbers), representing the features.
> Two fields because i think predicting views (besides purchases) might in
> some cases be better than predicting nothing.
>

OK.

> I don't think it should make a big difference in scoring because in a
> vector space model used by most engines it's just, well, a vector space,
> and i don't know if the field norms make sense after stripping values from
> the term vectors with the LLR threshold.
>

Having separate fields is going to give separate total term counts. That
seems better to me, but I have to confess I have never rigorously tested
that.

> @Ted
>
> > It is handy to simply use the binary values of the sparsified versions
> > of these and let the search engine handle the weighting of different
> > components at query time.
>
> Do you really want to omit the cooccurrence counts which would become the
> term frequencies? how would the engine then weight different inputs
> against each other?
>

I like to threshold with LLR. That gives me a binary matrix. Then I directly
index that (see the first sketch at the end of this mail). The search engine
provides very nice weights at this point. I don't feel the need to adjust
those weights because they have roughly the same form as learned weights are
likely to have, and because learning those weights would almost certainly
result in over-fitting unless I go to quite a lot of trouble.

Also, I have heard that at least one head-to-head test found that the native
Solr term weighting actually out-performed several more intricate and
explicit weighting schemes. That can't be taken as evidence that Solr's
weightings would perform better than whatever you have in mind, but it does
provide interesting meta-evidence that even a reasonably smart dev team is
not guaranteed to beat Solr's weighting by a large margin. When you sit down
to architect your system, you need to decide where to spend your time, and
evidence like that helps you guess how much effort it would take to achieve
different levels of performance.

> And, if anyone knows a
>
> 1. smarter way to index the cooccurrence counts in lucene than a
> tokenstream that emits a word k times for a cooccurrence count of k
>

You can use payloads or you can boost individual terms (see the second
sketch at the end of this mail).

> 2. way to avoid treating the (hashed) vector column indices as terms but
> reusing them? It's a bit weird hashing to an int and then having the lucene
> term dictionary treating them as string, mapping to another int
>

Why do we care about this? These tokens get put onto documents that have
additional data to help them make sense, but why do we care if the tokens
look like numbers?
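To make the "threshold with LLR, then index the binary matrix" point a bit
more concrete, here is a rough sketch of the sort of thing I mean. It
assumes Mahout's LogLikelihood class and a Lucene 4.x index; the field
names, the contingency-table layout and the helper class are purely
illustrative, not code from any existing system:

    import java.io.File;
    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.apache.mahout.math.stats.LogLikelihood;

    public class IndicatorIndexer {

      // One document per item. Cooccurring items that clear the LLR threshold
      // are written as plain whitespace-separated tokens; the counts themselves
      // are deliberately dropped, which is what makes the matrix binary.
      public void indexItem(String itemId,
                            Map<String, long[]> cooccurrence, // other item -> {k11, k12, k21, k22}
                            double llrThreshold,
                            IndexWriter writer) throws IOException {
        StringBuilder indicators = new StringBuilder();
        for (Map.Entry<String, long[]> e : cooccurrence.entrySet()) {
          long[] k = e.getValue();
          if (LogLikelihood.logLikelihoodRatio(k[0], k[1], k[2], k[3]) > llrThreshold) {
            indicators.append(e.getKey()).append(' ');
          }
        }
        Document doc = new Document();
        doc.add(new StringField("item_id", itemId, Field.Store.YES));
        doc.add(new TextField("indicators", indicators.toString(), Field.Store.NO));
        writer.addDocument(doc);
      }

      // Whitespace analysis is enough because the indicator tokens are item ids.
      public static IndexWriter openWriter(File indexDir) throws IOException {
        IndexWriterConfig conf = new IndexWriterConfig(
            Version.LUCENE_41, new WhitespaceAnalyzer(Version.LUCENE_41));
        return new IndexWriter(FSDirectory.open(indexDir), conf);
      }
    }

At query time you just turn a user's recent history into a query against the
indicators field and let the engine's default weighting do the rest.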

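And on question 1, here is a minimal sketch of carrying the raw cooccurrence
count in a payload instead of emitting the token k times. Again this assumes
Lucene 4.x; the class and field names are mine, and whether you actually use
the payload at query time (for example through a payload-aware query such as
PayloadTermQuery) is a separate decision:

    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // Single-use stream: emits each cooccurring item exactly once and carries
    // its count in a payload, instead of repeating the token k times to fake
    // a term frequency of k.
    final class CountPayloadTokenStream extends TokenStream {

      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
      private final Iterator<Map.Entry<String, Integer>> counts;

      CountPayloadTokenStream(Map<String, Integer> cooccurrenceCounts) {
        this.counts = cooccurrenceCounts.entrySet().iterator();
      }

      @Override
      public boolean incrementToken() {
        if (!counts.hasNext()) {
          return false;
        }
        clearAttributes();
        Map.Entry<String, Integer> e = counts.next();
        termAtt.append(e.getKey());
        payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(e.getValue())));
        return true;
      }
    }

    // At indexing time, something like:
    //   doc.add(new TextField("indicators", new CountPayloadTokenStream(counts)));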