@Ted: Thanks for the input!

@Ken: I think the boosting is buried in getSpanScore() calling super.score(), but a quick look at the code didn't give me the exact reason. I remember setting includeSpanScore to false and the boost disappearing. The use of payloads, however, was only necessary because I put the similarity scores in there and then summed them up. In the "new" system I'm not going to use payloads (so far).
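Roughly the setup I mean, as an untested sketch against the Lucene 3.6 payload API (the field name, term and boost value are made up): as far as I can tell, the query boost only enters through the span score, so with includeSpanScore set to false it never reaches score().

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.payloads.AveragePayloadFunction;
    import org.apache.lucene.search.payloads.PayloadTermQuery;

    public class PayloadBoostSketch {

        public static PayloadTermQuery featureQuery() {
            // Hypothetical field/term: a hashed feature index indexed as a term,
            // with the similarity score carried in the payload.
            PayloadTermQuery q = new PayloadTermQuery(
                    new Term("features", "4711"),
                    new AveragePayloadFunction(),
                    false); // includeSpanScore = false -> score() returns only getPayloadScore()
            // The boost is folded into the span weight that getSpanScore() uses,
            // so with includeSpanScore = false it effectively disappears.
            q.setBoost(4.0f);
            return q;
        }
    }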
Thanks for the hint with the int field. I am using plain Lucene, but the semantics should be the same. I didn't know you could turn off the bonus features.
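If I read the plain-Lucene (3.6) side correctly, the rough equivalent of that precisionStep="0" field type is a NumericField with the precision step effectively disabled. An untested sketch (the field name is made up; note that an exact-match lookup then needs the trie-encoded form of the int, e.g. via NumericUtils, rather than its plain string):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;

    public class HashedFeatureDocSketch {

        public static Document document(int hashedFeatureIndex) {
            Document doc = new Document();
            // Integer.MAX_VALUE as precisionStep indexes a single term per value
            // (the plain-Lucene analogue of precisionStep="0" in the Solr schema),
            // i.e. no extra lower-precision terms for fast range queries are generated.
            NumericField feature = new NumericField("features", Integer.MAX_VALUE,
                    Field.Store.NO, true);
            feature.setIntValue(hashedFeatureIndex);
            doc.add(feature);
            return doc;
        }
    }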
On Mon, Feb 11, 2013 at 7:27 PM, Ken Krugler <[email protected]> wrote:

> On Feb 11, 2013, at 1:57am, Johannes Schulte wrote:
>
>> @Ken: Thanks for the hints... I am coming from a payload-based system, so I am aware of them; however, in the Lucene 3.6 branch boosting and payloads didn't work together (if you set PayloadTermQuery.setIncludeSpanScore to false, they were ignored).
>
> I assume you're talking about passing false for the includeSpanScore parameter in the PayloadTermQuery constructor, yes?
>
> Anyway, I'm surprised you ran into this issue. In the 3.6.0 source for PayloadTermQuery, the score() method is:
>
>     @Override
>     public float score() throws IOException {
>         return includeSpanScore ? getSpanScore() * getPayloadScore()
>                                 : getPayloadScore();
>     }
>
> So I would assume that you'd get the payload score (as expected). But I haven't actually tried to validate this.
>
>> Besides that, there is no performance issue here so far, so it's probably a fine way to go; I was just curious... As for the IntField / TrieIntField, all the range query / ordering benefits of it are overhead, since the integers just represent random indices into a vector. I might look into indexing the integer bytes rather than the string representation…
>
> I was proposing you use:
>
>     <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
>
> which doesn't generate the extra values that make range queries faster, but should store the data more efficiently.
>
> -- Ken
>
>> @Ted: You are probably right with choosing 1 as term frequency. I forgot that the most interesting information probably comes from the idf, and using cooccurrence counts as term frequency might make the combination with text searches infeasible, since the values lie in some totally different range. Also, I forgot that idf is per field, so I might go for separating the hashed values into their originating fields (search term, item_id, category_id). This would still allow recombining them later when a user profile has to be constructed.
>>
>>> I like to threshold with LLR. That gives me a binary matrix. Then I directly index that.
>>>
>>> The search engine provides very nice weights at this point. I don't feel the need to adjust those weights, because they have roughly the same form as learned weights are likely to have, and because learning those weights would almost certainly result in over-fitting unless I go to quite a lot of trouble.
>>>
>>> Also, I have heard that at least one head-to-head test found that the native Solr term weighting actually out-performed several more intricate and explicit weighting schemes. That can't be taken as evidence that Solr's weightings would perform better than whatever you have in mind, but it does provide interesting meta-evidence that a reasonably smart dev team is definitely not guaranteed to beat Solr's weighting by a large margin. When you sit down to architect your system, you need to make decisions about where to spend your time, and evidence like that is helpful to guess how much effort it would take to achieve different levels of performance.
>>
>> I am also thresholding the counts with LLR. Every time I do this I take a threshold of 10, since I loosely remember it being about the 99% margin of confidence in the chi-square distribution. I have no clue, however, whether anybody wants something like 99% for recommendations or whether 50% might be a better value. What's your experience on that?
>>
>> And do you apply a limit on the total number of docs per term, since there could be big boolean queries tearing down the performance?
>>
>> Thanks for all the input!
>>
>> On Mon, Feb 11, 2013 at 7:20 AM, Ted Dunning <[email protected]> wrote:
>>
>>> On Sun, Feb 10, 2013 at 3:39 PM, Johannes Schulte <[email protected]> wrote:
>>>
>>>> ...
>>>> I am currently implementing a system of the same kind, LLR-sparsified "term"-cooccurrence vectors in Lucene (since not a day goes by where I see Ted praising this).
>>>
>>> (turns red)
>>>
>>>> There are not only views and purchases, but also search terms, facets and a lot more textual information to be included in the cooccurrence matrix (as "input"). That's why I went with the feature hashing framework in Mahout. This gives small (hd/mem) user profiles and allows reusing the vectors for click prediction and/or clustering.
>>>
>>> This is a reasonable choice. For recommendations, you might want to use direct encoding, since it can be simpler to build a search index for recommending.
>>>
>>>> The main difference is that there are only two fields in Lucene with a lot of terms (numbers), representing the features. Two fields, because I think predicting views (besides purchases) might in some cases be better than predicting nothing.
>>>
>>> OK.
>>>
>>>> I don't think it should make a big difference in scoring, because in the vector space model used by most engines it's just, well, a vector space, and I don't know if the field norms make sense after stripping values from the term vectors with the LLR threshold.
>>>
>>> Having separate fields is going to give separate total term counts. That seems better to me, but I have to confess I have never rigorously tested that.
>>>
>>>> @Ted:
>>>>> It is handy to simply use the binary values of the sparsified versions of these and let the search engine handle the weighting of different components at query time.
>>>>
>>>> Do you really want to omit the cooccurrence counts, which would become the term frequencies? How would the engine then weight different inputs against each other?
>>>
>>> I like to threshold with LLR. That gives me a binary matrix. Then I directly index that.
>>>
>>> The search engine provides very nice weights at this point. I don't feel the need to adjust those weights, because they have roughly the same form as learned weights are likely to have, and because learning those weights would almost certainly result in over-fitting unless I go to quite a lot of trouble.
>>> Also, I have heard that at least one head-to-head test found that the native Solr term weighting actually out-performed several more intricate and explicit weighting schemes. That can't be taken as evidence that Solr's weightings would perform better than whatever you have in mind, but it does provide interesting meta-evidence that a reasonably smart dev team is definitely not guaranteed to beat Solr's weighting by a large margin. When you sit down to architect your system, you need to make decisions about where to spend your time, and evidence like that is helpful to guess how much effort it would take to achieve different levels of performance.
>>>
>>>> And, if anyone knows
>>>> 1. a smarter way to index the cooccurrence counts in Lucene than a token stream that emits a word k times for a cooccurrence count of k
>>>
>>> You can use payloads or you can boost individual terms.
>>>
>>>> 2. a way to avoid treating the (hashed) vector column indices as terms but reusing them? It's a bit weird hashing to an int and then having the Lucene term dictionary treat them as strings, mapping to another int.
>>>
>>> Why do we care about this? These tokens get put onto documents that have additional data to help them make sense, but why do we care if the tokens look like numbers?
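Coming back to the LLR threshold from the quoted thread: if I remember correctly, the 99% cutoff of the chi-square distribution with one degree of freedom is about 6.6, so 10 is somewhat stricter than that. A rough sketch of the thresholding with Mahout's LogLikelihood helper (assuming the long-typed signature from recent Mahout versions, and that the four counts of the 2x2 cooccurrence table are already at hand; the threshold value is just the one discussed above, not a recommendation):

    import org.apache.mahout.math.stats.LogLikelihood;

    public class LlrThresholdSketch {

        // Threshold discussed above; the ~99% chi-square cutoff (1 d.o.f.) would be about 6.6.
        private static final double LLR_THRESHOLD = 10.0;

        /**
         * Decides whether a cooccurrence cell survives sparsification.
         * k11 = A and B together, k12 = A without B,
         * k21 = B without A,      k22 = neither A nor B.
         */
        public static boolean keep(long k11, long k12, long k21, long k22) {
            double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
            return llr >= LLR_THRESHOLD;
        }
    }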
