Re: Replacing payloads for per-document-per-keyword scores
: Hoss guessed that we could override Term Frequency with PreAnalyzedField[1]
: for the per-keyword scores, since keywords (tags) always have a Term
: Frequency of 1 and the TF calculation is very fast. However it turns out
: that you can't[2] specify TF in the PreAnalyzedField.

Yeah ... sorry for steering you in the wrong direction there.

Mikhail's suggestion is dead on what I thought you could already do with PreAnalyzedField...

: if manipulating tf is a possible approach, why not extend
: KeywordTokenizer to make it work in the following manner:
:
: 3|wheel -> {wheel,wheel,wheel}
:
: It would let you supply your per-term-per-doc boosts as prefixes on the
: field values and multiply the terms internally during indexing.

...to be clear, this won't/shouldn't be as inefficient and memory bloated as it sounds, because you don't actually have to copy the Term N times -- you should just be able to have the TokenStream you return from your Tokenizer implement incrementToken() by simply incrementing a counter and returning true until it's been called N times, w/o modifying any other state.

Or at least ... that's my theory ... I've been wrong before.

-Hoss
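The counter idea described above can be modeled in plain Java. This is only a sketch of the control flow, not Lucene code: `RepeatingTermStream` is a hypothetical class that mirrors the shape of `incrementToken()` without extending Lucene's `TokenStream` or touching its attribute API, and it assumes the term and the repeat count N are already known.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Plain-Java model of a stream that emits one term N times.
 * Hypothetical class: it does NOT extend Lucene's TokenStream; it only
 * mirrors the "return true until exhausted" contract of incrementToken().
 */
final class RepeatingTermStream {
    private final String term;  // stored once, never copied per token
    private final int repeat;   // N: how many times the term is emitted
    private int emitted = 0;    // the only state that changes while iterating

    RepeatingTermStream(String term, int repeat) {
        if (repeat < 0) {
            throw new IllegalArgumentException("repeat must be >= 0");
        }
        this.term = term;
        this.repeat = repeat;
    }

    /** Returns true N times, then false; each call only bumps a counter. */
    boolean incrementToken() {
        if (emitted < repeat) {
            emitted++;
            return true;
        }
        return false;
    }

    /** The current term; the same String instance on every call. */
    String term() {
        return term;
    }

    /** Rewinds the counter so the stream can be consumed again. */
    void reset() {
        emitted = 0;
    }

    public static void main(String[] args) {
        RepeatingTermStream ts = new RepeatingTermStream("wheel", 3);
        List<String> seen = new ArrayList<>();
        while (ts.incrementToken()) {
            seen.add(ts.term());
        }
        System.out.println(seen); // prints [wheel, wheel, wheel]
    }
}
```

One caveat if this is carried over to a real Tokenizer: Lucene's DefaultSimilarity computes tf as the square root of the term frequency, so emitting a term N times would not scale the score linearly with N unless the Similarity is customized as well.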
Replacing payloads for per-document-per-keyword scores
Hello Hoss and the list,

We are currently using Lucene payloads to store per-document-per-keyword scores for our dataset. Our dataset consists of photos with keywords assigned (only once each) to them. The index is about 90 GB, running on 24-core machines with dedicated 10k SAS drives, and 16/32 GB allocated to the JVM.

When searching the payloads field, our 98th-percentile query time is 2 seconds, even at trivially low queries per second. I have asked several Lucene committers about this, and it's believed that the generality of the payloads implementation is the cause of the slowness.

Hoss guessed that we could override Term Frequency with PreAnalyzedField[1] for the per-keyword scores, since keywords (tags) always have a Term Frequency of 1 and the TF calculation is very fast. However it turns out that you can't[2] specify TF in the PreAnalyzedField.

Is there any other way to override Term Frequency at index time? If not, where in the code could this be implemented?

An obvious option is to repeat the keyword as many times as its payload score, but that would drastically increase the amount of data per document sent at index time.

I'd welcome any other per-document-per-keyword score solutions, or some way to speed up searching a payload field.

Thanks,
- Neil

[1] https://issues.apache.org/jira/browse/SOLR-1535
[2] https://issues.apache.org/jira/browse/SOLR-1535?focusedCommentId=13273501#comment-13273501
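To put a rough number on the data-size concern, here is a small sketch comparing the bytes sent for one keyword when it is repeated client-side versus sent once with a numeric prefix. The `N|keyword` form is borrowed from the reply in this thread; the class name, the space separator, and the example score of 100 are assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

/** Hypothetical helper comparing two ways to send a scored keyword. */
final class KeywordWireSize {

    /** Bytes needed when the keyword is repeated `score` times, space-separated. */
    static int repeatedBytes(String keyword, int score) {
        String value = String.join(" ", Collections.nCopies(score, keyword));
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed when the keyword is sent once as "score|keyword". */
    static int prefixedBytes(String keyword, int score) {
        String value = score + "|" + keyword;
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(repeatedBytes("wheel", 100)); // prints 599
        System.out.println(prefixedBytes("wheel", 100)); // prints 9
    }
}
```

The repeated form grows linearly with the score, while the prefixed form grows only with the number of digits in the score.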
Re: Replacing payloads for per-document-per-keyword scores
Hello Neil,

if manipulating tf is a possible approach, why not extend KeywordTokenizer to make it work in the following manner:

3|wheel -> {wheel,wheel,wheel}

It would let you supply your per-term-per-doc boosts as prefixes on the field values and multiply the terms internally during indexing.

The second consideration: have you looked at the Click Scoring Tools from Lucidworks as a relevant approach?

Regards

On Wed, May 16, 2012 at 12:02 AM, Neil Hooey nho...@gmail.com wrote:

> Hello Hoss and the list,
>
> We are currently using Lucene payloads to store per-document-per-keyword scores for our dataset. Our dataset consists of photos with keywords assigned (only once each) to them. The index is about 90 GB, running on 24-core machines with dedicated 10k SAS drives, and 16/32 GB allocated to the JVM.
>
> When searching the payloads field, our 98th-percentile query time is 2 seconds, even at trivially low queries per second. I have asked several Lucene committers about this, and it's believed that the generality of the payloads implementation is the cause of the slowness.
>
> Hoss guessed that we could override Term Frequency with PreAnalyzedField[1] for the per-keyword scores, since keywords (tags) always have a Term Frequency of 1 and the TF calculation is very fast. However it turns out that you can't[2] specify TF in the PreAnalyzedField.
>
> Is there any other way to override Term Frequency at index time? If not, where in the code could this be implemented?
>
> An obvious option is to repeat the keyword as many times as its payload score, but that would drastically increase the amount of data per document sent at index time.
>
> I'd welcome any other per-document-per-keyword score solutions, or some way to speed up searching a payload field.
>
> Thanks,
> - Neil
>
> [1] https://issues.apache.org/jira/browse/SOLR-1535
> [2] https://issues.apache.org/jira/browse/SOLR-1535?focusedCommentId=13273501#comment-13273501

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
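The `3|wheel` field-value convention suggested above could be parsed roughly as follows. This is a minimal sketch under stated assumptions: the weight is a positive integer, the first `|` separates weight from keyword, and a value with no `|` gets a weight of 1. `WeightedKeyword` and `parse` are hypothetical names, not part of Lucene or Solr; in a real KeywordTokenizer subclass the parsed weight would drive how many times the term is emitted.

```java
/** Hypothetical holder for a keyword and its per-document weight. */
final class WeightedKeyword {
    final String keyword;
    final int weight;

    WeightedKeyword(String keyword, int weight) {
        this.keyword = keyword;
        this.weight = weight;
    }

    /**
     * Parses "N|keyword" into its parts. A value without '|' is treated as
     * weight 1 (an assumption). A non-numeric prefix makes Integer.parseInt
     * throw NumberFormatException.
     */
    static WeightedKeyword parse(String fieldValue) {
        int bar = fieldValue.indexOf('|');
        if (bar < 0) {
            return new WeightedKeyword(fieldValue, 1);
        }
        int weight = Integer.parseInt(fieldValue.substring(0, bar));
        if (weight < 1) {
            throw new IllegalArgumentException("weight must be >= 1: " + fieldValue);
        }
        return new WeightedKeyword(fieldValue.substring(bar + 1), weight);
    }

    public static void main(String[] args) {
        WeightedKeyword k = WeightedKeyword.parse("3|wheel");
        System.out.println(k.keyword + " x" + k.weight); // prints wheel x3
    }
}
```

Splitting on the first `|` only means a keyword may itself contain that character after the prefix; whether that is desirable depends on the data.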