Re: Replacing payloads for per-document-per-keyword scores

2012-06-01 Thread Chris Hostetter
:  Hoss guessed that we could override Term Frequency with PreAnalyzedField[1]
:  for the per-keyword scores, since keywords (tags) always have a Term
:  Frequency of 1 and the TF calculation is very fast. However it turns out
:  that you can't[2] specify TF in the PreAnalyzedField.

Yeah ... sorry for steering you in the wrong direction there.

Mikhail's suggestion is exactly what i thought you could
already do with PreAnalyzedField...

: if manipulating tf is a possible approach, why don't extend
: KeywordTokenizer to make it work in the following manner:
: 
: 3|wheel -> {wheel,wheel,wheel}
: 
: it will allow supply your per-term-per-doc boosts as a prefixes for field
: values and multiply them during indexing internally.

..to be clear, this won't/shouldn't be as inefficient and memory-bloated
as it sounds, because you don't actually have to copy the Term N times --
you should just be able to have the TokenStream you return from your
Tokenizer implement incrementToken() by simply incrementing a counter and
returning true until it's been called N times, w/o modifying any other
state.

Or at least ... that's my theory ... i've been wrong before.

-Hoss
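
A minimal sketch of the counter idea described above, written in plain Java
rather than against the real Lucene classes. RepeatingStream, currentTerm()
and countOccurrences() are hypothetical names used only for illustration; the
sketch only mirrors the incrementToken() contract (return true while another
token is available) and is not a drop-in Tokenizer.

```java
/** Plain-Java model of the counter idea; deliberately does not use Lucene classes. */
final class RepeatedTermSketch {

    /** Hypothetical stand-in for a token stream that emits one term N times. */
    static final class RepeatingStream {
        private final String term;  // stored once, never copied
        private final int repeats;  // desired term frequency (N)
        private int emitted = 0;    // the only state that changes between calls

        RepeatingStream(String term, int repeats) {
            if (repeats < 0) {
                throw new IllegalArgumentException("repeats must be >= 0");
            }
            this.term = term;
            this.repeats = repeats;
        }

        /** Returns true for the first N calls and false afterwards. */
        boolean incrementToken() {
            if (emitted < repeats) {
                emitted++;
                return true;
            }
            return false;
        }

        /** The term visible to the consumer after each successful call. */
        String currentTerm() {
            return term;
        }
    }

    /** Drains the stream the way an indexer would, counting occurrences. */
    static int countOccurrences(RepeatingStream stream) {
        int tf = 0;
        while (stream.incrementToken()) {
            tf++;
        }
        return tf;
    }

    public static void main(String[] args) {
        RepeatingStream stream = new RepeatingStream("wheel", 3);
        System.out.println(stream.currentTerm() + " tf=" + countOccurrences(stream));
    }
}
```

The consumer sees the term three times, but the stream holds a single String
and one int counter, which is the point of the suggestion. Whether a real
Lucene Tokenizer behaves the same way would still need to be verified.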


Replacing payloads for per-document-per-keyword scores

2012-05-15 Thread Neil Hooey
Hello Hoss and the list,

We are currently using Lucene payloads to store per-document-per-keyword
scores for our dataset. Our dataset consists of photos with keywords
assigned (only once each) to them. The index is about 90 GB, running on
24-core machines with dedicated 10k SAS drives, and 16/32 GB allocated to
the JVM.

When searching the payloads field, our 98th-percentile query time is 2
seconds, even at trivially low queries per second. I have asked several
Lucene committers about this, and they believe the generality of the
payload implementation is the cause of the slowness.

Hoss guessed that we could override Term Frequency with PreAnalyzedField[1]
for the per-keyword scores, since keywords (tags) always have a Term
Frequency of 1 and the TF calculation is very fast. However it turns out
that you can't[2] specify TF in the PreAnalyzedField.

Is there any other way to override Term Frequency during index time? If
not, where in the code could this be implemented?

An obvious option is to repeat the keyword as many times as its payload
score, but that would drastically increase the amount of data per document
sent during index time.

I'd welcome any other per-document-per-keyword score solutions, or some way
to speed up searching a payload field.

Thanks,

- Neil

[1] https://issues.apache.org/jira/browse/SOLR-1535
[2] https://issues.apache.org/jira/browse/SOLR-1535?focusedCommentId=13273501#comment-13273501


Re: Replacing payloads for per-document-per-keyword scores

2012-05-15 Thread Mikhail Khludnev
Hello Neil,

If manipulating tf is a possible approach, why not extend
KeywordTokenizer to make it work in the following manner:

3|wheel -> {wheel,wheel,wheel}

That would let you supply your per-term-per-doc boosts as prefixes on the
field values and have the terms multiplied internally during indexing.
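
A minimal sketch of how such a prefixed value could be split into a repeat
count and a term, in plain Java without Lucene classes. The '|' delimiter
comes from the example above; the class and method names, the restriction to
positive integer boosts, and the default count of 1 for values without a
prefix are assumptions made for illustration.

```java
/** Plain-Java sketch of parsing "N|term" field values; not a Lucene tokenizer. */
final class BoostPrefixSketch {

    /** Hypothetical holder for a parsed value: the term and how often to emit it. */
    static final class BoostedTerm {
        final int count;
        final String term;

        BoostedTerm(int count, String term) {
            this.count = count;
            this.term = term;
        }
    }

    /** Splits "3|wheel" into count=3, term="wheel". */
    static BoostedTerm parse(String value) {
        int bar = value.indexOf('|');
        if (bar < 0) {
            // Assumption: a value without a prefix is emitted once.
            return new BoostedTerm(1, value);
        }
        // Throws NumberFormatException if the prefix is not an integer.
        int count = Integer.parseInt(value.substring(0, bar));
        if (count < 1) {
            throw new IllegalArgumentException("boost must be >= 1: " + value);
        }
        return new BoostedTerm(count, value.substring(bar + 1));
    }

    public static void main(String[] args) {
        BoostedTerm parsed = parse("3|wheel");
        System.out.println(parsed.term + " x" + parsed.count);
    }
}
```

Only the short prefixed value travels with the document at index time; the
repetition happens inside the analysis chain, so the amount of data sent per
document stays close to what it is today.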

Second, have you considered the Click Scoring Tools from LucidWorks as a
relevant approach?

Regards

-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com