Re: EdgeNGramTokenFilter, term position?

2007-09-17 Thread Chris Hostetter
: Should the EdgeNGramFilter use the same term position for the ngrams within a
: single token?

i can see the argument going both ways ... imagine a hypothetical 
CharSplitterTokenFilter that replaces each token in the stream with 
one token per character in the original token (ie: hello becomes 
h,e,l,l,o) ... should those tokens all have the same position?  they have a 
logical ordered flow to them, so in theory they are sequential ... but 
they did occupy the same space in the original token stream.

when in doubt: make it an option
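
A minimal sketch of what "make it an option" could look like for the hypothetical CharSplitterTokenFilter, written as plain Python rather than against the Lucene TokenFilter API. The names char_split and same_position are made up for illustration; positions are 1-based only to match what analysis.jsp displays.

```python
def char_split(tokens, same_position=False):
    """Replace each token with one token per character.

    Returns a list of (text, position) pairs. With same_position=False the
    characters are sequential; with same_position=True they all share the
    position the original token occupied.
    """
    out = []
    position = 0
    for token in tokens:
        if not token:
            continue  # an empty token produces nothing and takes no position
        position += 1  # the first character takes the next free position
        for i, ch in enumerate(token):
            if i > 0 and not same_position:
                position += 1  # sequential mode: one position per character
            out.append((ch, position))
    return out


print(char_split(["hello"]))
# [('h', 1), ('e', 2), ('l', 3), ('l', 4), ('o', 5)]
print(char_split(["hello"], same_position=True))
# [('h', 1), ('e', 1), ('l', 1), ('l', 1), ('o', 1)]
```

Either way the next original token starts one position after the last one emitted, so the flag only changes how much room a split token takes up in the stream.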



-Hoss



Re: EdgeNGramTokenFilter, term position?

2007-09-17 Thread Yonik Seeley
On 9/16/07, Ryan McKinley [EMAIL PROTECTED] wrote:
> Should the EdgeNGramFilter use the same term position for the ngrams
> within a single token?

It feels like that is the right approach.
I don't see value in having them sequential, and I can think of uses
for having them overlap.

-Yonik


EdgeNGramTokenFilter, term position?

2007-09-16 Thread Ryan McKinley
Should the EdgeNGramFilter use the same term position for the ngrams 
within a single token?


As is, the EdgeNGramTokenFilter increments the term position for each 
character.  In analysis.jsp, with the input hello, I get:


term position   1     2     3     4     5
term text       h     he    hel   hell  hello
term type       word  word  word  word  word
start,end       0,1   0,2   0,3   0,4   0,5


I would expect something more like what is generated from SOLR-357:

term position   1
term text       hello
                hell
                hel
                he
                h
term type       word
                prefix
                prefix
                prefix
                prefix
start,end       0,5
                0,4
                0,3
                0,2
                0,1
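
To make the two outputs easier to compare, here is a rough plain-Python model of front-edge n-gram generation with a switch for the position behaviour. It is only a sketch: edge_ngrams, min_gram, max_gram and same_position are illustrative names, not the actual EdgeNGramTokenFilter or SOLR-357 code, and it emits the shortest gram first and ignores the word/prefix term types.

```python
def edge_ngrams(token, min_gram=1, max_gram=None, same_position=True):
    """Return (position, text, start, end) tuples for the front-edge n-grams.

    same_position=False models one position per gram (the behaviour seen in
    analysis.jsp); same_position=True stacks every gram on position 1.
    """
    if max_gram is None:
        max_gram = len(token)
    grams = []
    position = 1
    for n in range(min_gram, min(max_gram, len(token)) + 1):
        grams.append((position, token[:n], 0, n))
        if not same_position:
            position += 1  # the next gram moves to the following position
    return grams


for gram in edge_ngrams("hello", same_position=False):
    print(gram)
# (1, 'h', 0, 1)
# (2, 'he', 0, 2)
# (3, 'hel', 0, 3)
# (4, 'hell', 0, 4)
# (5, 'hello', 0, 5)

print([g[0] for g in edge_ngrams("hello", same_position=True)])
# [1, 1, 1, 1, 1]
```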

This seems like it would affect slop queries, but I don't really 
understand them yet.
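
A rough way to see the effect on slop, using a simplified model rather than Lucene's real phrase scoring: for a two-term phrase, assume the slop needed is how far the second term sits from the position right after the first. The helpers index_positions and min_slop below are hypothetical, and the model only covers two-term phrases.

```python
def index_positions(text, same_position):
    """Map each front-edge n-gram of each whitespace token to its positions."""
    postings = {}
    position = 0
    for token in text.split():
        position += 1  # the first gram of a token takes the next position
        for n in range(1, len(token) + 1):
            if n > 1 and not same_position:
                position += 1  # sequential mode: each gram gets its own slot
            postings.setdefault(token[:n], []).append(position)
    return postings


def min_slop(postings, first, second):
    """Smallest slop at which the phrase "first second" matches in this model.

    Returns None when either term is missing from the postings.
    """
    if first not in postings or second not in postings:
        return None
    return min(abs((p2 - 1) - p1)
               for p1 in postings[first]
               for p2 in postings[second])


sequential = index_positions("hello world", same_position=False)
stacked = index_positions("hello world", same_position=True)

print(min_slop(sequential, "hello", "world"))  # 4
print(min_slop(stacked, "hello", "world"))     # 0
print(min_slop(sequential, "hel", "wor"))      # 4
print(min_slop(stacked, "hel", "wor"))         # 0
```

Under these assumptions, sequential positions push "world" four slots further from "hello", so an exact phrase query on the two words would need a slop of 4, while stacked positions leave the tokens adjacent.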


thanks
ryan