Re: Question about Solr Fieldtypes, Chaining of Tokenizers

2010-12-06 Thread Matthew Hall

Yes, that's my conclusion as well, Grant.

As for the example output:

The symposium of TgThe(RX3fg+and) gene studies

Should end up tokenizing to:

symposium tg the rx3fg and gene studi

Assuming I guessed right on the stemming.

Anyhow, thanks for the confirmation guys.

Matt

On 12/4/2010 8:18 PM, Grant Ingersoll wrote:

Could you expand on your example and show the output you want? FWIW, you could
simply write a token filter that does the same thing as the WhitespaceTokenizer.

-Grant

--
Grant Ingersoll
http://www.lucidimagination.com

Re: Question about Solr Fieldtypes, Chaining of Tokenizers

2010-12-04 Thread Grant Ingersoll

Could you expand on your example and show the output you want?  FWIW, you could 
simply write a token filter that does the same thing as the WhitespaceTokenizer.

-Grant
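
A rough illustration of the suggestion above, written as a plain Python generator rather than against the Lucene API (in Lucene this would be a Java TokenFilter subclass). The function name and the sample tokens are made up for the example. It only shows the idea of a filter stage that re-splits each incoming token on whitespace; a real filter would also have to decide on position increments and offsets for the pieces it emits, which this sketch ignores.

```python
def whitespace_split_filter(tokens):
    """Filter stage: re-split every incoming token on runs of whitespace.

    Hypothetical stand-in for a custom token filter; not Lucene API.
    """
    for token in tokens:
        # str.split() with no argument splits on any whitespace run
        # and drops empty pieces, like a whitespace tokenizer would.
        for piece in token.split():
            yield piece


if __name__ == "__main__":
    # Tokens as they might look after punctuation was replaced by spaces.
    upstream = ["symposium", "tgthe rx3fg and", "gene"]
    print(list(whitespace_split_filter(upstream)))
```

Because the stage consumes tokens rather than raw text, it can sit at the end of a filter chain, which is the place a second tokenizer is not allowed.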

--
Grant Ingersoll
http://www.lucidimagination.com

Re: Question about Solr Fieldtypes, Chaining of Tokenizers

2010-12-04 Thread Robert Muir

On Fri, Dec 3, 2010 at 1:14 PM, Matthew Hall mh...@informatics.jax.org wrote:
 Oh, and let me add that the WordDelimiterFilter comes really close to what I
 want, but since we are unwilling to promote our solr version to the trunk
 (we are on the 1.4x) version atm, the inability to turn off the automatic
 phrase queries makes it a no go.  We need to be able to make searches on
 left/right match right/left.


If this is the case, it doesn't matter what your analysis does; it won't work.

Your only workaround, if you cannot upgrade, is to use PositionFilter
at query time... but then you cannot use phrase queries at all.
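
To make the point concrete, here is a toy model in Python (not Lucene's query classes; the function names are invented for the example). It assumes a query term such as left/right is split into two tokens at consecutive positions and then run as a phrase, which requires the same order in the document. If all query tokens are instead placed at the same position, which is roughly what PositionFilter does, they behave like alternatives and order no longer matters.

```python
def phrase_matches(doc_tokens, query_tokens):
    """True if query_tokens occur in doc_tokens consecutively and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def any_term_matches(doc_tokens, query_tokens):
    """True if at least one query token occurs anywhere in the document."""
    return any(token in doc_tokens for token in query_tokens)


if __name__ == "__main__":
    indexed = ["right", "left"]   # document contained "right/left"
    query = ["left", "right"]     # user searched for "left/right"
    print(phrase_matches(indexed, query))    # phrase semantics: order matters
    print(any_term_matches(indexed, query))  # flattened positions: order ignored
```

The second function also shows the cost mentioned above: once order is ignored for every query, genuine phrase queries are no longer possible.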

Question about Solr Fieldtypes, Chaining of Tokenizers

2010-12-03 Thread Matthew Hall

Hey folks, I'm working with a fairly specific set of requirements for 
our corpus that needs a somewhat tricky text type for both indexing and 
searching.


The chain currently looks like this:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
        pattern="(.*?)(\p{Punct}*)$"
        replacement="$1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="English"
        protected="protwords.txt"/>
<filter class="solr.PatternReplaceFilterFactory"
        pattern="\p{Punct}"
        replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

You will notice that I'm trying to add a second tokenizer at the very 
end of this chain. This is because the final filter replaces 
punctuation with whitespace, and at that point I'd like to break the 
resulting tokens up further into smaller tokens.


The reason for this is that our corpus mixes normal English words with 
scientific notation.  For example, you could expect a string like "The 
symposium of TgThe(RX3fg+and) gene studies" being added to the index, 
and parts of such phrases being searched on.


We want to be able to remove the stopwords in the mostly English parts 
of these types of statements, which the whitespace tokenizer, followed 
by removing trailing punctuation, followed by the stop filter, takes 
care of.  We do not want to remove references to genetic information 
contained in allele symbols and the like.
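
For readers following along, here is a minimal Python sketch of what the chain above is meant to do, including the final re-split that the schema cannot express. It is an approximation under stated assumptions, not Solr: the stopword set is a made-up stand-in for stopwords.txt, the Snowball stemming step is left out (so "studies" stays "studies"), and Python's string.punctuation is used in place of Java's \p{Punct}, which covers the same ASCII characters by default.

```python
import re
import string

# ASCII punctuation, escaped so it can be used inside a character class.
PUNCT = re.escape(string.punctuation)
TRAILING_PUNCT = re.compile(f"[{PUNCT}]+$")
ANY_PUNCT = re.compile(f"[{PUNCT}]")

# Hypothetical stand-in for stopwords.txt.
STOPWORDS = {"the", "of", "and"}


def analyze(text, stopwords=STOPWORDS):
    """Approximate the chain from the message; stemming is omitted."""
    tokens = text.split()                                  # whitespace tokenizer
    tokens = [TRAILING_PUNCT.sub("", t) for t in tokens]   # strip trailing punctuation
    tokens = [t.lower() for t in tokens]                   # lowercase
    # Drop stopwords (and any token emptied by the previous steps).
    tokens = [t for t in tokens if t and t not in stopwords]
    tokens = [ANY_PUNCT.sub(" ", t) for t in tokens]       # punctuation -> space
    # The step a second tokenizer was wanted for: split again on whitespace.
    return [piece for t in tokens for piece in t.split()]


if __name__ == "__main__":
    print(analyze("The symposium of TgThe(RX3fg+and) gene studies"))
    # ['symposium', 'tgthe', 'rx3fg', 'and', 'gene', 'studies']
```

Two things stand out in the result. The "and" inside the allele symbol survives, because stopword removal runs before the symbol is broken apart, which is the behaviour asked for. And "TgThe" comes out as a single token "tgthe": nothing in this chain splits on a case change, so a filter that does so would be needed if "tg" and "the" are wanted as separate tokens.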


Sadly, as far as I can tell, you cannot chain tokenizers in 
schema.xml, so does anyone have suggestions on how this could be 
accomplished?


Oh, and let me add that the WordDelimiterFilter comes really close to 
what I want, but since we are unwilling to move our Solr version up to 
trunk (we are on 1.4.x at the moment), the inability to turn off the 
automatic phrase queries makes it a no-go.  We need a search on 
left/right to match right/left.


My searches through the old material on this subject aren't really 
showing me much except some advice on using copyField.  But my 
understanding is that this simply takes the original input to the 
field and then analyzes it in two different ways, depending on the 
field definitions.  It would be very nice if it copied the already 
analyzed version of the text... but that's not what it's doing, right?
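
That understanding of copyField can be pictured with a small Python model (the field names and the two analyzers are invented for the example; this is not Solr code). Every destination field receives the same raw input and runs its own analysis on it, so the token output of one field never becomes the input of another.

```python
def index_document(raw_text, analyzers):
    """Analyze the same raw text once per field, independently.

    `analyzers` maps a field name to a function from str to a token list.
    """
    return {field: analyze(raw_text) for field, analyze in analyzers.items()}


# Two hypothetical field types fed from one source value.
ANALYZERS = {
    "text_ws": lambda s: s.split(),             # whitespace only
    "text_lower": lambda s: s.lower().split(),  # lowercase, then whitespace
}

if __name__ == "__main__":
    print(index_document("TgThe(RX3fg+and) Gene", ANALYZERS))
```

In this model there is no way for "text_lower" to start from the tokens of "text_ws", which is why copying a field does not help with chaining two analysis passes.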


Thanks for any advice on this matter.

Matt
