Re: Letter-number transitions - can this be turned off

Mike Klaas Mon, 01 Oct 2007 11:18:26 -0700

On 30-Sep-07, at 12:47 PM, F Knudson wrote:


Is there a flag to disable the letter-number transition in the
solr.WordDelimiterFilterFactory? We are indexing category codes, thesaurus
codes for which this letter number transition makes no sense.  It is
bloating the indexing (which is already large).


Have you considered using a different analyzer?

If you want to continue using WDF, you could make a quick change around line 320:


            if (splitOnCaseChange == 0 &&
                (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
              // ALPHA->ALPHA: always ignore if case isn't considered.
            } else if ((lastType & UPPER)!=0 && (type & LOWER)!=0) {
              // UPPER->LOWER: Don't split
            } else {
            ...

by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and ignores it.
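
To make the suggested clause concrete, here is a rough standalone sketch. It is not the actual WordDelimiterFilter source: the class name, the splitOnNumerics flag, the DIGIT constant and the simplified charType are assumptions made for illustration, and delimiter handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone illustration only; not the real Solr WordDelimiterFilter. */
final class TransitionSketch {
    // Character-type bit flags, modeled on the names in the quoted snippet.
    static final int LOWER = 0x01;
    static final int UPPER = 0x02;
    static final int DIGIT = 0x04; // stands in for "NUMERIC" (assumed name)
    static final int ALPHA = LOWER | UPPER;

    static int charType(char c) {
        if (Character.isDigit(c)) return DIGIT;
        if (Character.isUpperCase(c)) return UPPER;
        return LOWER; // simplification: everything else is treated as lowercase alpha
    }

    /** Returns true if a token boundary belongs between two adjacent characters. */
    static boolean isBreak(int lastType, int type,
                           boolean splitOnCaseChange, boolean splitOnNumerics) {
        if ((lastType & type) != 0) return false; // same character class: no transition
        if (!splitOnCaseChange && (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
            return false; // ALPHA->ALPHA: ignore if case isn't considered
        }
        if ((lastType & UPPER) != 0 && (type & LOWER) != 0) {
            return false; // UPPER->LOWER: don't split
        }
        // Added clause (hypothetical flag): ignore ALPHA->DIGIT and DIGIT->ALPHA.
        boolean alphaToDigit = (lastType & ALPHA) != 0 && (type & DIGIT) != 0;
        boolean digitToAlpha = (lastType & DIGIT) != 0 && (type & ALPHA) != 0;
        if (!splitOnNumerics && (alphaToDigit || digitToAlpha)) {
            return false;
        }
        return true;
    }

    /** Splits s at every position where isBreak says a boundary belongs. */
    static List<String> split(String s, boolean splitOnCaseChange, boolean splitOnNumerics) {
        List<String> parts = new ArrayList<>();
        if (s.isEmpty()) return parts;
        int start = 0;
        int lastType = charType(s.charAt(0));
        for (int i = 1; i < s.length(); i++) {
            int type = charType(s.charAt(i));
            if (isBreak(lastType, type, splitOnCaseChange, splitOnNumerics)) {
                parts.add(s.substring(start, i));
                start = i;
            }
            lastType = type;
        }
        parts.add(s.substring(start));
        return parts;
    }
}
```

With splitOnNumerics set to false in this sketch, split("r2d2", true, false) keeps "r2d2" as a single token, while split("PowerShot", true, false) still splits on the case change.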

Another approach that I am using locally is to maintain the transitions, but force tokens to be a minimum size (so r2d2 doesn't tokenize to four tokens but arrr2222deee2222 does).
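
One way to read the minimum-size idea is sketched below. This is a guess at the behaviour, not the logic of the SOLR-293 patch: the class name, the minLength parameter and the merge rule (a boundary is honoured only when the pieces on both sides are long enough) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone illustration only; not the SOLR-293 patch. */
final class MinRunSketch {
    /** Splits s into maximal runs of digits and non-digits. */
    static List<String> runs(String s) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            boolean atEnd = (i == s.length());
            if (atEnd || Character.isDigit(s.charAt(i)) != Character.isDigit(s.charAt(i - 1))) {
                out.add(s.substring(start, i));
                start = i;
            }
        }
        return out;
    }

    /**
     * Keeps a letter/digit boundary only if the text accumulated before it and
     * the run after it are both at least minLength characters (hypothetical rule).
     */
    static List<String> split(String s, int minLength) {
        List<String> runs = runs(s);
        List<String> tokens = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (int i = 0; i < runs.size(); i++) {
            pending.append(runs.get(i));
            boolean last = (i == runs.size() - 1);
            boolean nextLongEnough = !last && runs.get(i + 1).length() >= minLength;
            if (last || (pending.length() >= minLength && nextLongEnough)) {
                tokens.add(pending.toString());
                pending.setLength(0);
            }
        }
        return tokens;
    }
}
```

Under these assumptions, split("r2d2", 2) leaves "r2d2" whole, and split("arrr2222deee2222", 2) yields the four tokens arrr, 2222, deee, 2222.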


There is a patch here: http://issues.apache.org/jira/browse/SOLR-293

If you vote for it, I promise to get it in for 1.3 <g>

-Mike
