On 30-Sep-07, at 12:47 PM, F Knudson wrote:


Is there a flag to disable the letter-number transition in the
solr.WordDelimiterFilterFactory? We are indexing category codes, thesaurus
codes for which this letter number transition makes no sense.  It is
bloating the indexing (which is already large).

Have you considered using a different analyzer?

If you want to continue using WDF, you could make a quick change around line 320:

            if (splitOnCaseChange == 0 &&
                (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
              // ALPHA->ALPHA: always ignore if case isn't considered.

            } else if ((lastType & UPPER)!=0 && (type & LOWER)!=0) {
              // UPPER->LOWER: Don't split
            } else {

            ...

by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and ignores it.
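
As a rough illustration of that extra clause, here is a standalone sketch. It is not the actual WordDelimiterFilter source: the flag values, the DIGIT name, the splitOnNumerics switch and the ignoreTransition helper are all assumptions made up for this example, and the elided else branch is left alone (the method only reports whether a transition would be ignored before reaching it).

```java
// Standalone sketch; names and values are assumptions, not Solr's code.
class TransitionSketch {
    // Character-type bit flags (assumed values for illustration).
    static final int LOWER = 0x01;
    static final int UPPER = 0x02;
    static final int DIGIT = 0x04;
    static final int ALPHA = LOWER | UPPER;

    // Returns true when the lastType -> type transition should NOT cause a split.
    // splitOnNumerics is a hypothetical switch for the new clause.
    static boolean ignoreTransition(int lastType, int type,
                                    int splitOnCaseChange, boolean splitOnNumerics) {
        if (splitOnCaseChange == 0
                && (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
            // ALPHA->ALPHA: always ignore if case isn't considered.
            return true;
        } else if ((lastType & UPPER) != 0 && (type & LOWER) != 0) {
            // UPPER->LOWER: don't split.
            return true;
        } else if (!splitOnNumerics
                && (((lastType & ALPHA) != 0 && (type & DIGIT) != 0)
                 || ((lastType & DIGIT) != 0 && (type & ALPHA) != 0))) {
            // New clause: ALPHA->DIGIT or DIGIT->ALPHA, ignore the transition.
            return true;
        }
        // Anything else falls through to the existing (elided) handling.
        return false;
    }
}
```

With splitOnNumerics set to false, a code like A12 would stay a single token as far as this check is concerned; with it set to true the behaviour is unchanged.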

Another approach that I am using locally is to maintain the transitions, but force tokens to be a minimum size (so r2d2 doesn't tokenize to four tokens but arrr2222deee2222 does).
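
To make the minimum-size idea concrete, here is a small sketch. It is not the SOLR-293 patch; the class name, the split method and the all-or-nothing rule (keep the whole token if any piece would be shorter than minSize) are assumptions chosen only to reproduce the r2d2 example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Standalone sketch; not the SOLR-293 implementation.
class MinSizeSplitSketch {
    // Splits on letter<->digit transitions, but only if every resulting
    // piece is at least minSize characters long. Otherwise the token is
    // returned unsplit.
    static List<String> split(String token, int minSize) {
        if (token.isEmpty()) {
            return Collections.singletonList(token);
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= token.length(); i++) {
            boolean atEnd = (i == token.length());
            if (atEnd || Character.isDigit(token.charAt(i))
                         != Character.isDigit(token.charAt(i - 1))) {
                if (i - start < minSize) {
                    // A piece would be too short: keep the token whole.
                    return Collections.singletonList(token);
                }
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        return parts;
    }
}
```

With minSize of 2, split("r2d2", 2) keeps r2d2 as one token, while split("arrr2222deee2222", 2) yields arrr, 2222, deee and 2222.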

There is a patch here: http://issues.apache.org/jira/browse/SOLR-293

If you vote for it, I promise to get it in for 1.3 <g>

-Mike
