I'm attempting to make use of PatternReplaceCharFilterFactory, but am running into issues on both 1.4.1 (I ported it) and on a nightly build (4.0-2010-07-27). It seems that on a real query the charFilter isn't executed prior to the tokenizer.
I modified the example configuration included in the distribution, adding the following fieldType to schema.xml and mapping a new field to it.

<!-- Field definition for name text field -->
<fieldtype name="nameText" class="solr.TextField">
  <analyzer>
    <!-- Replace (char & char) or (char and char) with (char&char) -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(.*?)(\b(\w) (&amp;|and) (\w))(.*?)" replacement="$1$3&amp;$5$6"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="name" type="nameText" indexed="true" stored="true" required="false" omitNorms="true" />

I validated that the regex works properly outside of Solr using just Java. The regex attempts to normalize single word characters around an '&' into something consistent for searching. For example, it will turn "A & B Company" into "A&B Company". The user can then search on "A&B", "A and B", or "A & B" and the proper result will be located.

However, when I import a document with "A & B Company", I can never locate it with the query "A & B". It can be located with the query "A&B". When I run analysis.jsp it works properly and matches using any of the combinations. From this I concluded that the field was being indexed properly, but for some reason the query wasn't applying the regex properly.

I hooked up a debugger and could see a difference between how the analyzer applied the charFilter and how the query applied it. When the analyzer invoked PatternReplaceCharFilterFactory.create(CharStream), the entire field was provided in a single call. When the query invoked PatternReplaceCharFilterFactory.create(CharStream), it invoked it three times with three separate tokens (A, &, B). Because of this the regex never sees the full string, so it can never match.

I'm using the following encoded URLs to perform the query.
This works:

http://localhost:8983/solr/select?q=name:%28a%26b%29

But this doesn't:

http://localhost:8983/solr/select?q=name:%28a+%26+b%29

Why is the query parser tokenizing the name field prior to the charFilter getting a chance to perform processing?
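
For reference, the standalone regex check and the per-token behaviour seen in the debugger can be reproduced with something like the following. This is only a minimal sketch using plain java.util.regex, not Solr's own classes, so it assumes that applying the pattern with replaceAll is a fair stand-in for what the charFilter does. The class name NameRegexCheck and the normalize helper are made up for illustration.

```java
import java.util.regex.Pattern;

public class NameRegexCheck {
    // Same pattern and replacement as the charFilter attributes,
    // written here as Java string literals.
    static final Pattern PATTERN =
            Pattern.compile("(.*?)(\\b(\\w) (&|and) (\\w))(.*?)");
    static final String REPLACEMENT = "$1$3&$5$6";

    // Hypothetical helper: applies the replacement to whatever text it is given.
    static String normalize(String input) {
        return PATTERN.matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        // Whole field value in one call, as seen on the analysis side.
        System.out.println(normalize("A & B Company"));   // A&B Company
        System.out.println(normalize("A and B Company")); // A&B Company

        // One call per whitespace-separated piece, as seen on the query side.
        // None of the pieces contains the full "x & y" shape, so nothing changes.
        for (String piece : new String[] {"A", "&", "B"}) {
            System.out.println(normalize(piece));         // A, then &, then B
        }
    }
}
```

When the whole string is passed in, the regex rewrites it; when the three pieces are passed in separately, each one comes back unchanged, which matches what I see for the "A & B" query.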