Field getting tokenized prior to charFilter on select query

Andrew Chalupa Thu, 12 Aug 2010 11:54:24 -0700

I'm attempting to use PatternReplaceCharFilterFactory, but am running into
issues on both 1.4.1 (to which I ported it) and on the nightly build
(4.0-2010-07-27).  It seems that on a real query the charFilter isn't
executed prior to the tokenizer.


I modified the example configuration included in the distribution with the 
following fieldType in schema.xml and mapped a new field to it. 
    <!-- Field definition for name text field -->
    <fieldtype name="nameText" class="solr.TextField">
      <analyzer>
        <!-- Replace (char & char) or (char and char) with (char&char) -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(.*?)(\b(\w) (&amp;|and) (\w))(.*?)"
            replacement="$1$3&amp;$5$6"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
    </fieldtype>    
    
    <field name="name" type="nameText" indexed="true" stored="true"
           required="false" omitNorms="true" />
    
I validated that the regex works properly outside of Solr using just Java.  The 
regex attempts to normalize single word characters around an '&' into something 
consistent for searching.  For example, it will turn "A & B Company" into "A&B 
Company".  The user can then search on "A&B", "A and B", or "A & B" and the 
proper result will be located.
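
For reference, a minimal sketch of that kind of standalone check.  The class
and method names are made up for illustration, the XML entity &amp; is decoded
to a plain '&', and it assumes the charFilter applies the pattern roughly the
way Matcher.replaceAll does:

```java
import java.util.regex.Pattern;

public class NameRegexCheck {
    // Same pattern and replacement as in the charFilter, with &amp; decoded to &.
    static final Pattern PATTERN =
        Pattern.compile("(.*?)(\\b(\\w) (&|and) (\\w))(.*?)");
    static final String REPLACEMENT = "$1$3&$5$6";

    // Apply the replacement to the whole input, as happens when the
    // entire field value is handed over in one piece.
    static String normalize(String input) {
        return PATTERN.matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("A & B Company"));   // A&B Company
        System.out.println(normalize("A and B Company")); // A&B Company
        System.out.println(normalize("A&B Company"));     // A&B Company
    }
}
```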

However, when I import a document with "A & B Company" I can never find it 
with the query "A & B".  It can be found with the query "A&B".  When I run 
analysis.jsp it works properly and matches using any of the combinations.

So from this I concluded that it was being indexed properly, but for some 
reason the query wasn't applying the regex properly.  I hooked up a debugger 
and could see a difference in how the analyzer was applying the charFilter and 
how the query was applying the charFilter.  When the analyzer invoked 
PatternReplaceCharFilterFactory.create(CharStream) the entire field was 
provided in a single call.  When the query invoked 
PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times with 3 
separate tokens (A, &, B).  Because of this the regex won't ever locate the 
full string in the field.
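
A small sketch of why the per-token calls can never match.  This is not Solr
code: the whitespace split below is only a hypothetical stand-in for what I saw
in the debugger, and the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PerTokenRegexCheck {
    // Same pattern and replacement as in the charFilter, with &amp; decoded to &.
    static final Pattern PATTERN =
        Pattern.compile("(.*?)(\\b(\\w) (&|and) (\\w))(.*?)");
    static final String REPLACEMENT = "$1$3&$5$6";

    static String normalize(String input) {
        return PATTERN.matcher(input).replaceAll(REPLACEMENT);
    }

    // Hypothetical stand-in for the query path: split on whitespace first,
    // then run the replacement on each piece separately.
    static List<String> normalizeEachPiece(String query) {
        List<String> result = new ArrayList<>();
        for (String piece : query.split("\\s+")) {
            result.add(normalize(piece));
        }
        return result;
    }

    public static void main(String[] args) {
        // Whole string in one call: the pattern sees "A & B" and rewrites it.
        System.out.println(normalize("A & B"));          // A&B
        // One call per piece: no piece contains the spaces the pattern needs.
        System.out.println(normalizeEachPiece("A & B")); // [A, &, B]
    }
}
```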

I'm using the following encoded URLs to perform the query.

This works:
http://localhost:8983/solr/select?q=name:%28a%26b%29

But this doesn't:
http://localhost:8983/solr/select?q=name:%28a+%26+b%29
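
To make the difference between the two q values easier to see, here is a
minimal sketch that decodes them with the JDK's URLDecoder.  The class name is
made up, and the Charset overload used here assumes Java 10 or later:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class QueryDecodeCheck {
    // Decode a form-encoded query parameter value ('+' becomes a space).
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("name:%28a%26b%29"));   // name:(a&b)
        System.out.println(decode("name:%28a+%26+b%29")); // name:(a & b)
    }
}
```

The only difference is the whitespace around the '&' in the second query.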

Why is the query parser tokenizing the name field prior to the charFilter 
getting a chance to perform processing?
