Thanks for the help. Changing the field type of the copy field destination ("text") from "text_general" to "text_en" solved the problem. I'd foolishly assumed that the source field's analysis was applied first and the resulting tokens were then passed to the copy field, which doesn't really make sense now that I think about it! The copy field receives the raw input and applies its own field type's analysis.

So the indexing process is:

+-----------+     +----------------+     +-------------+
|companyName|     |  companyName   |     | companyName |
|input data |---->|text_en analysis|---->|    index    |
+-----------+     +----------------+     +-------------+
      |
      |           +----------------+     +-------------+
      +---------->|      text      |---->|    text     |
                  |text_en analysis|     |    index    |
                  +----------------+     +-------------+

Rather than:

+-----------+     +----------------+       +-------------+
|companyName|     |  companyName   |       | companyName |
|input data |---->|text_en analysis|------>|    index    |
+-----------+     +----------------+       +-------------+
                          |
               +---------------------+     +-------------+
               |         text        |---->|    text     |
               |text_general analysis|     |    index    |
               +---------------------+     +-------------+
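
In schema terms, the change that fixed it looks roughly like the fragment below. This is only a sketch based on the field definitions quoted further down in this thread: the type of "text" is the one thing changed, and every other attribute is carried over as it was. Whether the destination also needs multiValued="true" is not shown in the original definitions, so treat that as something to check against your own schema.

    <!-- Destination of the copy fields: now analysed with text_en,
         so the PatternReplaceFilterFactory step applies to it too. -->
    <field name="text" type="text_en" indexed="true" stored="false" />

    <!-- copyField passes the raw source input, not the analysed tokens;
         the destination's own field type decides how it is analysed. -->
    <copyField source="companyName" dest="text" />
    <copyField source="jobTitle"    dest="text" />

Documents indexed before the change keep their old tokens in "text", so a reindex is presumably needed before "text:(mydomain)" starts matching them.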


On 28/01/2019 12:37, Scott Stults wrote:
Hi Chris,

You've included the field definition of type text_en, but in your queries
you're searching the field "text", which is of type text_general. That may
be the source of your problem, but if looking into that doesn't help send
the definition of text_general as well.

Hope that helps!

-Scott

On Mon, Jan 28, 2019 at 6:02 AM Chris Wareham <
chris.ware...@graduate-jobs.com> wrote:

I'm trying to index some data which often includes domain names. I'd
like to remove the .com TLD, so I have modified the text_en field type
by adding a PatternReplaceFilterFactory filter. However, it doesn't
appear to be working as a search for "text:(mydomain.com)" matches
records but "text:(mydomain)" does not.

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([-a-z])\.com" replacement="$1"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([-a-z])\.com" replacement="$1"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

The actual field definitions are as follows:

    <field name="companyName" type="text_en"      indexed="true" stored="true"  required="true" />
    <field name="jobTitle"    type="text_en"      indexed="true" stored="true"  required="true" />
    <field name="text"        type="text_general" indexed="true" stored="false" />

    <copyField source="companyName" dest="text" />
    <copyField source="jobTitle"    dest="text" />


