Re: StandardTokenizer and domain names containing digits

Alex Willmer Mon, 23 Apr 2012 02:35:39 -0700

Steven A Rowe <sarowe <at> syr.edu> writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules
> from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29:
> <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.
> These rules don't include recognition of URLs or domain names.
>
> Lucene/Solr includes another tokenizer that does recognize URLs and domain
> names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.
> (Stand-alone domain names are recognized as URLs.)
>
> My suggestion is that you add a filter (for both the indexing and querying)
> that splits tokens containing periods:
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>,
> something like (untested!):
> 
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             preserveOriginal="1" />


Steve, thank you very much for this reply; it helped immensely. In the end I've
gone for your suggestion, plus swapping StandardTokenizer for
UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true". The
fieldType now looks like this:

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" 
            expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" 
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

autoGeneratePhraseQueries is set so that the tokens the query analyzer generates
from a single term behave more like the terms of a space-delimited query. So
"ns1.define.logica.com" finds a similar set of documents to "ns1 define logica
com" (i.e. "ns1 AND define AND logica AND com"), rather than "ns1 OR define OR
logica OR com".

Many thanks, Alex
