RE: StandardTokenizer and domain names containing digits

Steven A Rowe Thu, 19 Apr 2012 10:47:02 -0700

Hi Alex,

TLDR; Try adding WordDelimiterFilter to your analyzer(s).


StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules from 
Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29: 
<http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.  These 
rules don't include recognition of URLs or domain names.  (The details: in 
UAX#29 Word Boundary rules terminology, the default rule - WB14 - says that 
boundaries will be made everywhere they are not prohibited, and since there is 
no rule to prohibit making a boundary in the character sequence /Numeric, 
MidNumLet, ALetter/ - "." FULL STOP belongs to MidNumLet - boundaries are made 
between Numeric and MidNumLet, and between MidNumLet and ALetter.  By 
contrast, rules WB6 and WB7 do prohibit boundaries inside /ALetter, MidNumLet, 
ALetter/, which is why "ns.define.logica.com" stays in one piece.  
StandardTokenizer emits as tokens the character sequences between UAX#29 word 
boundaries that contain alphanumeric characters, so the MidNumLet-only token is 
dropped.)

Lucene/Solr includes another tokenizer that does recognize URLs and domain 
names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer 
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.
  (Stand-alone domain names are recognized as URLs.)
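
For illustration, a field type built on that tokenizer might look something 
like the following (an untested sketch; the name "text_url" is made up, and 
the lowercase filter is just borrowed from the example text_general type):

    <fieldType name="text_url" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <!-- Emits URLs, e-mail addresses and stand-alone domain names
             as single tokens; everything else follows UAX#29 -->
        <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With this alone, a whole domain name should end up as one token, so a query 
for only part of it (e.g. just "define.logica.com") would presumably not match.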

I think Lucene/Solr should have a way to tokenize URL (and e-mail) components, 
so that e.g. if you have "http://www.example.com/page.html" in your text, your 
index can contain "www.example.com" and "example.com", to enable e.g. queries 
containing just "example.com".  I'd like to have a URLFilter and an EmailFilter 
that would configurably tokenize components (e.g. for URLs: protocol; domain; 
base domain; domain elements; full path; path elements; 
URL-decoded-uax29-word-boundary-tokenized path elements).

This doesn't solve your problem, though.

My suggestion is that you add a filter (for both indexing and querying) 
that splits tokens containing periods: 
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>,
 something like (untested!):

    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />

Note that this filter will be applied to *all* of your tokens, not just domain 
names.
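
In context, the text_general type from your message might end up looking 
something like this (again untested; "text_general_wdf" is a made-up name, and 
putting the filter directly after the tokenizer is my assumption):

    <fieldType name="text_general_wdf" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Split tokens on periods, keeping the unsplit token as well -->
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0"
                splitOnNumerics="0"
                stemEnglishPossessive="0"
                generateWordParts="1"
                preserveOriginal="1" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Same settings as at index time, so both sides split alike -->
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0"
                splitOnNumerics="0"
                stemEnglishPossessive="0"
                generateWordParts="1"
                preserveOriginal="1" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

You would need to reindex after changing the index-time analyzer.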
 
Steve
 
-----Original Message-----
From: Alex Willmer [mailto:al.will...@logica.com] 
Sent: Thursday, April 19, 2012 12:04 PM
To: solr-user@lucene.apache.org
Subject: StandardTokenizer and domain names containing digits

TLDR; How should I make Solr treat "ns1.define.logica.com" as a single token in 
the same way "ns.define.logica.com" would be?

We are just starting to use Solr 3.5.0 in production and have run into a 
slightly surprising behaviour involving the query "ns1.define.logica.com", 
through an edismax handler with "q.op"=AND defined with

<requestHandler name="search" class="solr.SearchHandler" default="true">
 <lst name="defaults">
   <str name="echoParams">explicit</str>
   <int name="rows">10</int>
   <!-- #define customisations -->
   <str name="defType">edismax</str>
   <str name="q.op">AND</str>
   <str name="qf">
    body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
    author^10.9 changed created oneline^0.7
   </str>
   <str name="pf">
    body^0.2 tags^1.1 title^1.5
   </str>
 </lst>
</requestHandler>

The schema is defined with fields of type text_general, as found in the example 
schema.xml, namely:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The search string is being tokenised to "ns1", "define.logica.com", and the 
resulting query becomes

+DisjunctionMaxQuery((((tags:ns1 tags:define.logica.com)^1.2) |
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) |
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) |
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1
define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. 
This is contrary to e.g. "ns.define.logica.com" which is treated as a single 
token. Is there a way I can make Solr treat both queries the same way?

Many thanks, Alex
--
Alex Willmer | Developer
2 Trinity Park,  Birmingham, B37 7ES | United Kingdom
M: +44 7557 752744
al.will...@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968) Registered Office: 
250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom
