Hi Alex,

TLDR; Try adding WordDelimiterFilter to your analyzer(s).

StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29: <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>. These rules don't include recognition of URLs or domain names.

(The details: in UAX#29 Word Boundary rules terminology, the default rule - WB14 - says that boundaries will be made everywhere they are not prohibited. Since there is no rule to prohibit making a boundary in the character sequence /Numeric, MidNumLet, ALetter/ - "." FULL STOP belongs to MidNumLet - boundaries are made between Numeric and MidNumLet, and between MidNumLet and ALetter. By contrast, rules WB6 and WB7 do prohibit a boundary in the sequence /ALetter, MidNumLet, ALetter/, which is why "ns.define.logica.com" stays in one piece. StandardTokenizer emits as tokens only the character sequences between UAX#29 word boundaries that contain alphanumeric characters, so the MidNumLet-only token is dropped.)

Lucene/Solr includes another tokenizer that does recognize URLs and domain names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>. (Stand-alone domain names are recognized as URLs.)

I think Lucene/Solr should have a way to tokenize URL (and e-mail) components, so that e.g. if you have "http://www.example.com/page.html" in your text, your index can contain "www.example.com" and "example.com", to enable e.g. queries containing just "example.com". I'd like to have a URLFilter and an EmailFilter that would configurably tokenize components (e.g. for URLs: protocol; domain; base domain; domain elements; full path; path elements; URL-decoded-uax29-word-boundary-tokenized path elements). Neither filter exists yet, so this doesn't solve your problem, though.
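
For reference, the existing UAX29URLEmailTokenizer mentioned above could be wired into a field type roughly as follows. This is an untested sketch: the field type name "text_url" is hypothetical, and the filter chain is simply borrowed from the example text_general type. Since stand-alone domain names are recognized as URLs, a value such as "ns1.define.logica.com" would be expected to come through as one token, but that should be checked on the Solr admin analysis page before relying on it.

```xml
<!-- Hypothetical field type name; the filters are borrowed from the example text_general type -->
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer with no type attribute is used for both indexing and querying -->
  <analyzer>
    <!-- UAX#29 word boundaries, plus URLs, domain names and e-mail addresses kept as single tokens -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this tokenizer alone, a query for just "define.logica.com" would not be expected to match a document that only contains "ns1.define.logica.com", since the whole domain name is indexed as one term.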

My suggestion is that you add a filter (for both indexing and querying) that splits tokens containing periods: <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>, something like (untested!):

  <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="0"
          splitOnNumerics="0"
          stemEnglishPossessive="0"
          generateWordParts="1"
          preserveOriginal="1" />

Note that this filter will be applied to *all* of your tokens, not just domain names.

Steve

-----Original Message-----
From: Alex Willmer [mailto:al.will...@logica.com]
Sent: Thursday, April 19, 2012 12:04 PM
To: solr-user@lucene.apache.org
Subject: StandardTokenizer and domain names containing digits

TLDR; How should I make Solr treat "ns1.define.logica.com" as a single token in the same way "ns.define.logica.com" would be?

We are just starting to use Solr 3.5.0 in production and have run into a slightly surprising behaviour involving the query "ns1.define.logica.com", through an edismax handler with "q.op"=AND defined with

  <requestHandler name="search" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <!-- #define customisations -->
      <str name="defType">edismax</str>
      <str name="q.op">AND</str>
      <str name="qf">
        body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5
        id^10.0 author^10.9 changed created oneline^0.7
      </str>
      <str name="pf">
        body^0.2 tags^1.1 title^1.5
      </str>
    </lst>
  </requestHandler>

The schema is defined with fields of type text_general, as found in the example schema.xml, namely:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The search string is being tokenised to "ns1", "define.logica.com", and the resulting query becomes

  +DisjunctionMaxQuery((((tags:ns1 tags:define.logica.com)^1.2) | id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) | ((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1 oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) | ((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1 comments:define.logica.com)^0.4)))
  DisjunctionMaxQuery((tags:"ns1 define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1 define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. This is contrary to e.g. "ns.logica.define.com" which is treated as a single token.

Is there a way I can make Solr treat both queries the same way?

Many thanks, Alex

--
Alex Willmer | Developer
2 Trinity Park, Birmingham, B37 7ES | United Kingdom
M: +44 7557 752744
al.will...@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968)
Registered Office: 250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom