LetterTokenizerFactory emits each contiguous sequence of letters as a token and
discards everything else. Fragments such as http, https, and com would need to
be stopwords.
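
A minimal sketch of that approach, assuming a new field type. The name
"text_url" and the file "url_stopwords.txt" are placeholders I made up, not
part of your schema:

```xml
<!-- Hypothetical field type; "text_url" and "url_stopwords.txt" are
     placeholder names, not taken from the original schema. -->
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <!-- Breaks on every non-letter character, so the sample URL
             becomes: http, mail, google, com, dlkjadf -->
        <tokenizer class="solr.LetterTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- url_stopwords.txt would list http, https, www, com and so on,
             one entry per line -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="url_stopwords.txt"/>
    </analyzer>
</fieldType>
```

Because no analyzer type is given, the same chain applies at index and query
time. Note that LetterTokenizerFactory also drops digits, which may or may not
be what you want for URLs.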

Alternatively, you can try PatternTokenizerFactory with a regular expression if
you are looking for a specific part of the URL.
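
As a rough illustration, the sketch below splits on anything that is not a
letter or digit. The field type name "text_url_pattern" and the pattern itself
are only assumptions for the example; adjust the regular expression to the part
of the URL you care about:

```xml
<!-- Hypothetical field type; the name and the pattern are illustrative. -->
<fieldType name="text_url_pattern" class="solr.TextField"
           positionIncrementGap="100">
    <analyzer>
        <!-- group="-1" treats the pattern as a delimiter, so the sample URL
             becomes: http, mail, google, com, dlkjadf (digits are kept) -->
        <tokenizer class="solr.PatternTokenizerFactory"
                   pattern="[^A-Za-z0-9]+" group="-1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

With a group of 0 or higher the tokenizer emits the matched text (or that
capture group) instead of splitting, which is the way to pull out only one
piece such as the host name.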

On Sep 23, 2010, at 10:59 PM, Max Lynch wrote:

> Is there a tokenizer that will allow me to search for parts of a URL?  For
> example, the search "google" would match on the data "
> http://mail.google.com/dlkjadf"
> 
> This tokenizer factory doesn't seem to be sufficient:
> 
>        <fieldType name="text_standard" class="solr.TextField"
> positionIncrementGap="100">
>            <analyzer type="index">
>                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>                <filter class="solr.SnowballPorterFilterFactory"
> language="English" protected="protwords.txt"/>
>            </analyzer>
>            <analyzer type="query">
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> 
>                 <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.SnowballPorterFilterFactory"
> language="English" protected="protwords.txt"/>
>             </analyzer>
>    </fieldType>
> 
> Thanks.
