Hi Marian,

Extending the StandardTokenizer(Factory) Java class is not the way to go if you want to change its behavior.
StandardTokenizer is generated from a JFlex <http://jflex.de/> specification, so you would need to modify the specification to include your special slash-containing-word rule, then regenerate the Java code, and then compile it.

It would be much simpler to use a PatternReplaceCharFilter <http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html> to convert the slashes into unusual (sequences of) characters that won't be broken up by the analyzer you're using, then add a PatternReplaceFilter to convert the unusual sequences back to slashes.

E.g. if you used "-blah-" as the unusual sequence (note: people have also reported using a single character drawn from a script that would otherwise not be used in the text, e.g. a Chinese ideograph in English text), "AB/1234/5678" could become "AB-blah-1234-blah-5678".

Here's an (untested!) analyzer specification that would do this:

  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
                replacement="$1-blah-$2-blah-$3"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="-blah-" replacement="/" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

Steve

> -----Original Message-----
> From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
> Sent: Wednesday, November 30, 2011 9:41 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Leaving certain tokens intact during indexing and search
>
> Thanks for the quick response!
>
> Are you saying that I should extend WhitespaceTokenizerFactory to create
> my own? Or should I simply use it?
>
> Because, I guess tokenizing on spaces wouldn't be enough. I would need
> tokenizing on slashes in other positions, just not within strings matching
> ([A-Z]+/[0-9]+/[0-9]+).
>
> Marian
>
>
> 2011/11/30 Erick Erickson <erickerick...@gmail.com>
>
> > There's about a zillion tokenizers, for what you're describing
> > WhitespaceTokenizerFactory is a good candidate.
> >
> > See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> > for a partial list, and it has links to the authoritative docs.
> >
> > Best
> > Erick
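
The placeholder round trip suggested in the reply above can be sketched in plain Java, outside Solr, to check what the two regex steps do to a sample string. This is only an illustration under stated assumptions: the class and method names are made up for the example, and the "tokenizer" is a stand-in that splits on whitespace and slashes, not StandardTokenizer. Whether a real tokenizer leaves "-blah-" intact still has to be verified against the analyzer actually in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical class name; nothing here is part of Solr or Lucene.
class SlashPlaceholderSketch {
    // Same regex and placeholder as in the (untested) analyzer specification.
    private static final Pattern ID = Pattern.compile("([A-Z]+)/([0-9]+)/([0-9]+)");
    private static final String PLACEHOLDER = "-blah-";

    // Char-filter step: hide the slashes inside matching identifiers.
    public static String protect(String text) {
        return ID.matcher(text).replaceAll("$1" + PLACEHOLDER + "$2" + PLACEHOLDER + "$3");
    }

    // Token-filter step: turn every placeholder back into a slash.
    public static String restore(String token) {
        return token.replace(PLACEHOLDER, "/");
    }

    // Stand-in analysis chain (assumption): split on whitespace and slashes,
    // restore the placeholders, then lowercase each token.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : protect(text).split("[\\s/]+")) {
            if (!raw.isEmpty()) {
                tokens.add(restore(raw).toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [see, ab/1234/5678, and, foo, bar]
        System.out.println(analyze("See AB/1234/5678 and foo/bar"));
    }
}
```

The slash in "foo/bar" is still treated as a separator, while the identifier matching ([A-Z]+)/([0-9]+)/([0-9]+) survives as a single token.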