I have documents containing tokens of a certain format in arbitrary
positions, like this:
... blah blahblah AB/1234/5678 blah blah blahblah ...
I would like to enable usual keyword searching within these documents. In
addition, I'd also like to enable users to find AB/1234/5678, ideally
There are about a zillion tokenizers; for what you're describing,
WhitespaceTokenizerFactory is a good candidate.
See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for a partial list, and it has links to the authoritative docs.
Best
Erick
On Wed, Nov 30, 2011 at 9:23 AM, Marian
Thanks for the quick response!
Are you saying that I should extend WhitespaceTokenizerFactory to create my
own? Or should I simply use it?
Because, I guess tokenizing on spaces wouldn't be enough. I would need
tokenizing on slashes in other positions, just not within strings matching
Well, it depends (tm). No, in your case WhitespaceTokenizer wouldn't work,
although it did satisfy your initial statement.
You could consider PatternTokenizerFactory, but take a look at the
link I provided, and follow it to the javadocs to see if there are
better matches.
Best
Erick
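To illustrate the pattern-based idea suggested above, here is a minimal sketch in plain Python. It is not Solr configuration and not code from this thread. It assumes the special tokens always look like two uppercase letters followed by two slash-separated four-digit groups, which is a guess based on the single example AB/1234/5678. The names SPECIAL, TOKEN_RE and tokenize are hypothetical. The regex matches the tokens themselves instead of the delimiters, trying the special format first and otherwise taking a run of characters that are neither whitespace nor slashes.

```python
import re

# Assumed token format (inferred from the one example in the thread):
# two uppercase letters, then two slash-separated four-digit groups.
SPECIAL = r"[A-Z]{2}/\d{4}/\d{4}"

# Alternation order matters: the special format is tried first, so its
# slashes are kept. Otherwise a token is any run of characters that are
# neither whitespace nor slashes.
TOKEN_RE = re.compile(rf"{SPECIAL}|[^\s/]+")


def tokenize(text):
    """Return the tokens of text, keeping special tokens intact."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("blah blahblah AB/1234/5678 blah and/or blah"))
```

A regex along these lines could probably be adapted for a pattern-based tokenizer in Solr, but the exact attributes should be checked against the javadocs mentioned above.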
Sent: Wednesday, November 30, 2011 9:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Leaving certain tokens intact during indexing and search
That's pretty helpful, thanks! Especially since I hadn't realized until now
that I could use a filter like PatternReplaceCharFilterFactory both as a
charFilter and as a filter.
In the meantime I had figured out another alternative, involving
WordDelimiterFilterFactory. But I had to use
-Original Message-
From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
Sent: Wednesday, November 30, 2011 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Leaving certain tokens intact during indexing and search
Got me right when Solr reported the error on restart :) Thanks!
2011/11/30 Steven A Rowe sar...@syr.edu
Note that my example does not actually use PatternReplaceCharFilterFactory
twice: the second one is a PatternReplaceFilterFactory, without "Char" in
its name.
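The difference between the two stages can be sketched in plain Python. This is an illustration only, not the example referred to above, which does not appear in this thread excerpt. A char filter rewrites the raw text before tokenizing, while a token filter rewrites each token afterwards. The sketch assumes one possible use of that pair: hide the slashes of the special tokens behind a placeholder, tokenize on whitespace and slashes, then restore the slashes. The token format, the placeholder _SLASH_ and all function names are hypothetical.

```python
import re

# Assumed token format, inferred from the single example AB/1234/5678.
SPECIAL = re.compile(r"\b([A-Z]{2})/(\d{4})/(\d{4})\b")

# Hypothetical placeholder; it must not occur in real document text.
PLACEHOLDER = "_SLASH_"


def char_filter(text):
    """Stage 1 (before tokenizing): hide slashes inside special tokens."""
    return SPECIAL.sub(lambda m: PLACEHOLDER.join(m.groups()), text)


def tokenize(text):
    """Stage 2: split on whitespace and on any remaining slashes."""
    return re.findall(r"[^\s/]+", text)


def token_filter(token):
    """Stage 3 (after tokenizing): restore the hidden slashes."""
    return token.replace(PLACEHOLDER, "/")


def analyze(text):
    """Run the three stages in order and return the final tokens."""
    return [token_filter(t) for t in tokenize(char_filter(text))]


if __name__ == "__main__":
    print(analyze("see AB/1234/5678 and/or this"))
```

One caveat of this design: any document text that already contains the placeholder string would be altered by the last stage, so the placeholder has to be chosen with care.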