Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
I have documents containing tokens of a certain format in arbitrary positions, like this: ... blah blahblah AB/1234/5678 blah blah blahblah ... I would like to enable usual keyword searching within these documents. In addition, I'd also like to enable users to find AB/1234/5678, ideally

Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Erick Erickson
There are about a zillion tokenizers; for what you're describing, WhitespaceTokenizerFactory is a good candidate. See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for a partial list; it also links to the authoritative docs. Best Erick On Wed, Nov 30, 2011 at 9:23 AM, Marian

Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
Thanks for the quick response! Are you saying that I should extend WhitespaceTokenizerFactory to create my own? Or should I simply use it? Because, I guess tokenizing on spaces wouldn't be enough. I would need tokenizing on slashes in other positions, just not within strings matching

Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Erick Erickson
Well, it depends (tm). No, in your case WhitespaceTokenizer wouldn't work, although it did satisfy your initial statement. You could consider PatternTokenizerFactory, but take a look at the link I provided, and follow it to the javadocs to see if there are better matches. Best Erick On Wed, Nov
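The requirement discussed up to this point (split on whitespace and slashes, but keep tokens like AB/1234/5678 whole) can be sketched outside Solr with a single regular expression. This is plain Python, not Solr configuration, and the identifier shape (two uppercase letters, then two groups of four digits) is only an assumption inferred from the one example in the thread. The names ID_PATTERN, TOKEN_RE and tokenize are made up for this illustration.

```python
import re

# Assumed identifier shape, guessed from the single example "AB/1234/5678".
ID_PATTERN = r"[A-Z]{2}/\d{4}/\d{4}"

# Alternation order matters: try the identifier first, otherwise take
# any run of characters that is neither whitespace nor a slash.
TOKEN_RE = re.compile(rf"{ID_PATTERN}|[^\s/]+")


def tokenize(text):
    """Return tokens, keeping identifier-shaped strings in one piece."""
    return TOKEN_RE.findall(text)


tokens = tokenize("blah and/or AB/1234/5678 blah")
print(tokens)
```

Whether a pattern-based tokenizer in Solr can be configured to behave the same way depends on the regex and the options it supports, so the javadocs mentioned above remain the place to check.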

RE: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Steven A Rowe
, November 30, 2011 9:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Leaving certain tokens intact during indexing and search

Thanks for the quick response! Are you saying that I should extend WhitespaceTokenizerFactory to create my own? Or should I simply use it? Because, I guess

Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
That's pretty helpful, thanks! Especially since I didn't understand so far that I could use a filter like PatternReplaceCharFilterFactory both as a charFilter and as a filter. In the meantime I had figured out another alternative, involving WordDelimiterFilterFactory. But I had to use

RE: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Steven A Rowe
-----Original Message-----
From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
Sent: Wednesday, November 30, 2011 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Leaving certain tokens intact during indexing and search

That's pretty helpful, thanks! Especially since I didn't understand so

Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
Got me right when Solr reported the error on restart :) Thanks!

2011/11/30 Steven A Rowe sar...@syr.edu
> Note that my example does not actually use PatternReplaceCharFilterFactory twice - the second one is actually a PatternReplaceFilterFactory - note that Char isn't present in the second one.
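The distinction in that last message (a char filter rewrites the raw text before tokenizing, a token filter rewrites each token afterwards) can be illustrated with a small Python sketch. Steven's actual configuration is not included in this archive excerpt, so this is not his example: the identifier shape, the underscore placeholder and all function names below are assumptions made up for the illustration.

```python
import re

# Assumed identifier shape, guessed from "AB/1234/5678" in the first message.
ID_RE = re.compile(r"([A-Z]{2})/(\d{4})/(\d{4})")
# Protected form; assumes "_" does not otherwise appear in such tokens.
PROTECTED_RE = re.compile(r"^([A-Z]{2})_(\d{4})_(\d{4})$")


def char_filter(text):
    """Before tokenizing: hide the slashes inside identifiers."""
    return ID_RE.sub(r"\1_\2_\3", text)


def tokenizer(text):
    """Split on runs of whitespace and slashes, dropping empty pieces."""
    return [t for t in re.split(r"[\s/]+", text) if t]


def token_filter(token):
    """After tokenizing: restore the slashes in protected identifiers."""
    return PROTECTED_RE.sub(r"\1/\2/\3", token)


def analyze(text):
    return [token_filter(t) for t in tokenizer(char_filter(text))]


print(analyze("see and/or AB/1234/5678 here"))
```

The first and last stages do similar regex work but run at different points in the chain, which is roughly why the two Solr factory names look so alike and are easy to mix up.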