On Mar 30, 2010, at 8:33 AM, Yonik Seeley wrote:

> On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir <rcm...@gmail.com> wrote:
>> We have two choices:
>> * we could treat this stuff as impl details, and add protwords.txt support
>> to all stemming factories. we could just wrap the filter with a
>> keywordmarkerfilter internally.
>> * we could deprecate the explicit protwords.txt in the few factories that
>> support it, and instead create a factory for KeywordMarkerFilter.
>> * we could do something else, e.g. both.
>>
>> So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
>> could do:
>>
>> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>> <filter class="solr.SomeStemmer"/>
>>
>> and get the same effect, instead of having to add support for protwords.txt
>> to every single stem factory.
>
> Yep, this decomposition seems more powerful.
>
> Sort of related: for a long time I've had the idea of allowing the
> expression of more complex filter chains that can conditionally
> execute some parts based on tags set by other parts.
>
> This is straightforward to just hand-code in Java of course, but
> trickier to do well in a declarative setting:
>
> <filter class="solr.Tagger" tag="protect" words="protwords.txt"/>
> <filter class="solr.SomeStemmer" skipTags="protect"/>
>
> The idea was to also make this fast by allocating a bit per tag
> (assuming we somehow knew all of the possible ones in a particular
> filter chain) and using a bitfield (long) to set and test. I was
> planning on using Token.flags before the new analysis attribute stuff
> came into being.

I believe you have to declare the Attributes up front, right? So it should be possible to know them, right?

>
> It would also be nice to make the token categories generated by
> tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A
> tokenizer that detected many of the properties could significantly
> speed up analysis because tokens would not have to be re-analyzed to
> see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
> path for WDF would be checking a bit per token).

Good opportunity to get rid of the TypeAttribute altogether, too, as that thing is no longer useful.
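
For what it's worth, the bit-per-tag idea quoted above can be sketched in plain Java without touching any Lucene classes. This is only an illustration under stated assumptions: TagBits, Token, tag, stem, and analyze are hypothetical names (none of them exist in Lucene or Solr), the "stemmer" just strips a trailing "s", and the full set of tags is assumed to be known when the chain is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one bit of a long per tag declared in a filter chain.
class TagBits {
    private final Map<String, Long> masks = new LinkedHashMap<>();

    // All tag names must be declared up front so each can get a fixed bit.
    public TagBits(String... tagNames) {
        if (tagNames.length > Long.SIZE) {
            throw new IllegalArgumentException("at most 64 tags fit in a long");
        }
        for (int i = 0; i < tagNames.length; i++) {
            masks.put(tagNames[i], 1L << i);
        }
    }

    public long mask(String tagName) {
        Long m = masks.get(tagName);
        if (m == null) {
            throw new IllegalArgumentException("undeclared tag: " + tagName);
        }
        return m;
    }

    // Hypothetical token: text plus a bitfield of tags (not Lucene's Token).
    static final class Token {
        String text;
        long tags;
        Token(String text) { this.text = text; }
    }

    // Stand-in for the "Tagger" filter: sets the tag bit on listed words.
    static void tag(List<Token> tokens, Set<String> words, long mask) {
        for (Token t : tokens) {
            if (words.contains(t.text)) {
                t.tags |= mask;
            }
        }
    }

    // Stand-in for a stemmer with skipTags: one AND per token decides
    // whether the token is left alone. The "stemming" is a toy rule.
    static void stem(List<Token> tokens, long skipMask) {
        for (Token t : tokens) {
            if ((t.tags & skipMask) != 0) {
                continue; // tagged, so skip this stage
            }
            if (t.text.length() > 1 && t.text.endsWith("s")) {
                t.text = t.text.substring(0, t.text.length() - 1);
            }
        }
    }

    // Runs the two-stage chain: tag protected words, then stem the rest.
    public static List<String> analyze(List<String> input, Set<String> protectedWords) {
        long protect = new TagBits("protect").mask("protect");
        List<Token> tokens = new ArrayList<>();
        for (String s : input) {
            tokens.add(new Token(s));
        }
        tag(tokens, protectedWords, protect);
        stem(tokens, protect);
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [cat, mars, dog]
        System.out.println(analyze(Arrays.asList("cats", "mars", "dogs"),
                                   Collections.singleton("mars")));
    }
}
```

A stage with several skipTags would OR the masks together once at setup, so the per-token check stays a single AND no matter how many tags it skips.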