Re: protwords.txt support in stemmers

Grant Ingersoll Tue, 30 Mar 2010 12:17:47 -0700

On Mar 30, 2010, at 8:33 AM, Yonik Seeley wrote:

> On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir <rcm...@gmail.com> wrote:
>> We have a few choices:
>> * we could treat this stuff as impl details, and add protwords.txt support
>> to all stemming factories. we could just wrap the filter with a
>> KeywordMarkerFilter internally.
>> * we could deprecate the explicit protwords.txt in the few factories that
>> support it, and instead create a factory for KeywordMarkerFilter.
>> * we could do something else, e.g. both.
>> 
>> So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
>> could do:
>> 
>> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>> <filter class="solr.SomeStemmer"/>
>> 
>> and get the same effect, instead of having to add support for protwords.txt
>> to every single stem factory.
> 
> Yep, this decomposition seems more powerful.
> 
> Sort of related: for a long time I've had the idea of allowing the
> expression of more complex filter chains that can conditionally
> execute some parts based on tags set by other parts.
> 
> This is straightforward to just hand-code in Java of course, but
> trickier to do well in a declarative setting:
> 
> <filter class="solr.Tagger" tag="protect" words="protwords.txt"/>
> <filter class="solr.SomeStemmer" skipTags="protect"/>
> 
> The idea was to also make this fast by allocating a bit per tag
> (assuming we somehow knew all of the possible ones in a particular
> filter chain) and using a bitfield (long) to set and test.  I was
> planning on using Token.flags before the new analysis attribute stuff
> came into being.


I believe you have to declare the Attributes up front, right?  If so, it
should be possible to know the full set of tags for a given chain.
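
To make the bit-per-tag idea concrete, here is a rough sketch in plain Java with no Lucene dependencies. Everything in it is hypothetical: TagBits, Tok, and the toy "strip a trailing s" stemmer are stand-ins invented for illustration, not existing Solr or Lucene classes. It assumes all tag names in a chain are known when the chain is built, so each can be given one bit of a long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: one bit per tag name, set and tested with a mask. */
final class TagBits {
    private final Map<String, Integer> bitIndex = new LinkedHashMap<>();

    /** Returns the single-bit mask for a tag, allocating the next free bit on first use. */
    long maskFor(String tag) {
        Integer idx = bitIndex.get(tag);
        if (idx == null) {
            if (bitIndex.size() >= Long.SIZE) {
                throw new IllegalStateException("more than 64 tags in one chain");
            }
            idx = bitIndex.size();
            bitIndex.put(tag, idx);
        }
        return 1L << idx;
    }

    /** A token plus its tag bits; stands in for a per-token flags attribute. */
    static final class Tok {
        String text;
        long flags;
        Tok(String text) { this.text = text; }
    }

    /** "Tagger" stage: sets the tag bit on every token found in the word list. */
    static void tag(List<Tok> toks, Set<String> words, long mask) {
        for (Tok t : toks) {
            if (words.contains(t.text)) t.flags |= mask;
        }
    }

    /** Toy "stemmer" stage: strips a trailing "s" unless a skip bit is set. */
    static void stem(List<Tok> toks, long skipMask) {
        for (Tok t : toks) {
            if ((t.flags & skipMask) != 0) continue; // protected: leave untouched
            if (t.text.length() > 1 && t.text.endsWith("s")) {
                t.text = t.text.substring(0, t.text.length() - 1);
            }
        }
    }

    /** Runs tagger then stemmer, mirroring the two-filter chain quoted above. */
    static List<String> run(List<String> input, Set<String> protectedWords) {
        TagBits bits = new TagBits();
        long protect = bits.maskFor("protect");
        List<Tok> toks = new ArrayList<>();
        for (String s : input) toks.add(new Tok(s));
        tag(toks, protectedWords, protect);
        stem(toks, protect);
        List<String> out = new ArrayList<>();
        for (Tok t : toks) out.add(t.text);
        return out;
    }

    public static void main(String[] args) {
        Set<String> prot = new HashSet<>(Arrays.asList("news"));
        System.out.println(run(Arrays.asList("cats", "news", "dogs"), prot));
    }
}
```

The per-token test in the stemmer is a single AND against a precomputed mask, which is the cheap check being described; how the bits would actually be carried on a token (flags, a dedicated attribute, or something else) is left open here.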

> 
> It would also be nice to make the token categories generated by
> tokenizers into tags (like StandardTokenizer's ACRONYM, etc).  A
> tokenizer that detected many of the properties could significantly
> speed up analysis because tokens would not have to be re-analyzed to
> see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
> path for WDF would be checking a bit per token).

Good opportunity to get rid of the TypeAttribute altogether, too, as
it is no longer useful.
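
For the "properties as tags" part, a tokenizer could classify each token once and hand the result downstream as bits. The sketch below is only an assumption about what that might look like: TokenProps, its constants, and the deliberately crude notion of a "plain" token are made up for illustration and do not reproduce WordDelimiterFilter's actual rules.

```java
/** Hypothetical sketch: compute per-token property bits in one pass. */
final class TokenProps {
    static final long HAS_UPPER  = 1L << 0;
    static final long HAS_LOWER  = 1L << 1;
    static final long HAS_DIGIT  = 1L << 2;
    static final long HAS_HYPHEN = 1L << 3;

    /** Scans the term once and records which character classes it contains. */
    static long classify(CharSequence term) {
        long bits = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isUpperCase(c)) bits |= HAS_UPPER;
            else if (Character.isLowerCase(c)) bits |= HAS_LOWER;
            else if (Character.isDigit(c)) bits |= HAS_DIGIT;
            else if (c == '-') bits |= HAS_HYPHEN;
        }
        return bits;
    }

    /**
     * Fast-path test a later filter could use: true when the token has no
     * digits, no hyphens, and does not mix upper and lower case, so it
     * needs no further splitting under this simplified definition.
     */
    static boolean isPlain(long bits) {
        long bothCases = HAS_UPPER | HAS_LOWER;
        boolean mixedCase = (bits & bothCases) == bothCases;
        return !mixedCase && (bits & (HAS_DIGIT | HAS_HYPHEN)) == 0;
    }
}
```

A downstream filter would then call something like isPlain(bits) per token instead of rescanning the characters, which is where the speedup would come from.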
