protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
Hello Solr devs, One thing we did recently in Lucene that I would like to expose in Solr is to add support for protected words to all stemmers. So the way this works is that a TokenStream attribute 'KeywordAttribute' is set, and all the stem filters know to ignore tokens with this boolean value
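To make the mechanism described above concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the Token class, markKeywords, and the toy stem method are hypothetical stand-ins, assumed only for illustration. The idea it models is the one from the message: an earlier step sets a boolean keyword flag on protected tokens, and the stemming step leaves any flagged token untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the keyword-flag idea; not Lucene's actual classes.
class ProtectedWordsSketch {

    /** Stand-in for a token that carries a boolean "keyword" flag. */
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }
    }

    /** Marking step: flag every term found in the protected-words set. */
    static List<Token> markKeywords(List<String> terms, Set<String> protectedWords) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, protectedWords.contains(term)));
        }
        return out;
    }

    /**
     * Toy stemmer (assumption: strips one trailing "s").
     * Tokens flagged as keywords pass through unchanged.
     */
    static List<String> stem(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            String term = token.term;
            if (!token.keyword && term.length() > 1 && term.endsWith("s")) {
                term = term.substring(0, term.length() - 1);
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a protwords.txt file.
        Set<String> protwords = new HashSet<>(Arrays.asList("news"));
        List<Token> marked = markKeywords(Arrays.asList("cats", "news", "dogs"), protwords);
        System.out.println(stem(marked)); // prints [cat, news, dog]
    }
}
```

Because the stemmer only checks the flag, the protected-word list stays out of the stemmer itself, and any stemmer that honors the flag gets the same behavior.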

Re: protwords.txt support in stemmers

2010-03-30 Thread Yonik Seeley
On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices: * we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a keywordmarkerfilter internally. * we could deprecate the

Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: It would also be nice to make the token categories generated by tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A tokenizer that detected many of the properties could significantly speed up analysis

Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices: * we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap

Re: protwords.txt support in stemmers

2010-03-30 Thread Yonik Seeley
On Tue, Mar 30, 2010 at 10:07 AM, Robert Muir rcm...@gmail.com wrote: Sorta unrelated too, but on the same topic of performance, I'd really like to improve the indexing speed with the example schema, and that's my hidden motivation here. I think we've already significantly improved WDF and

Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 10:32 AM, Yonik Seeley yo...@lucidimagination.com wrote: Unfortunately not... it's normally something ad hoc like uploading a big CSV file, etc. There's also the very simplistic TestIndexingPerformance. ant test -Dtestcase=TestIndexingPerformance -Dargs=-server

Re: protwords.txt support in stemmers

2010-03-30 Thread Grant Ingersoll
On Mar 30, 2010, at 8:33 AM, Yonik Seeley wrote: On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices: * we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a