Hello Solr devs,
One thing we did recently in Lucene that I would like to expose in Solr is
to add support for protected words to all stemmers.
So the way this works is that a TokenStream attribute, 'KeywordAttribute', is
set, and all the stem filters know to ignore tokens with this boolean value
On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote:
We have two choices:
* we could treat this stuff as implementation details, and add protwords.txt
support to all stemming factories. We could just wrap the filter with a
KeywordMarkerFilter internally.
* we could deprecate the
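
As a rough illustration of the first option above, here is a toy Java sketch with no Lucene dependency. Everything in it is an assumption made for illustration: Token, markKeywords, stem, and stemWithProtWords are hypothetical stand-ins rather than Lucene or Solr classes, and the "stemmer" is a deliberately naive suffix stripper. It only shows the shape of the idea: a marker step flags protected words, the stemmer skips flagged tokens, and the factory-like entry point wires the two together internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are Lucene or Solr classes.
final class ProtectedWordsSketch {

    // Hypothetical stand-in for a token carrying a keyword flag
    // (the role KeywordAttribute plays in the thread).
    static final class Token {
        String text;
        boolean keyword;
        Token(String text) { this.text = text; }
    }

    // Stand-in for a keyword marker step: flags tokens found in the protected set.
    static List<Token> markKeywords(List<Token> tokens, Set<String> protectedWords) {
        for (Token t : tokens) {
            if (protectedWords.contains(t.text)) {
                t.keyword = true;
            }
        }
        return tokens;
    }

    // Deliberately naive "stemmer": strips one trailing "s" unless the token is flagged.
    static List<Token> stem(List<Token> tokens) {
        for (Token t : tokens) {
            if (!t.keyword && t.text.length() > 1 && t.text.endsWith("s")) {
                t.text = t.text.substring(0, t.text.length() - 1);
            }
        }
        return tokens;
    }

    // Option 1 in miniature: the caller passes a protected-word set and the
    // marker is applied in front of the stemmer internally.
    static List<String> stemWithProtWords(List<String> words, Set<String> protectedWords) {
        List<Token> tokens = new ArrayList<>();
        for (String w : words) {
            tokens.add(new Token(w));
        }
        List<String> out = new ArrayList<>();
        for (Token t : stem(markKeywords(tokens, protectedWords))) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // "lucas" is protected, so it keeps its trailing "s".
        System.out.println(
            stemWithProtWords(List.of("cats", "solr", "lucas", "dogs"), Set.of("lucas")));
        // prints [cat, solr, lucas, dog]
    }
}
```

In this arrangement the stemmer itself needs no knowledge of protwords.txt; it only checks the flag, which is why the same marker could sit in front of any stemmer.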
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote:
It would also be nice to make the token categories generated by
tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A
tokenizer that detected many of the properties could significantly
speed up analysis
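
One way to picture the tagging idea is the toy Java sketch below. It is not Lucene's tokenizer or attribute API: the Tag enum, TaggedToken, tokenize, and lowercase are all hypothetical names, and the detection rules are simplistic regular expressions chosen only for illustration. The point it tries to show is that properties detected once at tokenization time can be consulted cheaply by later filters.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: not Lucene's tokenizer or attribute API.
final class TokenTagSketch {

    // Hypothetical tag set; ACRONYM echoes the StandardTokenizer type named in the thread.
    enum Tag { ACRONYM, NUMERIC, HAS_UPPERCASE }

    static final class TaggedToken {
        final String text;
        final Set<Tag> tags;
        TaggedToken(String text, Set<Tag> tags) {
            this.text = text;
            this.tags = tags;
        }
    }

    // "Tokenizer": splits on whitespace and records each token's properties once.
    static List<TaggedToken> tokenize(String input) {
        List<TaggedToken> out = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Set<Tag> tags = EnumSet.noneOf(Tag.class);
            if (word.matches("(\\p{Alpha}\\.){2,}")) {
                tags.add(Tag.ACRONYM);
            }
            if (word.matches("\\d+")) {
                tags.add(Tag.NUMERIC);
            }
            if (!word.equals(word.toLowerCase(Locale.ROOT))) {
                tags.add(Tag.HAS_UPPERCASE);
            }
            out.add(new TaggedToken(word, tags));
        }
        return out;
    }

    // Downstream "filter": lowercases only tokens tagged HAS_UPPERCASE and
    // passes every other token through without inspecting its characters.
    static List<String> lowercase(List<TaggedToken> tokens) {
        List<String> out = new ArrayList<>();
        for (TaggedToken t : tokens) {
            out.add(t.tags.contains(Tag.HAS_UPPERCASE)
                ? t.text.toLowerCase(Locale.ROOT)
                : t.text);
        }
        return out;
    }
}
```

Whether this kind of tagging actually speeds up a real analysis chain would need to be measured; the sketch only shows where the saved work would come from.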
On Tue, Mar 30, 2010 at 10:07 AM, Robert Muir rcm...@gmail.com wrote:
Sorta unrelated too, but on the same topic of performance, I'd really like
to improve the indexing speed with the example schema, and that's my hidden
motivation here.
I think we've already significantly improved WDF and
On Tue, Mar 30, 2010 at 10:32 AM, Yonik Seeley yo...@lucidimagination.com wrote:
Unfortunately not... it's normally something ad hoc like uploading a
big CSV file, etc.
There's also the very simplistic TestIndexingPerformance.
ant test -Dtestcase=TestIndexingPerformance -Dargs=-server