On Thu, Sep 23, 2010 at 12:02 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
>
> The problem with "l'art" is actually due to a bug or feature in the
> QueryParser. Currently the QueryParser interacts with the token chain and
> decides whether the tokens coming back from a tokenfilter should be treated
> as a phrase query based on whether or not more than one non-synonym token
> comes back from the tokenstream for a single 'queryparser token'.
>
Just a note: in Solr's trunk or 3.x branch you already have a lot more
flexibility with this stuff:

1. For the specific problem of l'art: you can use the ElisionFilterFactory;
it's actually designed to address this. Before, it was a bit unwieldy to use
(you had to supply your own list of French contractions: l', m', etc.); with
trunk or 3.x you can just add it to your analyzer, and if you don't specify a
list it uses the default list from Lucene's FrenchAnalyzer.

2. If you are using WordDelimiterFilter, you can customize how it splits on a
per-character basis. See https://issues.apache.org/jira/browse/SOLR-2059 ; a
user gave a nice example there of how to treat '#' and '@' specially for
twitter messages.

3. In all cases, if you don't want phrase queries automatically formed unless
the user put them in quotes, you can turn it off in your fieldtype:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="false">

(somewhat related) Tom, thanks for posting your schema. Given your problems
with huge amounts of terms, I looked at your previous messages, ran some quick
math, and guesstimated that your average term length must be quite large. Yet
I notice your website (http://www.hathitrust.org/visualizations_languages)
says you have 18,329 Thai books, and you have no ThaiWordFilter in your
schema. Are you sure your terms are not filled with tons of very long
untokenized Thai sentences? (Thai uses no spaces between words.)

Just an idea :)

-- 
Robert Muir
rcm...@gmail.com
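To make point 1 concrete, here is a minimal sketch of a French-aware field
type using ElisionFilterFactory on trunk/3.x. The field and tokenizer names
are illustrative choices, not taken from any particular schema; the key part
is that ElisionFilterFactory with no explicit articles list falls back to the
default French contraction list from Lucene's FrenchAnalyzer:

```xml
<!-- Illustrative field type: strip French elisions (l', m', d', ...) at
     index and query time, so l'art and art match the same token. -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- no "articles" attribute: uses the default French contraction list -->
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place, analyzing "l'art" should yield the single token "art"
rather than two tokens that the QueryParser turns into a phrase query.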
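And for point 2, a sketch of what the SOLR-2059 per-character customization
looks like (the filename wdfftypes.txt and the surrounding attributes are
assumptions for illustration; see the issue itself for the exact syntax):

```xml
<!-- Illustrative: point WordDelimiterFilterFactory at a custom
     character-type mapping file -->
<filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"
        generateWordParts="1" generateNumberParts="1"/>
```

where wdfftypes.txt remaps the characters you want left alone, e.g. treating
'#' and '@' as letters so hashtags and @-mentions survive intact:

```
# treat these as ordinary letters instead of split points
\# => ALPHA
@ => ALPHA
```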