On Thu, Sep 23, 2010 at 12:02 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
>
> The problem with "l'art" is actually due to a bug or feature in the
> QueryParser.  Currently the QueryParser interacts with the token chain and
> decides whether the tokens coming back from a tokenfilter should be treated
> as a phrase query, based on whether or not more than one non-synonym token
> comes back from the tokenstream for a single 'queryparser token'.
>

Just a note: in Solr's trunk or the 3.x branch you already have a lot more
flexibility with this:

1. For the specific problem of "l'art": you can use the ElisionFilterFactory;
it's actually designed to address this. It used to be a bit unwieldy to use
(you had to supply your own list of French contractions: l', m', etc.), but
with trunk or 3.x you can just add it to your analyzer, and if you don't
specify a list it uses the default list from Lucene's FrenchAnalyzer.
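
For example, a minimal sketch of a French field type (the name "text_fr" and
the rest of the analyzer chain here are just illustrative, not taken from
your schema):

  <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- no "articles" attribute, so it falls back to the default French
           contraction list from Lucene's FrenchAnalyzer -->
      <filter class="solr.ElisionFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With this, l'art should analyze to the single token "art", so the queryparser
never sees two tokens to turn into a phrase query.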

2. If you are using WordDelimiterFilter, you can customize how it splits on
a per-character basis. See https://issues.apache.org/jira/browse/SOLR-2059 ;
a user gave a nice example there of how you can treat '#' and '@' specially
for Twitter messages.
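
Roughly, it works like this (the file name "wdftypes.txt" is just a
placeholder): you point the filter at a character-type mapping file,

  <filter class="solr.WordDelimiterFilterFactory" types="wdftypes.txt"/>

and in wdftypes.txt you map the characters you want kept to ALPHA so the
filter stops splitting on them:

  @ => ALPHA
  \u0023 => ALPHA

(\u0023 is '#', written as an escape because a literal # starts a comment in
the mapping file.)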

3. In all cases, if you don't want phrase queries automatically formed
unless the user puts them in quotes, you can turn it off in your fieldtype:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="false">

(somewhat related)
Tom, thanks for posting your schema. Given your problems with huge numbers of
terms, I looked at your previous messages, ran some quick math, and
guesstimated that your average term length must be quite large.

Yet I notice your website
(http://www.hathitrust.org/visualizations_languages) says you have 18,329
Thai books, and you have no ThaiWordFilter in your schema.

Are you sure your terms are not filled with tons of very long untokenized
Thai sentences? (Thai uses no spaces between words.) Just an idea :)
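
If that is what's happening, a Thai-aware field type might be worth a try.
A sketch only (the name "text_thai" and the exact chain are my assumption,
not a tested recommendation):

  <fieldType name="text_thai" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- breaks runs of Thai characters into individual words -->
      <filter class="solr.ThaiWordFilterFactory"/>
    </analyzer>
  </fieldType>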

-- 
Robert Muir
rcm...@gmail.com
