Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Marvin Humphrey
On Oct 11, 2005, at 10:04 AM, Hugo Lafayette wrote:
> First of all, and maybe I make a false assumption here, but if you strip leading "j'", "t'" and so on, that means that if you make a search like: +text:"il m'aime" you will get documents with the sentence "il m'aime" (French for "he lov

Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Hugo Lafayette
Marvin Humphrey wrote:
> I'm curious: are there any cases in French where a string with an
> apostrophe in it ought to be split into two searchable tokens? I
> know of no such cases in English: you never want to search for the ll
> in you'll, or the O in O'Reilly, etc.

First of all, and ma

Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Marvin Humphrey
On Oct 11, 2005, at 7:52 AM, Hugo Lafayette wrote:
> Why not include that in the FrenchStemFilter "next()" method itself? Would that be bad design?

I agree with your assessment. Conceptually, this is a stemming problem. By extension, it's not a tokenizing problem, and the behavior o

Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Erik Hatcher
On Oct 11, 2005, at 10:52 AM, Hugo Lafayette wrote:
> Erik Hatcher wrote:
>> Rather than changing StandardAnalyzer, you could create a custom Analyzer that is something along the lines of StandardTokenizer -> custom apostrophe splitting filter -> ISOLatinFilter.
>
> Why not include that in the

Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Hugo Lafayette
Erik Hatcher wrote:
> Rather than changing StandardAnalyzer, you could create a custom
> Analyzer that is something along the lines of StandardTokenizer ->
> custom apostrophe splitting filter -> ISOLatinFilter.

Why not include that in the FrenchStemFilter "next()" method itself? It wil

Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Erik Hatcher
On Oct 11, 2005, at 9:22 AM, Hugo Lafayette wrote:
> - accented characters: The French analyzer keeps accents, which could be useful, but may also become annoying. I just have to add the ISOLatinFilter.java to correct that, but maybe adding an option to keep them or not could be useful.
> - ap
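
For readers following the thread, the chain discussed above (tokenize, then handle apostrophes, then remove accents) can be illustrated with a minimal sketch in plain Java. It deliberately does not use the Lucene API: the class `FrenchElisionSketch`, its methods, and the list of elided prefixes are all hypothetical names chosen for illustration, whitespace splitting stands in for a real tokenizer, and only the ASCII apostrophe is handled. It is one possible reading of the idea, not the behavior of FrenchAnalyzer or of any Lucene filter.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FrenchElisionSketch {
    // Hypothetical, non-exhaustive list of elided prefixes (j', t', m', ...).
    private static final String[] ELISIONS = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"};

    // Drop a leading elided prefix such as "m'" in "m'aime".
    // Tokens whose prefix is not in the list (e.g. "O'Reilly") are left untouched.
    static String stripElision(String token) {
        int apos = token.indexOf('\'');
        if (apos <= 0) {
            return token;
        }
        String prefix = token.substring(0, apos).toLowerCase(Locale.ROOT);
        for (String elision : ELISIONS) {
            if (elision.equals(prefix)) {
                return token.substring(apos + 1);
            }
        }
        return token;
    }

    // Remove accents by decomposing characters (NFD) and dropping combining marks.
    static String foldAccents(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Whitespace split -> elision stripping -> accent folding -> lowercasing.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = foldAccents(stripElision(raw)).toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u00e9" is e with an acute accent, escaped to keep the source ASCII-only.
        System.out.println(analyze("il m'aime l'\u00e9t\u00e9")); // prints [il, aime, ete]
    }
}
```

Keeping the two steps as separate functions mirrors the filter-chain design suggested in the thread: either step can be dropped or made optional (for example, to keep accents) without touching the other.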