At 12:13 PM -0400 9/27/07, Steven Rowe wrote:
>Chris Hostetter wrote:
>> : is there an analyzer which automatically converts all german special
>> : characters to their specific dissected form, such as ü to ue and ä to
>> : ae, etc.?!
>> See also the ISOLatin1TokenFilter which does this regardless of language.
>Actually, ISOLatin1TokenFilter does NOT convert /ü/ to /ue/, /ä/ to
>/ae/, etc.
>Instead, it converts /ü/ to /u/, /ä/ to /a/, etc.  It *does* convert /ß/
>to /ss/, though I've seen some people write that the correct
>substitution for /ß/ in German is /sz/ - I don't speak or read German,
>so I don't know.

You and lots of other people, including myself... There is indeed a "specific
dissected form": German speakers clearly understand that when an input
mechanism doesn't allow umlauted vowels (e.g. ASCII, non-German typewriters),
the /ue/, /ae/, etc. equivalents are to be used. But if maximally flexible
matching between input texts and queries is desired, an information system
used by non-German speakers has to account for them simply ignoring the umlaut
and entering /u/, /a/, etc. Meanwhile /ß/ needs to be matched as itself, /ss/,
/sz/ (/ß/ is read as 'ess zed'), and I expect even /b/.

So perhaps it would make sense to translate into a canonical form (/ü/ to /ue/
and /ß/ to /ss/) at both index and query time, and then also to emit synonym
(overlapping) tokens with /ue/ -> /u/, /sz/ -> /ss/, and perhaps even
/b/ -> /ss/.

(This is just thinking aloud and I'd love to be corrected by someone with more 
experience in this realm)

>Maybe there should be an option on ISOLatin1TokenFilter to use German
>substitutions, in addition to the current behavior of simply stripping

As for implementation, the first part could easily and flexibly be accomplished
with the current PatternReplaceFilter, and I'm thinking the second could be
done with an extension to that or, better yet, a new Filter which parses
synonymous tokens from a flat to an overlaid format, e.g. something on the
order of:

    <filter class="solr.PatternReplaceFilterFactory"
     tokensep="|"/>  <!-- tokensep is not currently implemented -->

or perhaps better,

    <filter class="solr.PatternReplaceFilterFactory"/>
    <filter class="solr.OverlayTokenFilterFactory"
     tokensep="|"/>   <!-- not currently implemented -->

which in my fantasy implementation would map:

    Müller -> Mueller|Muller
    Mueller -> Mueller|Muller
    Muller -> Muller

and could be run at index-time and/or query-time as appropriate.
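
For what the flat-to-overlaid step itself might look like, here is a rough
Python sketch, again not real Solr or Lucene code. The function name overlay is
hypothetical and tokensep is borrowed from the fantasy config above. The only
Lucene convention I am relying on is that a token with a position increment of
0 sits at the same position as the token before it.

```python
def overlay(flat_tokens, tokensep="|"):
    """Yield (term, position_increment) pairs from flat, separator-joined tokens.

    The first alternative of each flat token advances the position by 1;
    every further alternative gets increment 0 so that it overlays the first.
    """
    for flat in flat_tokens:
        increment = 1
        for term in flat.split(tokensep):
            if not term:
                continue  # skip empty alternatives such as "a||b"
            yield term, increment
            increment = 0


for term, increment in overlay(["Mueller|Muller", "Muller"]):
    print(term, increment)
```

Here "Mueller" and the first "Muller" end up at the same position, so a query
for either spelling would hit the same spot in the text.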

>Does anyone know if there are other (Latin-1-utilizing) languages
>besides German with standardized diacritic substitutions that involve
>something other than just stripping the diacritics?

I'm curious about this too.

- J.J.

