At 12:13 PM -0400 9/27/07, Steven Rowe wrote:
>Chris Hostetter wrote:
>> : is there an analyzer which automatically converts all german special
>> : characters to their specific dissected form, such as ü to ue and ä to
>> : ae, etc.?!
>>
>> See also the ISOLatin1TokenFilter which does this regardless of language.
>
>Actually, ISOLatin1TokenFilter does NOT convert /ü/ to /ue/, /ä/ to
>/ae/, etc.
>
>Instead, it converts /ü/ to /u/, /ä/ to /a/, etc.  It *does* convert /ß/
>to /ss/, though I've seen some people write that the correct
>substitution for /ß/ in German is /sz/ - I don't speak or read German,
>so I don't know.

You and lots of other people, including myself... There is indeed a "specific 
dissected form": German speakers clearly understand that when an input 
mechanism doesn't allow for umlauted vowels (e.g. ASCII, non-German 
typewriters), the /ue/, /ae/, etc. equivalents are to be used. But if maximally 
flexible matching between input texts and queries is desired, an information 
system used by non-German speakers has to account for them simply ignoring the 
umlaut and entering /u/, /a/, etc. Meanwhile /ß/ needs to be matched as itself, 
/ss/, /sz/ (/ß/ is read as 'ess zed'), and I expect even /b/.

So perhaps it would make sense to translate into a canonical form (/ü/ to /ue/ 
and /ß/ to /ss/) at both index and query time, and then also emit synonym 
(overlapping) tokens for /ue/ -> /u/, /sz/ -> /ss/, and perhaps even 
/b/ -> /ss/.
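
To make that concrete, here is a rough Python sketch of the idea (not Lucene or 
Solr code; the names GERMAN_FOLD, STRIP_FOLD and index_terms are made up for 
illustration). It puts the German-style canonical form first and adds the 
diacritic-stripped form as an extra term when it differs. It only covers the 
umlaut and /ß/ -> /ss/ cases, not the /sz/ or /b/ variants.

```python
# Hypothetical folding tables; a real filter would cover more characters.
GERMAN_FOLD = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})
STRIP_FOLD = str.maketrans({
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ß": "ss",
})


def canonical(token):
    """German-style substitution: /ü/ -> /ue/, /ß/ -> /ss/, etc."""
    return token.translate(GERMAN_FOLD)


def stripped(token):
    """Plain diacritic stripping: /ü/ -> /u/, /ä/ -> /a/, /ß/ -> /ss/."""
    return token.translate(STRIP_FOLD)


def index_terms(token):
    """Canonical term first, then the stripped form if it is different."""
    terms = [canonical(token)]
    alt = stripped(token)
    if alt not in terms:
        terms.append(alt)
    return terms
```

With this, index_terms("Müller") yields both "Mueller" and "Muller", so a 
query for either spelling would find a document containing "Müller".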

(This is just thinking aloud and I'd love to be corrected by someone with more 
experience in this realm)

>Maybe there should be an option on ISOLatin1TokenFilter to use German
>substitutions, in addition to the current behavior of simply stripping
>diacritics?

As for implementation, the first part could be accomplished easily and flexibly 
with the current PatternReplaceFilter, and I'm thinking the second could be 
done with an extension to that, or better yet a new Filter which converts 
synonymous tokens from a flat to an overlaid format, e.g. something on the 
order of:

    <!-- the tokensep attribute is not currently implemented -->
    <filter class="solr.PatternReplaceFilterFactory"
     pattern="(.*)(ü|ue)(.*)"
     replacement="$1ue$3|$1u$3"
     tokensep="|"
     replace="first"/>

or perhaps better,

    <filter class="solr.PatternReplaceFilterFactory"
     pattern="(.*)(ü|ue)(.*)"
     replacement="$1ue$3|$1u$3"
     replace="first"/>
    <filter class="solr.OverlayTokenFilterFactory"
     tokensep="|"/>   <!-- not currently implemented -->

which in my fantasy implementation would map:

    Müller -> Mueller|Muller
    Mueller -> Mueller|Muller
    Muller -> Muller

and could be run at index-time and/or query-time as appropriate.
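
The mapping above can be simulated outside Solr. The following is only a Python 
sketch of the imagined behaviour, not an existing filter: the function name 
expand and the (term, position increment) output shape are my assumptions, with 
an increment of 0 standing in for an overlapping (synonym) token.

```python
import re

# Same pattern and replacement as in the config above, in Python syntax.
PATTERN = re.compile(r"(.*)(ü|ue)(.*)")
REPLACEMENT = r"\1ue\3|\1u\3"
TOKEN_SEP = "|"


def expand(token):
    """Return (term, position_increment) pairs for one input token."""
    # Equivalent of replace="first": substitute at most one match.
    replaced = PATTERN.sub(REPLACEMENT, token, count=1)
    variants = []
    for variant in replaced.split(TOKEN_SEP):
        if variant not in variants:
            variants.append(variant)
    # First variant advances the position; the rest overlay it.
    return [(term, 1 if i == 0 else 0) for i, term in enumerate(variants)]


for word in ("Müller", "Mueller", "Muller"):
    print(word, "->", expand(word))
```

One caveat with a pattern this blunt: any /ue/ is treated as a transliterated 
umlaut, so a word like "Quelle" would also pick up a bogus "Qulle" variant.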

>Does anyone know if there are other (Latin-1-utilizing) languages
>besides German with standardized diacritic substitutions that involve
>something other than just stripping the diacritics?

I'm curious about this too.

- J.J.
