At 12:13 PM -0400 9/27/07, Steven Rowe wrote:
>Chris Hostetter wrote:
>> : is there an analyzer which automatically converts all German special
>> : characters to their specific dissected form, such as ü to ue and ä to
>> : ae, etc.?!
>>
>> See also the ISOLatin1TokenFilter which does this regardless of language.
>
>Actually, ISOLatin1TokenFilter does NOT convert /ü/ to /ue/, /ä/ to
>/ae/, etc.
>
>Instead, it converts /ü/ to /u/, /ä/ to /a/, etc. It *does* convert /ß/
>to /ss/, though I've seen some people write that the correct
>substitution for /ß/ in German is /sz/ - I don't speak or read German,
>so I don't know.

You and lots of other people, including myself... Thus while there is
indeed a "specific dissected form" (German speakers clearly understand
that when an input mechanism doesn't allow for umlauted vowels, e.g.
ASCII or non-German typewriters, the /ue/, /ae/, etc. equivalents are
to be used), if maximally flexible matching between input texts and
queries is desired, an information system used by non-German speakers
has to account for them simply ignoring the umlaut and entering /u/,
/a/, etc., while /ß/ needs to be matched as itself, /ss/, /sz/ (/ß/ is
read as 'ess zed'), and I expect even /b/.

So perhaps it would make sense to translate into a canonical format
(/ü/ to /ue/ and /ß/ to /ss/) at both index and query time, but then
also to emit synonym (overlapping) tokens with /ue/ -> /u/,
/sz/ -> /ss/, and perhaps even /b/ -> /ss/. (This is just thinking
aloud, and I'd love to be corrected by someone with more experience in
this realm.)

>Maybe there should be an option on ISOLatin1TokenFilter to use German
>substitutions, in addition to the current behavior of simply stripping
>diacritics?

As for implementation, the first part could easily and flexibly be
accomplished with the current PatternReplaceFilter, and I'm thinking
the second could be done with an extension to that or, better yet, a
new Filter which allows parsing synonymous tokens from a flat to an
overlaid format, e.g.
something on the order of:

  <!-- tokensep: not currently implemented -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="(.*)(ü|ue)(.*)"
          replacement="$1ue$3|$1u$3"
          tokensep="|"
          replace="first"/>

or perhaps better,

  <filter class="solr.PatternReplaceFilterFactory"
          pattern="(.*)(ü|ue)(.*)"
          replacement="$1ue$3|$1u$3"
          replace="first"/>
  <filter class="solr.OverlayTokenFilterFactory" tokensep="|"/> <!-- not currently implemented -->

which in my fantasy implementation would map:

  Müller  -> Mueller|Muller
  Mueller -> Mueller|Muller
  Muller  -> Muller

and could be run at index-time and/or query-time as appropriate.

>Does anyone know if there are other (Latin-1-utilizing) languages
>besides German with standardized diacritic substitutions that involve
>something other than just stripping the diacritics?

I'm curious about this too.

- J.J.
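
For illustration only, the Müller/Mueller/Muller mapping described in the message can be sketched in plain Python, outside of Solr. This is a rough sketch under stated assumptions, not a Lucene or Solr API: the names CANONICAL, DIGRAPH and overlay_tokens are hypothetical, the rules are extended from /ü/ to /ä/ and /ö/ as well, and the /sz/ and /b/ variants for /ß/ are left out.

```python
import re

# Hypothetical canonicalisation table (assumption: same rule for all three
# umlauted vowels, plus ß -> ss as in the message).
CANONICAL = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

# Vowel + "e" digraphs that are collapsed to build the "umlaut ignored" synonym.
DIGRAPH = re.compile(r"([aouAOU])e")


def overlay_tokens(token):
    """Return the canonical form, followed by the stripped form if it differs."""
    canonical = "".join(CANONICAL.get(ch, ch) for ch in token)
    stripped = DIGRAPH.sub(r"\1", canonical)
    if stripped == canonical:
        return [canonical]
    return [canonical, stripped]


if __name__ == "__main__":
    for word in ("Müller", "Mueller", "Muller"):
        # Tokens joined with "|" to mirror the flat format in the message.
        print(word, "->", "|".join(overlay_tokens(word)))
```

Like the regular expression in the message, this collapses every vowel + "e" digraph, whether or not it came from an umlaut, so an English word such as "blue" also yields the extra token "blu". Whether that over-generation is acceptable depends on the collection being indexed.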