[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12858144#action_12858144 ] Robert Muir commented on LUCENE-1343: - OK! I think we have a good solution here!. We can use ICU's Normalizer2 to implement this, by simply creating a custom normalization mapping. This way we can meet multiple use-cases, e.g. someone wants to remove diacritics, someone else doesn't. And we get solid unicode behavior and high performance to boot. So I will keep this issue open, I think the best solution is to take the accent-folding mappings here (or use the ones in AsciiFoldingFilter?) and create a .txt file of mappings, passing it to gennorm2 along with NFKC case fold mappings. This way we can implement this on top of LUCENE-2399, all compiled to an efficient binary form with no code. I'll take a shot at this once LUCENE-2399 is resolved. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786941#action_12786941 ] DM Smith commented on LUCENE-1343: -- I also am dubious about a general purpose folding filter that maps letters to their ASCII look-alike and agree that folding is language dependent. May Americans are illiterate when it comes to text with diacritics and NSM. Personally I'm nearly illiterate. I think having prominent folding filters without adequate explanation about their pitfalls or usefulness may lead illiterates into a false sense of sufficiency. If it makes sense to have a filter for TR39 I think that should be a separate issue. If that's what this issue is all about then it's description should be modified. I think this should otherwise be closed as a bad idea. Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not. Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags? A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786946#action_12786946 ] Robert Muir commented on LUCENE-1343: - bq. Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not. The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. it removes tone marks... but this might not be what you want (depending on what that is), if you are dealing with polytonic Greek (sorry for my ignorance of the biblical test you are looking at, but I think it is ancient Greek?) bq. Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags? This depends on your use case, and then you have dagesh,shin dot, too... These are all NSMs. But this is going to depend on the user, and I think every person will need their own, they can use CharFilter or other ways of defining these tables. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786968#action_12786968 ] DM Smith commented on LUCENE-1343: -- {quote} bq. Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not. The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. it removes tone marks... but this might not be what you want (depending on what that is), if you are dealing with polytonic Greek (sorry for my ignorance of the biblical test you are looking at, but I think it is ancient Greek?) {quote} Yes, I'm referring to ancient Greek (grc, not el) and they are tone and breathing marks. Most ancient texts did not have these marks but modern do. Even some modern representations of the ancient. While I have several semesters of koine Greek under my belt and might be wrong, there may be ambiguities where two words have the same letters but differ on marks, but they are infrequent (I don't know of any). The GreekLowerCaseFilter appears to only do some of the work and only works on composed characters. My question is not whether I'd find the filter useful, but whether it'd be a useful addition to Lucene. {quote} bq. Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags? This depends on your use case, and then you have dagesh,shin dot, too... These are all NSMs. {quote} I have a terrible habit of not being exact or using the proper terms. Shame on me. I meant that the latter strip all other marks. bq. But this is going to depend on the user, and I think every person will need their own, they can use CharFilter or other ways of defining these tables. If there is no general purpose contribution, then it should not be part of Lucene and I'll have my own. When I do work them up, I'll create an issue or two and attach the results. If they are deemed useful then they can be added to Lucene, otherwise ignored. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786971#action_12786971 ] Robert Muir commented on LUCENE-1343: - {quote} Yes, I'm referring to ancient Greek (grc, not el) and they are tone and breathing marks. Most ancient texts did not have these marks but modern do. Even some modern representations of the ancient. While I have several semesters of koine Greek under my belt and might be wrong, there may be ambiguities where two words have the same letters but differ on marks, but they are infrequent (I don't know of any). {quote} I guess I brought this up because this is where you have several situations where case folding and normalization interact, eg. applying FC_NFKC set when case folding so that later NFK[CD] normalization will be closed, I know this is supposed to solve various ways the YPOGEGRAMMENI can be implemented but I forget the details... This is why I think, the general purpose contribution should be case folding, normalization, and the stuff like this (FC_NFKC set) to make sure they work together... If you later want to apply something more specialized like StringPrep, you need this logic anyway, see http://www.ietf.org/rfc/rfc3454.txt (especially section 3.2) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786689#action_12786689 ] Mark Miller commented on LUCENE-1343: - Mr Muir, can you take a look at this? Offer anything over the ASCIIFoldingFilter? If not, we should close, if so, what do you recommend? A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786701#action_12786701 ] Robert Muir commented on LUCENE-1343: - The big picture here and all these other duplicated normalization issues across jira is related to the outdated unicode support in the JDK. This issue speaks of removing diacritical marks / NSM's, but the underlying issue is missing unicode normalization, duplicated here (incorrectly named): LUCENE-1215 and also here: LUCENE-1488 (disclaimer: my impl) Speaking for the accent removal: In truth I do not think we should be simply removing NSMs because in most cases, they are there for a reason. For example, they are diacritics in a lot of european languages, but for many eastern languages they are the actual vowels. (i.e. all the indic scripts) We need to separate the issue of missing unicode normalization (which is clearly something lucene needs), from the issue of removing diacritics (which is language-specific and doing it based on unicode properties is inappropriate). Finally just normalizing unicode in Lucene by itself is not very useful, because there is a careful interaction with other processes and attention needs to be paid to the order in which filters are run. For example, its interaction with case folding can be a bit tricky. If you are interested in this issue I urge you to read the javadocs writeup I placed in the ICUNormalizationFilter in LUCENE-1488. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786712#action_12786712 ] Ken Krugler commented on LUCENE-1343: - Just to make sure this point doesn't get lost in the discussion over normalization - the issue of visual normalization is one that I think ISOLatin1AccentFilter originally was trying to address. Specifically how to fold together forms of letters that a user, when typing, might consider equivalent. This is indeed language specific, and re-implementing support that's already in ICU4J is clearly a Bad Idea. I think there's value in a general normalizer that implements the Unicode Consortium's algorithm/data for normalization of int'l domain names, as this is intended to avoid visual spoofing of domain names. Don't know/haven't tracked if or when this is going into ICU4J. But (similar to ICU generic sorting) it provides a useful locale-agnostic approach that would work well-enough for most Lucene use cases. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786724#action_12786724 ] Robert Muir commented on LUCENE-1343: - Hi Ken, such functionality does exist, although it is new and I think still changing (you are talking about StringPrep/IDN/etc?). If a filter for this is desired, we can do it with ICU, though I think its relatively new (probably not optimized, only works on String, etc etc) I still think even this is stupid, because unicode encodes characters, not glyphs. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622476#action_12622476 ] Erik Hatcher commented on LUCENE-1343: -- {quote} Unit tests are the best way to document the many ways this thing can work. {quote} gets a judges score of 11 from me. Gold for Lance for Quote of the Day. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622607#action_12622607 ] Robert Haschart commented on LUCENE-1343: - The UnicodeNormalizationFilter does use the decompose normalization portion of the icu4j library as a starting point. However even with that there are several instances where the normalizer code does not decompose a character into an unaccented character and a accent mark, a notable one being ( Ł - L ) so the UnicodeNormalizationFilter start with the approach you outlined, perform a decompose normalization followed by discarding all non-spacing modifier characters, and then can go on from there to further normalize the data by folding the additional characters that aren't handled by the decompose normalization onto their Latin1 lookalikes. -Robert A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622746#action_12622746 ] Ken Krugler commented on LUCENE-1343: - Hi Robert, So given that you and the Unicode consortium seem to be working on the same problem (normalizing visually similar characters), how similar are your tables to the ones that have been developed to deter spoofing of int'l domain names? -- Ken A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622345#action_12622345 ] Lance Norskog commented on LUCENE-1343: --- Some languages like Cyrillic have a standard latin-1 transliteration, and deserve their own filters. Cyrillic is one case of this. It is based on three alphabets: 1/3 latin, 1/3 greek, and 1/3 new characters for 'ya/ye', 'ts', 'sh', 'ch', 'zh', and 'sh-ch' (fiSH CHips!). Unit tests are the best way to document the many ways this thing can work. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622432#action_12622432 ] Ken Krugler commented on LUCENE-1343: - Hi Robert, FWIW, the issues being discussed here are very similar to those covered by the [Unicode Security Considerations|http://www.unicode.org/reports/tr36/] technical report #36, and associated data found in the [Unicode Security Mechanisms|http://www.unicode.org/reports/tr39/] technical report #39. The fundamental issue for int'l domain name spoofing is detecting when two sequences of Unicode code points will render as similar glyphs...which is basically the same issue you're trying to address here, so that when you search for something you'll find all terms that look similar. So for a more complete (though undoubtedly slower bigger) solution, I'd suggest using ICU4J to do a NFKD normalization, then toss any combining/spacing marks, lower-case the result, and finally apply mappings using the data tables found in the technical report #39 referenced above. -- Ken A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615770#action_12615770 ] Hoss Man commented on LUCENE-1343: -- Random related comment (just because this issue seemed like a good place to put it) People may also want to consider constructing a Filter based on the substitution tables from the perl Text::Unidecode module... http://search.cpan.org/~sburke/Text-Unidecode/ http://interglacial.com/~sburke/tpj/as_html/tpj22.html ...i have no idea how it's behavior compares to the UnicodeNormalizationFilter, just that it seems to have similar goals. A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615775#action_12615775 ] Steven Rowe commented on LUCENE-1343: - Hi Robert, My comments below assume you're intrestested in having this code hosted in the Lucene source repository - please disregard if that's not the case. Have you seen the [HowToContribute page on the Lucene wiki|http://wiki.apache.org/lucene-java/HowToContribute]? It outlines some of the basics concerning code submissions. A couple of things I noticed that need to be addressed before the code will be accepted: # Tab characters should be converted to spaces # Indentation increment should be two spaces # Test(s) should be moved from the UnicodeNormalizationFilterFactory.main() method into standalone class(es) that extend LuceneTestCase # More/more explicit javadocs - for example, you should describe the set of provided transformations (e.g. Cyrillic diacritic stripping is included). # Solr is a separate code base, so the UnicodeNormalizationFilterFactory should be moved to a Solr JIRA issue # Because it has a dependency on the ICU jar, this contribution will have to live in the contrib/ area -- the Java packages name should be adjusted accordingly. # The submission should be repackaged as a patch (instructions available on the above-linked wiki page). A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - Key: LUCENE-1343 URL: https://issues.apache.org/jira/browse/LUCENE-1343 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Robert Haschart Priority: Minor Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e. However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é ) The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character such as Ł but which to make searching easier you want to fold onto the latin1 lookalike version L . The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł - L ) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]