[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2010-04-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12858144#action_12858144
 ] 

Robert Muir commented on LUCENE-1343:
-

OK! I think we have a good solution here!

We can use ICU's Normalizer2 to implement this, by simply creating a custom 
normalization mapping.
This way we can meet multiple use-cases, e.g. someone wants to remove 
diacritics, someone else doesn't.

And we get solid unicode behavior and high performance to boot.

So I will keep this issue open. I think the best solution is to take the 
accent-folding mappings here (or use the ones in ASCIIFoldingFilter?) and 
create a .txt file of mappings, passing it to gennorm2 along with the NFKC 
case fold mappings.

This way we can implement this on top of LUCENE-2399, all compiled to an 
efficient binary form with no code.
I'll take a shot at this once LUCENE-2399 is resolved.

 A replacement for ISOLatin1AccentFilter that does a more thorough job of 
 removing diacritical marks or non-spacing modifiers.
 -

 Key: LUCENE-1343
 URL: https://issues.apache.org/jira/browse/LUCENE-1343
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Robert Haschart
Priority: Minor
 Attachments: normalizer.jar, UnicodeCharUtil.java, 
 UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java


 The ISOLatin1AccentFilter takes Unicode characters that have diacritical 
 marks and replaces them with a version of that character with the diacritical 
 mark removed.  For example é becomes e.  However, another equally valid way of 
 representing an accented character in Unicode is to have the unaccented 
 character followed by a non-spacing modifier character (like this: é).
 The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode 
 characters at all.  Additionally, there are some instances where a word will 
 contain what looks like an accented character but is actually considered to 
 be a separate unaccented character, such as Ł, which you nevertheless want to 
 fold onto its Latin-1 lookalike L to make searching easier.
 The UnicodeNormalizationFilter can filter out accents and diacritical marks 
 whether they occur as composed characters or decomposed characters.  It can 
 also handle the cases described above, where characters that look like they 
 have diacritics (but don't) are to be folded onto the letter that they look 
 like (Ł -> L).
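
 As a rough illustration of the two representations described above, the 
 following sketch uses only the JDK's java.text.Normalizer (not the attached 
 filter and not ICU). The class and method names are made up for this example. 
 It shows that the composed and decomposed spellings of the same accented 
 letter are different strings until normalized, and that Ł has no canonical 
 decomposition at all, so normalization alone cannot fold it onto L.

```java
import java.text.Normalizer;

// Illustrative only: class and method names are hypothetical, not part of Lucene.
final class AccentForms {
    // Composed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    static final String COMPOSED = "\u00E9";
    // Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
    static final String DECOMPOSED = "e\u0301";

    // Canonical composition (NFC) maps both spellings to the same string.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD) splits precomposed letters wherever
    // Unicode defines a decomposition for them.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.equals(DECOMPOSED));           // false: different code points
        System.out.println(nfc(COMPOSED).equals(nfc(DECOMPOSED))); // true: identical after NFC
        System.out.println(nfd("\u0141").equals("\u0141"));        // true: U+0141 has no decomposition
    }
}
```

 A filter that only looks up precomposed characters in a table would therefore 
 miss the decomposed spelling unless the text is normalized first.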

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-07 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786941#action_12786941
 ] 

DM Smith commented on LUCENE-1343:
--

I also am dubious about a general purpose folding filter that maps letters to 
their ASCII look-alike and agree that folding is language dependent.

Many Americans are illiterate when it comes to text with diacritics and NSM. 
Personally I'm nearly illiterate. I think having prominent folding filters 
without adequate explanation about their pitfalls or usefulness may lead 
illiterates into a false sense of sufficiency.

If it makes sense to have a filter for TR39, I think that should be a separate 
issue. If that's what this issue is all about, then its description should be 
modified.

I think this should otherwise be closed as a bad idea.

Robert Muir, would it make sense to have a Greek filter that strips diacritics? 
My thought is that if the letter is Greek then the diacritics would be removed, 
but otherwise they would not.

Similar question for Hebrew, I see value in two filters: one would strip 
cantillation and the other, vowel points. Or would it be better to have one 
that can do both depending on flags?




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786946#action_12786946
 ] 

Robert Muir commented on LUCENE-1343:
-

bq. Robert Muir, Would it make sense to have a Greek filter that strips 
diacritics? My thought is that if the letter is Greek then the diacritics would 
be removed, but otherwise it would not.

The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. It 
removes tone marks... but this might not be what you want (depending on what 
that is), if you are dealing with polytonic Greek (sorry for my ignorance of 
the biblical text you are looking at, but I think it is ancient Greek?)

bq. Similar question for Hebrew, I see value in two filters: one would strip 
cantillation and the other, vowel points. Or would it be better to have one 
that can do both depending on flags?

This depends on your use case, and then you have dagesh, shin dot, too... These 
are all NSMs. But this is going to depend on the user, and I think every person 
will need their own. They can use CharFilter or other ways of defining these 
tables.
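
To make the "one filter with flags" idea from the quoted question concrete, 
here is a minimal plain-Java sketch. It is not a Lucene filter, and the class 
name, method names, and flags are hypothetical. It assumes the Unicode Hebrew 
block layout: cantillation marks at U+0591..U+05AF, and points at 
U+05B0..U+05BD, U+05BF, U+05C1, U+05C2, U+05C4, U+05C5 and U+05C7. Note that 
this grouping puts dagesh and shin dot in with the vowel points, which is 
exactly the kind of choice that may differ per user.

```java
// Illustrative only: a hypothetical helper, not an existing Lucene class.
final class HebrewMarkStripper {

    // Removes cantillation marks and/or points, depending on the flags.
    static String strip(String text, boolean cantillation, boolean points) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i); // all Hebrew marks are in the BMP
            if (cantillation && isCantillation(c)) continue;
            if (points && isPoint(c)) continue;
            out.append(c);
        }
        return out.toString();
    }

    // Hebrew accents (cantillation): U+0591..U+05AF.
    static boolean isCantillation(char c) {
        return c >= '\u0591' && c <= '\u05AF';
    }

    // Hebrew points, including dagesh (U+05BC) and shin/sin dots (U+05C1, U+05C2).
    // Punctuation such as maqaf (U+05BE) and sof pasuq (U+05C3) is left alone.
    static boolean isPoint(char c) {
        return (c >= '\u05B0' && c <= '\u05BD')
            || c == '\u05BF' || c == '\u05C1' || c == '\u05C2'
            || c == '\u05C4' || c == '\u05C5' || c == '\u05C7';
    }
}
```

Calling strip(text, true, false) keeps the pointing but drops cantillation, 
while strip(text, true, true) leaves only the consonantal text.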





[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-07 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786968#action_12786968
 ] 

DM Smith commented on LUCENE-1343:
--

{quote}
 bq.   Robert Muir, Would it make sense to have a Greek filter that strips 
diacritics? My thought is that if the letter is Greek then the diacritics would 
be removed, but otherwise it would not.

The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. It 
removes tone marks... but this might not be what you want (depending on what 
that is), if you are dealing with polytonic Greek (sorry for my ignorance of 
the biblical text you are looking at, but I think it is ancient Greek?)
{quote}

Yes, I'm referring to ancient Greek (grc, not el) and they are tone and 
breathing marks. Most ancient texts did not have these marks but modern do. 
Even some modern representations of the ancient. I have several semesters of 
Koine Greek under my belt and might be wrong, but although there may be 
ambiguities where two words have the same letters but differ on marks, they 
are infrequent (I don't know of any).

The GreekLowerCaseFilter appears to only do some of the work and only works on 
composed characters.
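
For illustration, a script-restricted stripper along the lines of the Greek 
question could look like the sketch below. It is plain Java built on the JDK's 
java.text.Normalizer, not a Lucene TokenFilter, and the class name is 
hypothetical. It decomposes first so that composed and decomposed input are 
treated alike, drops non-spacing marks only when they follow a Greek base 
letter, and recomposes whatever is left. One side effect to be aware of: the 
final NFC step also recomposes non-Greek text that arrived decomposed.

```java
import java.text.Normalizer;

// Illustrative only: a hypothetical helper, not an existing Lucene class.
final class GreekMarkStripper {

    static String strip(String text) {
        // Decompose so precomposed polytonic letters expose their marks.
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        boolean afterGreek = false; // was the last base character Greek?
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                // Marks inherit the script of their base, so decide by the base.
                if (!afterGreek) out.appendCodePoint(cp);
                continue;
            }
            afterGreek = Character.UnicodeScript.of(cp) == Character.UnicodeScript.GREEK;
            out.appendCodePoint(cp);
        }
        // Recompose what remains, e.g. accents on Latin letters.
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }
}
```

Whether dropping breathing marks and accents is acceptable still depends on 
the text and the users, as discussed above.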

My question is not whether I'd find the filter useful, but whether it'd be a 
useful addition to Lucene.

{quote}
bq.   Similar question for Hebrew, I see value in two filters: one would strip 
cantillation and the other, vowel points. Or would it be better to have one 
that can do both depending on flags?

This depends on your use case, and then you have dagesh,shin dot, too... These 
are all NSMs.
{quote}
I have a terrible habit of not being exact or using the proper terms. Shame on 
me. I meant that the latter would strip all other marks.

bq. But this is going to depend on the user, and I think every person will need 
their own, they can use CharFilter or other ways of defining these tables.

If there is no general purpose contribution, then it should not be part of 
Lucene and I'll have my own.

When I do work them up, I'll create an issue or two and attach the results. If 
they are deemed useful then they can be added to Lucene, otherwise ignored.




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786971#action_12786971
 ] 

Robert Muir commented on LUCENE-1343:
-

{quote}
Yes, I'm referring to ancient Greek (grc, not el) and they are tone and 
breathing marks. Most ancient texts did not have these marks but modern do. 
Even some modern representations of the ancient. I have several semesters of 
Koine Greek under my belt and might be wrong, but although there may be 
ambiguities where two words have the same letters but differ on marks, they 
are infrequent (I don't know of any).
{quote}

I guess I brought this up because this is where you have several situations 
where case folding and normalization interact, e.g. applying the FC_NFKC set 
when case folding so that later NFK[CD] normalization will be closed. I know 
this is supposed to solve various ways the YPOGEGRAMMENI can be implemented, 
but I forget the details...

This is why I think the general-purpose contribution should be case folding, 
normalization, and the stuff like this (FC_NFKC set) to make sure they work 
together...

If you later want to apply something more specialized like StringPrep, you need 
this logic anyway, see http://www.ietf.org/rfc/rfc3454.txt (especially section 
3.2) 
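
A small JDK-only sketch of why case folding and compatibility normalization 
have to be designed together: for some characters, lowercasing first and then 
applying NFKC yields an uppercase letter again. The class name is 
hypothetical, and String.toLowerCase(Locale.ROOT) is only a stand-in for real 
Unicode case folding, which the JDK does not provide. The two sample 
characters, U+2102 (double-struck capital C) and U+03D2 (Greek upsilon with 
hook symbol), are uppercase letters with no lowercase mapping of their own, 
but their compatibility decompositions are ordinary capitals.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative only: class and method names are hypothetical.
final class FoldOrder {

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Stand-in for case folding; the JDK has no full Unicode case folding API.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lowercase first: characters without a lowercase mapping pass through,
    // and NFKC may then turn them into an ordinary capital letter.
    static String lowerThenNfkc(String s) {
        return nfkc(lower(s));
    }

    // Normalize first: the compatibility mapping is applied before
    // lowercasing, so the result is fully lowercased.
    static String nfkcThenLower(String s) {
        return lower(nfkc(s));
    }

    public static void main(String[] args) {
        System.out.println(lowerThenNfkc("\u2102")); // C
        System.out.println(nfkcThenLower("\u2102")); // c
    }
}
```

Closing this gap for every affected character is the purpose of the FC_NFKC 
mappings mentioned above.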





[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786689#action_12786689
 ] 

Mark Miller commented on LUCENE-1343:
-

Mr Muir, can you take a look at this? Does it offer anything over the 
ASCIIFoldingFilter? If not, we should close; if so, what do you recommend?




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786701#action_12786701
 ] 

Robert Muir commented on LUCENE-1343:
-

The big picture here, and in all these other duplicated normalization issues 
across JIRA, is related to the outdated Unicode support in the JDK. 

This issue speaks of removing diacritical marks / NSMs, but the underlying 
issue is missing Unicode normalization, duplicated here (incorrectly named): 
LUCENE-1215 and also here: LUCENE-1488 (disclaimer: my impl)

Speaking for the accent removal: in truth I do not think we should be simply 
removing NSMs, because in most cases they are there for a reason. For example, 
they are diacritics in a lot of European languages, but for many eastern 
languages they are the actual vowels (i.e. all the Indic scripts).

We need to separate the issue of missing unicode normalization (which is 
clearly something lucene needs), from the issue of removing diacritics (which 
is language-specific and doing it based on unicode properties is inappropriate).

Finally just normalizing unicode in Lucene by itself is not very useful, 
because there is a careful interaction with other processes and attention needs 
to be paid to the order in which filters are run. For example, its interaction 
with case folding can be a bit tricky. If you are interested in this issue I 
urge you to read the javadocs writeup I placed in the ICUNormalizationFilter in 
LUCENE-1488.
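
To illustrate the point about vowels, here is a small JDK-only sketch (the 
class name is hypothetical) of the "decompose, then drop everything in 
general category Mn" approach. It does what one expects for a French word, 
but in Devanagari the dependent vowel signs U+0947 and U+0941 are also 
non-spacing marks, so two different syllables collapse to the same bare 
consonant.

```java
import java.text.Normalizer;

// Illustrative only: shows the behaviour being argued against.
final class BlindMarkStripper {

    // Decomposes canonically, then removes every non-spacing mark (Mn),
    // regardless of script or language.
    static String stripAllNonSpacingMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        // "caf" + U+00E9 becomes "cafe": the intended accent folding.
        System.out.println(stripAllNonSpacingMarks("caf\u00E9"));
        // KA + vowel sign E and KA + vowel sign U both become bare KA.
        System.out.println(stripAllNonSpacingMarks("\u0915\u0947")
            .equals(stripAllNonSpacingMarks("\u0915\u0941"))); // true
    }
}
```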





[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-06 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786712#action_12786712
 ] 

Ken Krugler commented on LUCENE-1343:
-

Just to make sure this point doesn't get lost in the discussion over 
normalization - the issue of visual normalization is one that I think 
ISOLatin1AccentFilter originally was trying to address. Specifically how to 
fold together forms of letters that a user, when typing, might consider 
equivalent.

This is indeed language specific, and re-implementing support that's already in 
ICU4J is clearly a Bad Idea.

I think there's value in a general normalizer that implements the Unicode 
Consortium's algorithm/data for normalization of int'l domain names, as this is 
intended to avoid visual spoofing of domain names.

Don't know/haven't tracked if or when this is going into ICU4J. But (similar to 
ICU generic sorting) it provides a useful locale-agnostic approach that would 
work well enough for most Lucene use cases.




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786724#action_12786724
 ] 

Robert Muir commented on LUCENE-1343:
-

Hi Ken, such functionality does exist, although it is new and I think still 
changing (you are talking about StringPrep/IDN/etc?).

If a filter for this is desired, we can do it with ICU, though I think it's 
relatively new (probably not optimized, only works on String, etc.)

I still think even this is stupid, because unicode encodes characters, not 
glyphs.




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-14 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622476#action_12622476
 ] 

Erik Hatcher commented on LUCENE-1343:
--

{quote}
Unit tests are the best way to document the many ways this thing can work.
{quote}

gets a judge's score of 11 from me.  Gold for Lance for Quote of the Day.




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-14 Thread Robert Haschart (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622607#action_12622607
 ] 

Robert Haschart commented on LUCENE-1343:
-

The UnicodeNormalizationFilter does use the decompose normalization 
portion of the icu4j library as a starting point.  However, even with 
that there are several instances where the normalizer code does not 
decompose a character into an unaccented character and an accent mark, a 
notable one being (Ł -> L).  So the UnicodeNormalizationFilter starts 
with the approach you outlined, performing a decompose normalization 
followed by discarding all non-spacing modifier characters, and can then 
go on from there to further normalize the data by folding the additional 
characters that aren't handled by the decompose normalization onto their 
Latin1 lookalikes.
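
The three steps described above can be sketched roughly as follows. This is 
not the attached UnicodeNormalizationFilter: it uses the JDK's 
java.text.Normalizer instead of icu4j, the class name is hypothetical, and 
the lookalike table is a tiny made-up sample rather than the filter's real 
mapping data.

```java
import java.text.Normalizer;
import java.util.Map;

// Illustrative only: a hypothetical helper with a sample folding table.
final class FoldingPipeline {

    // Letters that have no canonical decomposition but look like an
    // accented Latin letter. Sample entries only.
    private static final Map<Character, String> LOOKALIKES = Map.of(
        '\u0141', "L", '\u0142', "l",   // L with stroke
        '\u00D8', "O", '\u00F8', "o",   // O with stroke
        '\u0110', "D", '\u0111', "d");  // D with stroke

    static String fold(String text) {
        // Step 1: decompose, so accented letters become base + mark.
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            // Step 2: discard non-spacing marks (BMP only in this sketch).
            if (Character.getType(c) == Character.NON_SPACING_MARK) continue;
            // Step 3: fold remaining lookalikes through the table.
            String mapped = LOOKALIKES.get(c);
            if (mapped != null) out.append(mapped); else out.append(c);
        }
        return out.toString();
    }
}
```

With this table, the Polish place name written with L with stroke, o with 
acute, d, and z with acute folds to "Lodz"; steps 1 and 2 handle the two 
accented letters and step 3 handles the first letter.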

-Robert









[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-14 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622746#action_12622746
 ] 

Ken Krugler commented on LUCENE-1343:
-

Hi Robert,

So given that you and the Unicode consortium seem to be working on the same 
problem (normalizing visually similar characters), how similar are your tables 
to the ones that have been developed to deter spoofing of int'l domain names?

-- Ken




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-13 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622345#action_12622345
 ] 

Lance Norskog commented on LUCENE-1343:
---

Some scripts, such as Cyrillic, have a standard latin-1 transliteration, and 
deserve their own filters. 

Cyrillic is one case of this. It is based on three alphabets: 1/3 latin, 1/3 
greek, and 1/3 new characters for 'ya/ye', 'ts', 'sh', 'ch', 'zh', and 'sh-ch' 
(fiSH CHips!).

Unit tests are the best way to document the many ways this thing can work.








[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-13 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622432#action_12622432
 ] 

Ken Krugler commented on LUCENE-1343:
-

Hi Robert,

FWIW, the issues being discussed here are very similar to those covered by the 
[Unicode Security Considerations|http://www.unicode.org/reports/tr36/] 
technical report #36, and associated data found in the [Unicode Security 
Mechanisms|http://www.unicode.org/reports/tr39/] technical report #39.

The fundamental issue for int'l domain name spoofing is detecting when two 
sequences of Unicode code points will render as similar glyphs...which is 
basically the same issue you're trying to address here, so that when you search 
for something you'll find all terms that look similar.

So for a more complete (though undoubtedly slower and bigger) solution, I'd 
suggest using ICU4J to do an NFKD normalization, then toss any combining/spacing 
marks, lower-case the result, and finally apply mappings using the data tables 
found in the technical report #39 referenced above.

-- Ken
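
A rough sketch of that four-step pipeline, under some loud assumptions: it 
uses the JDK's java.text.Normalizer rather than ICU4J so it runs standalone, 
"combining/spacing marks" is read as the Unicode general categories Mn, Mc 
and Me, and the CONFUSABLES map is a three-entry hand-written stand-in, not 
the real data tables from technical report #39.  The class and method names 
are made up for the example.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

public class SkeletonSketch {
    // Placeholder entries only; a real table would be loaded from data files.
    private static final Map<Integer, String> CONFUSABLES = Map.of(
        0x0430, "a",   // CYRILLIC SMALL LETTER A
        0x0435, "e",   // CYRILLIC SMALL LETTER IE
        0x043E, "o"    // CYRILLIC SMALL LETTER O
    );

    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    static String skeleton(String input) {
        // 1. Compatibility decomposition (also expands ligatures and the like).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);

        // 2. Toss the marks.
        StringBuilder stripped = new StringBuilder();
        decomposed.codePoints()
                  .filter(cp -> !isMark(cp))
                  .forEach(stripped::appendCodePoint);

        // 3. Lower-case, locale-independently.
        String lowered = stripped.toString().toLowerCase(Locale.ROOT);

        // 4. Map visually similar code points onto one representative.
        StringBuilder out = new StringBuilder();
        lowered.codePoints().forEach(cp -> {
            String mapped = CONFUSABLES.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(skeleton("\uFB01anc\u00E9"));  // ligature fi + é
        System.out.println(skeleton("P\u0430ypal"));      // Cyrillic а in the middle
    }
}
```

Applying the same function at index time and at query time is what would 
make look-alike terms match each other.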




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-07-22 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615770#action_12615770
 ] 

Hoss Man commented on LUCENE-1343:
--

Random related comment (just because this issue seemed like a good place to put 
it)

People may also want to consider constructing a Filter based on the 
substitution tables from the perl Text::Unidecode module...

http://search.cpan.org/~sburke/Text-Unidecode/
http://interglacial.com/~sburke/tpj/as_html/tpj22.html

...I have no idea how its behavior compares to the UnicodeNormalizationFilter, 
just that it seems to have similar goals.




[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-07-22 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615775#action_12615775
 ] 

Steven Rowe commented on LUCENE-1343:
-

Hi Robert,

My comments below assume you're interested in having this code hosted in the 
Lucene source repository - please disregard if that's not the case.

Have you seen the [HowToContribute page on the Lucene 
wiki|http://wiki.apache.org/lucene-java/HowToContribute]?  It outlines some of 
the basics concerning code submissions.

A couple of things I noticed that need to be addressed before the code will be 
accepted:

# Tab characters should be converted to spaces
# Indentation increment should be two spaces
# Test(s) should be moved from the UnicodeNormalizationFilterFactory.main() 
method into standalone class(es) that extend LuceneTestCase
# More/more explicit javadocs - for example, you should describe the set of 
provided transformations (e.g. Cyrillic diacritic stripping is included).
# Solr is a separate code base, so the UnicodeNormalizationFilterFactory should 
be moved to a Solr JIRA issue
# Because it has a dependency on the ICU jar, this contribution will have to 
live in the contrib/ area -- the Java package name should be adjusted 
accordingly.
# The submission should be repackaged as a patch (instructions available on the 
above-linked wiki page).

