[Wikidata-bugs] [Maniphest] T233204: Mixup of unicode characters in Query Service

Igorkim78 Tue, 03 Nov 2020 09:03:25 -0800

Igorkim78 added a comment.


  If you consider changing the collator configuration, note that the collator 
type should NOT be changed from the default value, ICU:
  com.bigdata.btree.keys.KeyBuilder.collator=ICU
  The other collator type options are JDK and ASCII, but neither is usable 
here: JDK basically yields the same comparison results as ICU but generates 
much larger keys, and ASCII simply assumes the source text is ASCII and drops 
Unicode support completely.
  
  As Stas mentioned, Blazegraph uses the ICU default collator strength, which 
depends on the locale of the literal but is Tertiary in most cases (that's why 
it might behave differently if a lang tag is specified):
  com.ibm.icu.text.Collator#getInstance(java.util.Locale)
  
  You have four strength options besides the default, Tertiary:
  Ref: http://userguide.icu-project.org/collation/concepts#TOC-Comparison-Levels
  
  Primary Level: Typically, this is used to denote differences between base 
characters (for example, "a" < "b"). It is the strongest difference. For 
example, dictionaries are divided into different sections by base character. 
This is also called the level-1 strength.
  
  Secondary Level: Accents in the characters are considered secondary 
differences (for example, "as" < "às" < "at"). Other differences between 
letters can also be considered secondary differences, depending on the 
language. A secondary difference is ignored when there is a primary difference 
anywhere in the strings. This is also called the level-2 strength.
  Note: In some languages (such as Danish), certain accented letters are 
considered to be separate base characters. In most languages, however, an 
accented letter only has a secondary difference from the unaccented version of 
that letter.
  
  Tertiary Level (Default in most cases): Upper and lower case differences in 
characters are distinguished at the tertiary level (for example, "ao" < "Ao" < 
"aò"). In addition, a variant of a letter differs from the base form on the 
tertiary level (such as "A" and "Ⓐ"). Another example is the difference between 
large and small Kana. A tertiary difference is ignored when there is a primary 
or secondary difference anywhere in the strings. This is also called the 
level-3 strength.
  
  Quaternary Level: When punctuation is ignored (see Ignoring Punctuations) 
at level 1-3, an additional level can be used to distinguish words with and 
without punctuation (for example, "ab" < "a-b" < "aB"). This difference is 
ignored when there is a primary, secondary or tertiary difference. This is also 
known as the level-4 strength. The quaternary level should only be used if 
ignoring punctuation is required or when processing Japanese text (see Hiragana 
processing).
  
  Identical Level: When all other levels are equal, the identical level is used 
as a tiebreaker. The Unicode code point values of the NFD form of each string 
are compared at this level, just in case there is no difference at levels 1-4. 
For example, Hebrew cantillation marks are only distinguished at this level. 
This level should be used sparingly, as two strings differing only in code 
point values is an extremely rare occurrence. Using this level substantially 
decreases the performance for both incremental comparison and sort key 
generation (as well as increasing the sort key length). It is also known as 
level 5 strength.
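  
  To make the first three levels concrete, here is a minimal sketch that runs 
the example strings quoted above through a collator at different strengths. It 
uses the JDK's java.text.Collator as a stand-in so that it runs without extra 
dependencies; Blazegraph itself uses ICU4J, and the JDK collator has no 
Quaternary level, so treat this only as an illustration of the concept. The 
class name StrengthDemo and the helper compareAt are made up for this example.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {

    // Compare two strings at the given strength; returns -1, 0 or 1.
    // Uses the JDK collator (an assumption for illustration, not ICU4J).
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        // Decompose accented characters so precomposed forms compare
        // consistently with base letter + combining mark.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        // Base letters differ: visible at every strength.
        System.out.println("PRIMARY   a  vs b  : " + compareAt(Collator.PRIMARY, "a", "b"));
        // Accent difference: ignored at PRIMARY, visible from SECONDARY on.
        System.out.println("PRIMARY   as vs às : " + compareAt(Collator.PRIMARY, "as", "às"));
        System.out.println("SECONDARY as vs às : " + compareAt(Collator.SECONDARY, "as", "às"));
        // Case difference: ignored at SECONDARY, visible from TERTIARY on.
        System.out.println("SECONDARY ao vs Ao : " + compareAt(Collator.SECONDARY, "ao", "Ao"));
        System.out.println("TERTIARY  ao vs Ao : " + compareAt(Collator.TERTIARY, "ao", "Ao"));
    }
}
```

  A result of 0 means the two strings are treated as equal at that strength, 
which is the situation behind the mixups discussed in this task.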
  
  The Quaternary level might be sufficient for 'Abeŀlio' if the dot counts as 
punctuation here, but given the need to distinguish between ⑫ and ⓬, the 
only option to consider is Identical.
  
  The strength can be adjusted by specifying this parameter in 
RWStore.properties:
  com.bigdata.btree.keys.KeyBuilder.collator.strength=Identical
  
  This will not update the configuration of existing journals; you would need 
a full reload. Watch out for the size of the resulting journal: it will be 
larger, but it is hard to estimate by how much.
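  
  As a rough illustration of why stronger collation means larger keys, the 
sketch below prints the sort key length for one string at several strengths. 
It again uses the JDK's java.text.Collator rather than ICU4J, and the class 
name SortKeyLengthDemo and the helper keyLength are made up for this example. 
The byte counts will not match what Blazegraph writes to a journal; only the 
trend (stronger level, longer key) is the point.

```java
import java.text.Collator;
import java.util.Locale;

public class SortKeyLengthDemo {

    // Length in bytes of the sort key for `text` at the given strength.
    // Uses the JDK collator (an assumption for illustration, not ICU4J).
    static int keyLength(int strength, String text) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.getCollationKey(text).toByteArray().length;
    }

    public static void main(String[] args) {
        String sample = "Abellio";
        System.out.println("PRIMARY   : " + keyLength(Collator.PRIMARY, sample));
        System.out.println("TERTIARY  : " + keyLength(Collator.TERTIARY, sample));
        System.out.println("IDENTICAL : " + keyLength(Collator.IDENTICAL, sample));
    }
}
```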

TASK DETAIL
  https://phabricator.wikimedia.org/T233204

To: Igorkim78
Cc: Nikki, CamelCaseNick, Smalyshev, Aklapper, Lucas_Werkmeister_WMDE, 
Igorkim78, Gehel, Lea_Lacroix_WMDE, CBogen, Akuckartz, Nandana, Namenlos314, 
Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, 
Tobias1984, Manybubbles, Mbch331
