Igorkim78 added a comment.
If you consider changing the collator configuration, note that the collator type should NOT be changed from its default value, ICU:

com.bigdata.btree.keys.KeyBuilder.collator=ICU

The other collator types, JDK and ASCII, are not usable here: JDK gives basically the same comparison results as ICU but generates much larger keys, and ASCII simply assumes the source text is ASCII and drops Unicode support completely.

As Stas mentioned, Blazegraph uses the ICU default collator strength. It depends on the locale of the literal, but is Tertiary in most cases (that is why it might behave differently if a lang tag is specified):

com.ibm.icu.text.Collator#getInstance(java.util.Locale)

Besides the default Tertiary, you have 4 strength options.
Ref: http://userguide.icu-project.org/collation/concepts#TOC-Comparison-Levels

Primary Level: Typically, this is used to denote differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character. This is also called the level-1 strength.

Secondary Level: Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). Other differences between letters can also be considered secondary differences, depending on the language. A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength. Note: In some languages (such as Danish), certain accented letters are considered to be separate base characters. In most languages, however, an accented letter only has a secondary difference from the unaccented version of that letter.

Tertiary Level (default in most cases): Upper and lower case differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs from the base form on the tertiary level (such as "A" and "Ⓐ"). Another example is the difference between large and small Kana. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called the level-3 strength.

Quaternary Level: When punctuation is ignored (see "Ignoring Punctuations" in the ICU guide) at levels 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a primary, secondary or tertiary difference. This is also known as the level-4 strength. The quaternary level should only be used if ignoring punctuation is required or when processing Japanese text (see "Hiragana processing" in the ICU guide).

Identical Level: When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the NFD form of each string are compared at this level, just in case there is no difference at levels 1-4. For example, Hebrew cantillation marks are only distinguished at this level. This level should be used sparingly, as only code point values differences between two strings is an extremely rare occurrence. Using this level substantially decreases the performance for both incremental comparison and sort key generation (as well as increasing the sort key length). It is also known as level-5 strength.

The Quaternary level might be sufficient for 'Abeŀlio' if the dot counts as punctuation here, but given the need to distinguish between ⑫ and ⓬, the only option to consider is Identical.

The strength can be adjusted by specifying this parameter in RWStore.properties:

com.bigdata.btree.keys.KeyBuilder.collator.strength=Identical

This will not update the configuration of existing journals; you would need a full reload. Watch out for the size of the resulting journal: it will be larger, but it is hard to estimate by how much.
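To get a feel for what the strength levels change, here is a minimal sketch. It uses the JDK's java.text.Collator instead of ICU4J purely so that it runs without extra dependencies; that is an assumption made for illustration, not what Blazegraph does internally (it uses ICU, as noted above), and the JDK collator has no Quaternary level. The class name CollatorStrengthDemo and the helper sameAt are hypothetical. Results for unusual characters such as ⑫ and ⓬ may differ between the JDK and ICU, so the sketch only prints that comparison without claiming a result.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthDemo {

    // Returns true when the two strings compare as equal at the given strength.
    // Uses the JDK collator as a stand-in for ICU (assumption for illustration).
    static boolean sameAt(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(strength);
        // Decompose precomposed accented letters (e.g. "à" into "a" + accent)
        // so accents are consistently treated as secondary differences.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        int[] strengths = {
            Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY, Collator.IDENTICAL
        };
        String[] names = {"Primary", "Secondary", "Tertiary", "Identical"};
        String[][] pairs = {
            {"a", "A"},            // case difference
            {"as", "\u00e0s"},     // accent difference: "as" vs "às"
            {"\u246b", "\u24ec"},  // ⑫ vs ⓬, the pair from this task
        };
        for (String[] pair : pairs) {
            for (int i = 0; i < strengths.length; i++) {
                System.out.printf("%-9s \"%s\" vs \"%s\" equal: %b%n",
                        names[i], pair[0], pair[1], sameAt(pair[0], pair[1], strengths[i]));
            }
        }
    }
}
```

With the JDK collator, "a" and "A" only become distinct at Tertiary, and "as" and "às" at Secondary, which mirrors the level descriptions quoted above. Whether a given pair is distinguished in Blazegraph still has to be checked against ICU with the configured strength.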
TASK DETAIL https://phabricator.wikimedia.org/T233204
