Nikki added a comment.

  I discovered the "all-titles" dumps a few days ago and realised I could use 
them to find page names containing \p{C} or \p{Z}. (I've put the commands I 
used in P44829 <https://phabricator.wikimedia.org/P44829>.)
  
  There are about 1.1 million page names with those characters:
  
  - Almost all are \p{Cf}
  - ~21k have unassigned characters (\p{Cn}, list: P44820 
<https://phabricator.wikimedia.org/P44820>) (some of these were assigned in 
Unicode 15, I probably need to upgrade something)
  - ~10k have private-use area characters (\p{Co}, list: P44816 
<https://phabricator.wikimedia.org/P44816>)
  - ~4k have control characters (\p{Cc}, list: P44812 
<https://phabricator.wikimedia.org/P44812>)
  - There are no \p{Z} or \p{Cs}
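
  The kind of scan behind the counts above can be approximated with Python's 
stdlib (the actual commands are in P44829; this is just a sketch, and the dump 
file name in the comment is an assumption):

```python
import sys
import unicodedata
from collections import Counter

# All of \p{C} (Cc, Cf, Cn, Co, Cs) and \p{Z} (Zs, Zl, Zp).
SUSPECT = {"Cc", "Cf", "Cn", "Co", "Cs", "Zs", "Zl", "Zp"}


def suspect_categories(title):
    """Return the suspect Unicode general categories occurring in a title.

    Titles in the all-titles dumps use underscores rather than spaces,
    so ordinary spaces (Zs) won't produce false positives here.
    """
    return {unicodedata.category(ch) for ch in title} & SUSPECT


def scan(lines):
    """Count how many titles contain each suspect category."""
    counts = Counter()
    for line in lines:
        for cat in suspect_categories(line.rstrip("\n")):
            counts[cat] += 1
    return counts


if __name__ == "__main__":
    # e.g.  zcat enwiki-latest-all-titles.gz | python3 scan_titles.py
    for cat, n in scan(sys.stdin).most_common():
        print(cat, n)
```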
  
  Of the \p{Cf} characters:
  
  - ~1 million have zero-width non-joiners
  - ~30k have zero-width joiners (list: P44807 
<https://phabricator.wikimedia.org/P44807>)
  - ~30k have zero-width spaces (list: P44806 
<https://phabricator.wikimedia.org/P44806>)
  - ~2k have soft hyphens (list: P44801 
<https://phabricator.wikimedia.org/P44801>)
  - ~1k have byte-order marks/zero-width non-breaking spaces (list: P44803 
<https://phabricator.wikimedia.org/P44803>)
  - ~1k have word joiners (list: P44802 
<https://phabricator.wikimedia.org/P44802>)
  - ~500 have tags (list: P44804 <https://phabricator.wikimedia.org/P44804>)
  - ~500 have other \p{Cf} characters (list: P44822 
<https://phabricator.wikimedia.org/P44822>)
  
  Looking at that, I would definitely include zero-width space too.
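
  For reference, the \p{Cf} buckets above correspond to well-known code points 
(a sketch; the bucket names are mine, not from the paste titles):

```python
import unicodedata

# Standard Unicode code points for each bucket in the breakdown above.
ZWNJ, ZWJ, ZWSP = "\u200c", "\u200d", "\u200b"
SOFT_HYPHEN, BOM, WORD_JOINER = "\u00ad", "\ufeff", "\u2060"
TAG_RANGE = range(0xE0000, 0xE0080)  # tag characters U+E0000..U+E007F


def cf_bucket(ch):
    """Classify a single character into the \\p{Cf} buckets listed above."""
    if unicodedata.category(ch) != "Cf":
        return None  # not a format character at all
    if ch == ZWNJ:
        return "zero-width non-joiner"
    if ch == ZWJ:
        return "zero-width joiner"
    if ch == ZWSP:
        return "zero-width space"
    if ch == SOFT_HYPHEN:
        return "soft hyphen"
    if ch == BOM:
        return "byte-order mark"
    if ch == WORD_JOINER:
        return "word joiner"
    if ord(ch) in TAG_RANGE:
        return "tag"
    return "other Cf"
```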
  
  The tag characters are used for some emoji flags, and we do have some 
sitelinks in Wikidata which use them (Q65300420 
<https://www.wikidata.org/wiki/Q65300420>, Q100587671 
<https://www.wikidata.org/wiki/Q100587671>), but I think individual tag 
characters (when not part of a recognised sequence) are not considered 
printable characters, so it's probably better not to decode those.
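
  The distinction (tag characters fine inside a recognised flag sequence, 
suspicious on their own) could be checked roughly like this. It's a simplified 
sketch: real validation per UTS #51 would also check that the tag letters 
spell a valid region, which this doesn't attempt:

```python
import re

# An emoji tag sequence for subdivision flags is U+1F3F4 WAVING BLACK FLAG,
# then tag spec characters U+E0020..U+E007E, terminated by U+E007F CANCEL TAG.
TAG_SEQ = re.compile("\U0001F3F4[\U000E0020-\U000E007E]+\U000E007F")
TAG_CHAR = re.compile("[\U000E0000-\U000E007F]")


def has_stray_tag_chars(title):
    """True if the title has tag characters outside a flag-like sequence."""
    # Remove complete flag sequences, then look for leftover tag characters.
    return bool(TAG_CHAR.search(TAG_SEQ.sub("", title)))
```

For example, the `gbsct` (Scotland) flag sequence passes, while a lone tag 
letter does not.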
  
  Private-use area characters do appear in some sitelinks (e.g. Q33061193 
<https://www.wikidata.org/wiki/Q33061193>), but whether they display properly 
depends on whether a compatible font is used, so there's probably limited 
benefit to decoding those.
  
  Word joiner <https://en.wikipedia.org/wiki/Word_joiner> characters are the 
proper way to encode a zero-width non-breaking space (to prevent breaking at 
that point). I think those would be fine to decode.
  
  Soft hyphens <https://en.wikipedia.org/wiki/Soft_hyphen> indicate where long 
words can break and are used in some particularly long page names 
(e.g. Q101 <https://www.wikidata.org/wiki/Q101>), so not decoding them is 
counter-productive (e.g. this query's results 
<https://query.wikidata.org/#select%20*%20%7B%20?sitelink%20schema:about%20wd:Q101;%20schema:inLanguage%20%22ia%22%20%7D>
 wouldn't scroll horizontally if the soft hyphen were decoded).
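
  To illustrate (a hypothetical title fragment, not the actual Q101 title): 
a soft hyphen percent-encodes as %C2%AD in a sitelink URI, and decoding it 
restores the invisible break opportunities:

```python
from urllib.parse import unquote

# U+00AD SOFT HYPHEN is %C2%AD when percent-encoded as UTF-8.
encoded = "Pneumono%C2%ADultra%C2%ADmicroscopic"
decoded = unquote(encoded)

assert "\u00ad" in decoded  # the soft hyphens are back
assert decoded.replace("\u00ad", "") == "Pneumonoultramicroscopic"
```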

TASK DETAIL
  https://phabricator.wikimedia.org/T327514
