Nikki added a comment.
I discovered the "all-titles" dumps a few days ago and realised I could use
them to find page names containing \p{C} or \p{Z}. (I've put the commands I
used in P44829 <https://phabricator.wikimedia.org/P44829>)
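The actual commands are in the paste above; as a rough illustration of the same kind of scan, here is a hypothetical Python sketch that flags titles containing any character in the Unicode "Other" (C*) or "Separator" (Z*) general categories:

```python
import unicodedata

def has_c_or_z(title: str) -> bool:
    """True if the title contains any character in the Unicode 'Other' (C*)
    or 'Separator' (Z*) general categories (e.g. 'Cf', 'Cn', 'Co', 'Zs')."""
    return any(unicodedata.category(ch)[0] in "CZ" for ch in title)

# One page name per line, as in the all-titles dumps (spaces are stored
# as underscores there, so any \p{Z} hit is a genuine oddity):
titles = ["Foo_Bar", "Foo\u200cBar"]   # the second contains a ZWNJ (Cf)
flagged = [t for t in titles if has_c_or_z(t)]
```

(Note that which characters count as \p{Cn} depends on the Unicode tables bundled with the Python build, which matches the "need to upgrade something" caveat below.)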
There are about 1.1 million page names with those characters:
- Almost all are \p{Cf}
- ~21k have unassigned characters (\p{Cn}, list: P44820
<https://phabricator.wikimedia.org/P44820>) (some of these were assigned in
Unicode 15; I probably need to upgrade something)
- ~10k have private-use area characters (\p{Co}, list: P44816
<https://phabricator.wikimedia.org/P44816>)
- ~4k have control characters (\p{Cc}, list: P44812
<https://phabricator.wikimedia.org/P44812>)
- There are no \p{Z} or \p{Cs}
Of the \p{Cf} characters:
- ~1 million have zero-width non-joiners
- ~30k have zero-width joiners (list: P44807
<https://phabricator.wikimedia.org/P44807>)
- ~30k have zero-width spaces (list: P44806
<https://phabricator.wikimedia.org/P44806>)
- ~2k have soft hyphens (list: P44801
<https://phabricator.wikimedia.org/P44801>)
- ~1k have byte-order marks/zero-width non-breaking spaces (list: P44803
<https://phabricator.wikimedia.org/P44803>)
- ~1k have word joiners (list: P44802
<https://phabricator.wikimedia.org/P44802>)
- ~500 have tags (list: P44804 <https://phabricator.wikimedia.org/P44804>)
- ~500 have other \p{Cf} characters (list: P44822
<https://phabricator.wikimedia.org/P44822>)
Looking at that, I would definitely include zero-width space too.
The tag characters are used for some emoji flags, and we do have some
sitelinks in Wikidata that use them (Q65300420
<https://www.wikidata.org/wiki/Q65300420>, Q100587671
<https://www.wikidata.org/wiki/Q100587671>), but I think individual tag
characters (when not part of a recognised sequence) are not considered
printable characters, so it's probably better not to decode those.
Private-use area characters do appear in some sitelinks (e.g. Q33061193
<https://www.wikidata.org/wiki/Q33061193>), but whether they display properly
depends on whether a compatible font is used, so there's probably limited
benefit to decoding those.
Word joiner <https://en.wikipedia.org/wiki/Word_joiner> characters are the
proper way to encode a zero-width non-breaking space (to prevent breaking at
that point). I think those would be fine to decode.
Soft hyphens <https://en.wikipedia.org/wiki/Soft_hyphen> indicate where long
words can break and appear in some particularly long page names (e.g. Q101
<https://www.wikidata.org/wiki/Q101>), so not decoding them is
counter-productive (e.g. this query's results
<https://query.wikidata.org/#select%20*%20%7B%20?sitelink%20schema:about%20wd:Q101;%20schema:inLanguage%20%22ia%22%20%7D>
wouldn't scroll horizontally if the soft hyphen were decoded).
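To illustrate with a made-up title (not one from the dump): a soft hyphen round-trips through UTF-8 percent-encoding as %C2%AD, and decoding it in the displayed sitelink is exactly what gives renderers the invisible break opportunity inside the word:

```python
from urllib.parse import quote, unquote

# U+00AD SOFT HYPHEN is percent-encoded as %C2%AD (its UTF-8 bytes).
title = "Anti\u00addisestablishmentarianism"   # hypothetical example title
encoded = quote(title)
assert "%C2%AD" in encoded

# Decoding restores the soft hyphen, so a long title can wrap at that
# point instead of forcing horizontal scrolling.
decoded = unquote(encoded)
assert decoded == title
```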
TASK DETAIL
https://phabricator.wikimedia.org/T327514
_______________________________________________
Wikidata-bugs mailing list -- [email protected]