Re: [Wikitech-l] For title normalization, what characters are converted to uppercase ?

2019-08-03 Thread Yuri Astrakhan
Hi Nico, if possible, could your tool use the MW API to normalize titles?
It's a very quick API call, you can do multiple titles at once, and it will
save you a lot of grief over incompatibilities.
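
As a rough sketch of what such a call could look like (not taken from
WPCleaner; the function names, the frwiki endpoint, and the User-Agent
string are assumptions made for illustration), the action=query module
reports each title it had to change in a "normalized" list of from/to
pairs:

```python
import json
import urllib.parse
import urllib.request

# Assumption: the French Wikipedia endpoint, since the thread is about frwiki.
API_URL = "https://fr.wikipedia.org/w/api.php"


def build_query_url(titles, api_url=API_URL):
    """Build an action=query URL asking about several titles at once."""
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),  # the API separates titles with "|"
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def parse_normalized(response):
    """Return {original: normalized} for the titles the wiki changed."""
    entries = response.get("query", {}).get("normalized", [])
    return {entry["from"]: entry["to"] for entry in entries}


def normalize_titles(titles, api_url=API_URL):
    """Ask the wiki how it normalizes each title (needs network access)."""
    request = urllib.request.Request(
        build_query_url(titles, api_url),
        # Hypothetical User-Agent; a real tool should identify itself.
        headers={"User-Agent": "title-normalization-example/0.1"},
    )
    with urllib.request.urlopen(request) as reply:
        mapping = parse_normalized(json.load(reply))
    # Titles absent from "normalized" were already in normal form.
    return {title: mapping.get(title, title) for title in titles}
```

Calling normalize_titles(["ɽ", "paris"]) would then return whatever the
wiki itself decides, so the tool does not need its own copy of the
uppercasing rules.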
--Yuri

On Sat, Aug 3, 2019 at 10:57 AM Nicolas Vervelle wrote:

> Hello,
>
> On most wikis, MediaWiki is configured to convert the first letter of a
> title to uppercase, but apparently it does not convert every Unicode
> character: for example, on frwiki, ɽ is a different article than Ɽ, even
> though the second character is the uppercase version of the first one in
> Unicode.
>
> So, which characters are actually converted to uppercase by title
> normalization?
>
> I need this information to stop reporting some false positives in
> WPCleaner.
>
> Thanks, Nico
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] For title normalization, what characters are converted to uppercase ?

2019-08-03 Thread bawolff
MediaWiki uses PHP's mb_strtoupper.

I believe this applies the standard Unicode uppercase mapping. However, the
result can vary depending on the version of the Unicode data in use. We are
currently in the process of switching to PHP 7, but for the moment we are
still using HHVM's uppercasing code. There's a list of differences between
HHVM and PHP 7.2 uppercasing at
https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/Php72ToUpper.php
[All this is probably subject to change]

However, I am at a loss as to why HHVM and PHP < 5.6 [1] wouldn't map that
character, since the ɽ -> Ɽ mapping has been present since Unicode 5
(2006). I guess it was using really old Unicode data or something.
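
To illustrate why a client-side reimplementation can disagree with the
wiki, here is a minimal Python sketch. It is not MediaWiki's code: the
LEGACY_OVERRIDES table and the capitalize_first helper are hypothetical,
and the single entry only mirrors the ɽ case from this thread rather than
the actual contents of Php72ToUpper.php. Python's str.upper() also uses
its own bundled Unicode data and full case mapping, so it may differ from
mb_strtoupper for other characters.

```python
import unicodedata

# Hypothetical override table: characters that the wiki's older uppercasing
# code is assumed to leave unchanged. Illustrative only.
LEGACY_OVERRIDES = {"\u027d": "\u027d"}  # ɽ stays ɽ


def capitalize_first(title, overrides=LEGACY_OVERRIDES):
    """Uppercase the first character unless an override says otherwise."""
    if not title:
        return title
    first, rest = title[0], title[1:]
    return overrides.get(first, first.upper()) + rest


# The Unicode version behind str.upper() depends on the Python build.
print("Unicode data version:", unicodedata.unidata_version)
print(capitalize_first("\u027d"))                # with the override table
print(capitalize_first("\u027d", overrides={}))  # plain Unicode mapping
```

With an empty override table the helper follows the runtime's Unicode
data, which is exactly the part that can change between versions.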

See also bug T219279 [2]

--
Brian

[1] https://3v4l.org/GHt3b
[2] https://phabricator.wikimedia.org/T219279


[Wikitech-l] For title normalization, what characters are converted to uppercase ?

2019-08-03 Thread Nicolas Vervelle
Hello,

On most wikis, MediaWiki is configured to convert the first letter of a
title to uppercase, but apparently it does not convert every Unicode
character: for example, on frwiki, ɽ is a different article than Ɽ, even
though the second character is the uppercase version of the first one in
Unicode.

So, which characters are actually converted to uppercase by title
normalization?

I need this information to stop reporting some false positives in
WPCleaner.

Thanks, Nico