GoranSMilovanovic added a comment.

@abian @Lydia_Pintscher We have the results.

Method

  • The power law was estimated from the 27,394,027 WD items that are currently used across the Wikimedia websites;
  • that is approximately 50% of the items now present in WD (54,195,898 as of today);
  • the statistic from which the power law was estimated is the number of pages that make use of a particular item;
  • estimation procedures from the {poweRlaw} R package were used.
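
For readers without R at hand, the estimation step can be approximated as follows. This is only a rough sketch, not the {poweRlaw} procedure used here: it applies the closed-form approximation to the discrete maximum-likelihood estimate of alpha given by Clauset, Shalizi and Newman (2009), it takes xmin as given rather than estimating it, and the function name estimate_alpha and the sample counts are made up for illustration.

```python
import math

def estimate_alpha(page_counts, xmin):
    """Approximate the power-law scaling parameter for discrete data.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over the tail x_i >= xmin.
    xmin is assumed to be known; {poweRlaw} estimates it from the data instead.
    """
    tail = [x for x in page_counts if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum

# Hypothetical usage counts (pages per item); not the real Wikidata data.
sample_counts = [1, 2, 3, 9, 10, 12, 15, 20, 40, 120, 900]
print(estimate_alpha(sample_counts, xmin=9))
```

The approximation is generally considered reasonable only for moderately large xmin, so it should be read as a sanity check rather than a replacement for the package's estimators.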

Results

  • Power-law behavior cannot be excluded,
  • with an estimated scaling parameter (alpha) of 2.050451 (since alpha < 3, the distribution has infinite variance), and
  • an estimated xmin parameter of 9 (in effect, this means: the distribution for all items with usage frequency >= 9 is consistent with power-law behavior).
  • The following is the log(Rank) vs log(Pages) plot for all WD items with usage frequency >= 9 across the pages in our projects:

F28030400: logRank-logPages.png
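
The remark about infinite variance follows from a general property of power laws: the m-th moment is finite only when m < alpha - 1. A small check, using a hypothetical helper name, illustrates this for the reported estimate of alpha:

```python
def moment_is_finite(alpha, m):
    """Return True if the m-th moment of a power law p(x) ~ x^(-alpha) is finite.

    The moment converges only when m < alpha - 1.
    """
    return m < alpha - 1

alpha = 2.050451  # scaling parameter reported in the results
print(moment_is_finite(alpha, 1))  # mean: True, because alpha > 2
print(moment_is_finite(alpha, 2))  # variance: False, because alpha < 3
```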

Recommendation

  • Protect all items that are used on 9 or more pages across the Wikimedia websites.
  • There are 1,656,137 such items, which is only 3.06% of the total number of items in WD and only 6.05% of the WD items that are currently in use.
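
The two percentages can be reproduced from the counts quoted in this comment; the variable names below are chosen only for illustration:

```python
protected = 1_656_137   # items used on 9 or more pages
in_use = 27_394_027     # items currently in use across the Wikimedia websites
total = 54_195_898      # all items in WD at the time of writing

share_of_total = 100 * protected / total
share_of_in_use = 100 * protected / in_use

print(f"{share_of_total:.2f}% of all items")      # 3.06% of all items
print(f"{share_of_in_use:.2f}% of items in use")  # 6.05% of items in use
```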

Discussion

  • If you can automate this, protecting 1,656,137 items should not be a problem, I guess.
  • Currently, the list of items recommended for protection contains only item IDs and the number of pages that make use of them;
  • the list will be shared with @Lydia_Pintscher;
  • it would take some time/engineering to get the English labels in, and
  • a procedure that regenerates this list on a daily basis would take approx. 3 to 4 hours per run, but
  • it cannot be established on our infrastructure before we have R upgraded, see my request in T214598.

So, until we have R upgraded on our systems, I recommend you ask for an updated list whenever you need one.


TASK DETAIL
https://phabricator.wikimedia.org/T210664

