mpopov added a comment.
Okay, here are the numbers which were calculated with the following conditions:
- Using the December 2018 snapshot of MediaWiki History in the Data Lake
- Only files which have not been deleted are counted
- Only metadata revisions that were not reverted, were not themselves reverts, and were not deleted are counted
- "Metadata augmented w/in 1st 2mo" means there was at least 1 byte-adding revision to the file's page within the first 60 days after creation
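The last condition can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the code that produced the numbers: the function name, the tuple layout, and the sample revisions are hypothetical, and it assumes whole-day differences between calendar dates and that every revision has a known byte difference.

```python
from datetime import datetime

WINDOW_DAYS = 60  # "first 2 months" as defined in the conditions above


def augmented_within_window(upload_ts, revisions, window_days=WINDOW_DAYS):
    """Return True if any later revision added bytes within the window.

    `revisions` is an iterable of (timestamp, parent_id, bytes_diff) tuples
    (a hypothetical layout). parent_id == 0 marks the upload revision itself,
    which is not counted as an augmentation.
    """
    for ts, parent_id, bytes_diff in revisions:
        # Whole days between calendar dates, ignoring the time of day.
        days_since_upload = (ts.date() - upload_ts.date()).days
        if parent_id > 0 and bytes_diff > 0 and days_since_upload <= window_days:
            return True
    return False


# Made-up example data for one file page.
upload = datetime(2018, 1, 1, 12, 0)
revs = [
    (datetime(2018, 1, 1, 12, 0), 0, 500),    # the upload itself, ignored
    (datetime(2018, 2, 15, 9, 0), 101, -20),  # removes bytes, ignored
    (datetime(2018, 3, 2, 9, 0), 102, 35),    # adds bytes on day 60, counted
]
print(augmented_within_window(upload, revs))
```

A byte-adding edit on day 61 or later would not qualify the file under this rule.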
It looks like the baseline for % of files which have metadata added within the first 2 months is 45.6% overall:
| Files since 2003 | Metadata augmented w/in 1st 2mo (60d) | Proportion |
|---|---|---|
| 52,640,746 | 24,003,593 | 45.599% |
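For anyone wanting to double-check the baseline, the proportion is just the augmented count divided by the file count. A minimal sketch using the two figures from the table (the variable names are mine):

```python
files_total = 52_640_746      # files since 2003
files_augmented = 24_003_593  # metadata augmented within the first 60 days

proportion_pct = 100 * files_augmented / files_total
print(f"{proportion_pct:.3f}%")  # prints 45.599%
```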
Here are the final numbers, by upload year and then by month for 2018:
| Year | Files uploaded that year | Metadata augmented w/in 1st 2mo (60d) | Proportion |
|---|---|---|---|
| 2004 | 17,669 | 9,423 | 53.331% |
| 2005 | 265,976 | 108,449 | 40.774% |
| 2006 | 648,025 | 228,230 | 35.219% |
| 2007 | 1,205,884 | 371,729 | 30.826% |
| 2008 | 1,403,480 | 576,987 | 41.111% |
| 2009 | 1,927,836 | 822,061 | 42.642% |
| 2010 | 2,333,372 | 863,588 | 37.010% |
| 2011 | 3,884,635 | 1,287,972 | 33.156% |
| 2012 | 3,490,905 | 1,589,173 | 45.523% |
| 2013 | 4,591,272 | 2,007,547 | 43.725% |
| 2014 | 4,715,323 | 2,215,437 | 46.984% |
| 2015 | 5,683,966 | 2,990,535 | 52.614% |
| 2016 | 6,312,067 | 2,921,214 | 46.280% |
| 2017 | 8,182,236 | 3,623,897 | 44.290% |
| 2018 | 7,978,099 | 4,387,351 | 54.992% |

| Month | Files uploaded that month | Metadata augmented w/in 1st 2mo (60d) | Proportion |
|---|---|---|---|
| January 2018 | 652,863 | 322,246 | 49.359% |
| February 2018 | 705,945 | 399,709 | 56.620% |
| March 2018 | 784,484 | 358,703 | 45.725% |
| April 2018 | 609,520 | 276,230 | 45.319% |
| May 2018 | 714,875 | 414,765 | 58.019% |
| June 2018 | 588,235 | 363,863 | 61.857% |
| July 2018 | 650,022 | 409,261 | 62.961% |
| August 2018 | 783,718 | 515,037 | 65.717% |
| September 2018 | 817,719 | 436,632 | 53.396% |
| October 2018 | 563,806 | 296,135 | 52.524% |
| November 2018 | 573,655 | 363,017 | 63.281% |
| December 2018 | 533,257 | 231,753 | 43.460% |
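As a sanity check on the tables, the rows can be summed and compared against each other. The sketch below copies the counts from the tables above; the variable names are mine, and the reading of the one-file difference is an assumption rather than something verified against the data.

```python
# (files uploaded, files augmented within 60 days), copied from the yearly table
yearly = {
    2004: (17_669, 9_423),
    2005: (265_976, 108_449),
    2006: (648_025, 228_230),
    2007: (1_205_884, 371_729),
    2008: (1_403_480, 576_987),
    2009: (1_927_836, 822_061),
    2010: (2_333_372, 863_588),
    2011: (3_884_635, 1_287_972),
    2012: (3_490_905, 1_589_173),
    2013: (4_591_272, 2_007_547),
    2014: (4_715_323, 2_215_437),
    2015: (5_683_966, 2_990_535),
    2016: (6_312_067, 2_921_214),
    2017: (8_182_236, 3_623_897),
    2018: (7_978_099, 4_387_351),
}

# Same pair of counts for January through December 2018
monthly_2018 = [
    (652_863, 322_246), (705_945, 399_709), (784_484, 358_703),
    (609_520, 276_230), (714_875, 414_765), (588_235, 363_863),
    (650_022, 409_261), (783_718, 515_037), (817_719, 436_632),
    (563_806, 296_135), (573_655, 363_017), (533_257, 231_753),
]

yearly_uploaded_total = sum(uploaded for uploaded, _ in yearly.values())
yearly_augmented_total = sum(augmented for _, augmented in yearly.values())
monthly_uploaded_total = sum(uploaded for uploaded, _ in monthly_2018)
monthly_augmented_total = sum(augmented for _, augmented in monthly_2018)

print(yearly_uploaded_total, yearly_augmented_total)
print(monthly_uploaded_total, monthly_augmented_total)
print(monthly_uploaded_total == yearly[2018][0],
      monthly_augmented_total == yearly[2018][1])
```

The monthly rows add up to the 2018 row exactly, and the yearly augmented counts add up to the overall 24,003,593. The yearly upload counts add up to one file fewer than the overall 52,640,746, which would be consistent with a single file dated 2003 that has no row in the yearly table.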
Appendix

    USE wmf;

    WITH page_creation_timestamps AS (
      -- since page_creation_timestamp in mediawiki_history table is wrong:
      SELECT
        page_id,
        event_timestamp AS upload_timestamp
      FROM mediawiki_history
      WHERE snapshot = '${snapshot}'
        AND wiki_db = 'commonswiki'
        AND event_entity = 'revision'
        AND page_namespace = 6
        AND revision_parent_id = 0
        AND NOT revision_is_identity_revert -- don't count edits that are reverts
        AND NOT revision_is_identity_reverted -- don't count edits that were reverted
        AND NOT revision_is_deleted -- don't count edits moved to archive table
        AND page_id IS NOT NULL -- don't count deleted files
    ), fixed_revision_history AS (
      SELECT
        page_creation_timestamps.page_id AS page_id,
        upload_timestamp,
        event_timestamp AS revision_timestamp,
        revision_parent_id,
        revision_text_bytes_diff
      FROM page_creation_timestamps
      LEFT JOIN mediawiki_history ON (
        page_creation_timestamps.page_id = mediawiki_history.page_id
        AND mediawiki_history.snapshot = '${snapshot}'
        AND mediawiki_history.wiki_db = 'commonswiki'
        AND NOT mediawiki_history.revision_is_identity_revert -- don't count edits that are reverts
        AND NOT mediawiki_history.revision_is_identity_reverted -- don't count edits that were reverted
        AND NOT mediawiki_history.revision_is_deleted -- don't count edits moved to archive table
      )
    ), summarized_revisions AS (
      SELECT
        page_id,
        TO_DATE(upload_timestamp) AS creation_date,
        COUNT(1) AS n_edits,
        SUM(IF(revision_parent_id > 0, 1, 0)) AS n_later_edits,
        SUM(IF(revision_text_bytes_diff > 0
               AND DATEDIFF(revision_timestamp, upload_timestamp) <= 60
               AND revision_parent_id > 0, 1, 0)) AS n_additions_2mo
      FROM fixed_revision_history
      GROUP BY page_id, TO_DATE(upload_timestamp)
    )
    SELECT
      creation_date,
      COUNT(1) AS n_uploaded, -- files uploaded
      SUM(IF(n_later_edits > 0, 1, 0)) AS n_later_edited, -- files whose pages were edited after upload
      SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added after creation and in first 2 months
    FROM summarized_revisions
    GROUP BY creation_date;
