| Neil_P._Quinn_WMF added a comment. |
In T213597#4893605, @mpopov wrote:Here's the query I used, which I would like someone in #product-analytics (e.g. @chelsyx and @Neil_P._Quinn_WMF) to review:
Sure thing!
I noticed once big thing: it seems like your counts of file page edits (n_edits_total, n_additions_total, etc.) include the initial edit that creates the pages, so in the end you're getting the proportion of files which have metadata added in the first 2 months, including during the initial upload.
I tried excluding those initial creations (event_timestamp != page_creation_timestamp), and it looks like the proportion goes from 99% to 50%.
Query excluding intial creations
WITH summarized_revisions AS ( SELECT page_id, TO_DATE(page_creation_timestamp) AS creation_date, COUNT(1) AS n_edits, -- not including reverts or reverted SUM(IF(event_timestamp != page_creation_timestamp, 1, 0)) as n_later_edits, SUM(IF(revision_text_bytes_diff > 0 AND DATEDIFF(event_timestamp, page_creation_timestamp) <= 60 AND event_timestamp != page_creation_timestamp, 1, 0)) AS n_additions_2mo FROM wmf.mediawiki_history WHERE snapshot = '2018-12' AND wiki_db = 'commonswiki' AND page_creation_timestamp between "2018-10-01" and "2018-10-08" AND event_entity = 'revision' AND page_namespace = 6 AND NOT revision_is_identity_revert -- don't count edits that are reverts AND NOT revision_is_identity_reverted -- don't count edits that were reverted AND NOT revision_is_deleted -- don't counts edits moved to archive table AND page_id IS NOT NULL -- don't count deleted files GROUP BY page_id, TO_DATE(page_creation_timestamp) ) SELECT creation_date, COUNT(1) AS n_uploaded, -- files uploaded SUM(IF(n_later_edits > 0, 1, 0)) AS n_later_edited, -- files whose pages were edited after upload SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added after creation and in first 2 months FROM summarized_revisions GROUP BY creation_date;
creation_date n_uploaded n_later_edited n_added_to_2mo 2018-10-01 23390 13307 10248 2018-10-02 18226 11308 8947 2018-10-03 22763 16803 12142 2018-10-04 17455 12896 9088 2018-10-05 17321 11397 10261 2018-10-06 20191 12456 10558 2018-10-07 21479 11575 9853 Other comments
WITH summarized_revisions AS ( SELECT page_id, TO_DATE(page_creation_timestamp) AS creation_date, COUNT(1) AS n_edits_total, -- not including reverts or revertedI think this includes uploads of new file versions, not just metadata edits, but I don't think it would change the results much.
SUM(IF(revision_text_bytes_diff > 0, 1, 0)) AS n_additions_total, SUM(IF(DATEDIFF(event_timestamp, page_creation_timestamp) <= 60, 1, 0)) AS n_edits_2mo, SUM(IF(revision_text_bytes_diff > 0 AND DATEDIFF(event_timestamp, page_creation_timestamp) <= 60, 1, 0)) AS n_additions_2mo FROM wmf.mediawiki_history WHERE snapshot = '2018-12' AND wiki_db = 'commonswiki' AND event_entity = 'revision' AND page_namespace = 6 AND NOT revision_is_identity_revert -- don't count edits that are reverts AND NOT revision_is_identity_reverted -- don't count edits that were reverted AND NOT revision_is_deleted -- don't counts edits moved to archive table AND page_id IS NOT NULL -- don't count deleted filesI don't understand the point of this, since the NOT revision_is_deleted should have already removed deleted files. (Also the page_id isn't necessarily null for deleted pages; after all the MediaWiki archive table has ar_page_id.)
GROUP BY page_id, TO_DATE(page_creation_timestamp) ) SELECT creation_date, COUNT(1) AS n_total, -- files uploaded SUM(IF(n_edits_total > 0, 1, 0)) AS n_edited, -- files that have had metadata edited SUM(IF(n_additions_total > 0, 1, 0)) AS n_added_to, -- files that have had metadata added SUM(IF(n_edits_2mo > 0, 1, 0)) AS n_edited_2mo, -- files that have had metadata edited in first 2 months SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added in first 2 months FROM summarized_revisions GROUP BY creation_date;
Cc: Neil_P._Quinn_WMF, chelsyx, MNeisler, mpopov, kzimmerman, Ramsey-WMF, Abit, JKSTNK, Lahi, PDrouin-WMF, E1presidente, Cparle, Anooprao, SandraF_WMF, Tramullas, Acer, Silverfish, Susannaanas, Jane023, Wikidata-bugs, Base, matthiasmullie, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
