mpopov added a comment.
Thanks @MPhamWMF! What Mike and David said is correct. Also, this ticket prompted me to finally add the decommission notice to the dashboard (previously it was only on the homepage). In T292152#7391826 <https://phabricator.wikimedia.org/T292152#7391826>, @Lydia_Pintscher wrote: > In the meantime for my talk: Do we know what the current number is? For 2021-09-30: | Path | "Automated" | "User" | Total | | ----------------------------- | ----------- | ------ | ----- | | / | 2109 | 2290 | 4399 | | /bigdata/ldf | 4 | 55230 | 55234 | | /bigdata/namespace/wdq/sparql | 1835762 | 5786966 | 7622728 | | Anyone with private data access can easily count 1 day's requests using Hue <https://hue.wikimedia.org/> and this Hive query (slightly modified from https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/golden/+/refs/heads/master/modules/metrics/wdqs/basic_usage): USE wmf; SELECT year, month, day, IF(uri_path = '/sparql', '/bigdata/namespace/wdq/sparql', uri_path) AS path, UPPER(http_status IN('200','304')) as http_success, CASE WHEN ( agent_type = 'user' AND ( user_agent RLIKE 'https?://' OR INSTR(user_agent, 'www.') > 0 OR INSTR(user_agent, 'github') > 0 OR LOWER(user_agent) RLIKE '([a-z0-9._%-]+@[a-z0-9.-]+\.(com|us|net|org|edu|gov|io|ly|co|uk))' OR ( user_agent_map['browser_family'] = 'Other' AND user_agent_map['device_family'] = 'Other' AND user_agent_map['os_family'] = 'Other' ) ) ) OR agent_type = 'spider' THEN 'TRUE' ELSE 'FALSE' END AS is_automata, COUNT(*) AS events FROM wmf.webrequest WHERE webrequest_source = 'text' AND year = ${year} AND month = ${month} AND day = ${day} AND uri_host = 'query.wikidata.org' AND uri_path IN('/', '/bigdata/namespace/wdq/sparql', '/bigdata/ldf', '/sparql') GROUP BY year, month, day, IF(uri_path = '/sparql', '/bigdata/namespace/wdq/sparql', uri_path), UPPER(http_status IN('200','304')), CASE WHEN ( agent_type = 'user' AND ( user_agent RLIKE 'https?://' OR INSTR(user_agent, 'www.') > 0 OR INSTR(user_agent, 'github') > 0 OR LOWER(user_agent) RLIKE '([a-z0-9._%-]+@[a-z0-9.-]+\.(com|us|net|org|edu|gov|io|ly|co|uk))' OR ( user_agent_map['browser_family'] = 'Other' AND user_agent_map['device_family'] = 'Other' AND user_agent_map['os_family'] = 'Other' ) ) ) OR agent_type = 'spider' THEN 'TRUE' ELSE 'FALSE' END; **I would NOT recommend querying an entire month with 1 query** since it uses webrequest data which **should be queried 1 day at a time at most**. Also, the query uses non-standard "automata" determination. At the time (those years ago) I thought it was clever, but these days I would not use those rules and if I had infinite time I would switch to https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection TASK DETAIL https://phabricator.wikimedia.org/T292152 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mpopov Cc: SWakiyama, MPhamWMF, dcausse, mpopov, Zbyszko, Aklapper, Lydia_Pintscher, Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
_______________________________________________ Wikidata-bugs mailing list -- [email protected] To unsubscribe send an email to [email protected]
