mpopov added a comment.

  Thanks @MPhamWMF!
  
  What Mike and David said is correct. Also, this ticket prompted me to finally 
add the decommission notice to the dashboard (previously it was only on the 
homepage).
  
  In T292152#7391826 <https://phabricator.wikimedia.org/T292152#7391826>, 
@Lydia_Pintscher wrote:
  
  > In the meantime for my talk: Do we know what the current number is?
  
  For 2021-09-30:
  
  | Path                          | "Automated" | "User" | Total |
  | ----------------------------- | ----------- | ------ | ----- |
  | /                             | 2109        | 2290   | 4399  |
  | /bigdata/ldf                  | 4           | 55230  | 55234 |
  | /bigdata/namespace/wdq/sparql | 1835762           | 5786966      | 7622728  
   |
  |
  
  Anyone with private data access can easily count 1 day's requests using Hue 
<https://hue.wikimedia.org/> and this Hive query (slightly modified from 
https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/golden/+/refs/heads/master/modules/metrics/wdqs/basic_usage):
  
    USE wmf;
    SELECT
      year, month, day,
      IF(uri_path = '/sparql', '/bigdata/namespace/wdq/sparql', uri_path) AS 
path,
      UPPER(http_status IN('200','304')) as http_success,
      CASE
        WHEN (
          agent_type = 'user' AND (
            user_agent RLIKE 'https?://'
            OR INSTR(user_agent, 'www.') > 0
            OR INSTR(user_agent, 'github') > 0
            OR LOWER(user_agent) RLIKE 
'([a-z0-9._%-]+@[a-z0-9.-]+\.(com|us|net|org|edu|gov|io|ly|co|uk))'
            OR (
              user_agent_map['browser_family'] = 'Other'
              AND user_agent_map['device_family'] = 'Other'
              AND user_agent_map['os_family'] = 'Other'
              )
            )
        ) OR agent_type = 'spider' THEN 'TRUE'
        ELSE 'FALSE' END AS is_automata,
      COUNT(*) AS events
    FROM wmf.webrequest
    WHERE
      webrequest_source = 'text'
      AND year = ${year} AND month = ${month} AND day = ${day}
      AND uri_host = 'query.wikidata.org'
      AND uri_path IN('/', '/bigdata/namespace/wdq/sparql', '/bigdata/ldf', 
'/sparql')
    GROUP BY
      year, month, day,
      IF(uri_path = '/sparql', '/bigdata/namespace/wdq/sparql', uri_path),
      UPPER(http_status IN('200','304')),
      CASE
        WHEN (
          agent_type = 'user' AND (
            user_agent RLIKE 'https?://'
            OR INSTR(user_agent, 'www.') > 0
            OR INSTR(user_agent, 'github') > 0
            OR LOWER(user_agent) RLIKE 
'([a-z0-9._%-]+@[a-z0-9.-]+\.(com|us|net|org|edu|gov|io|ly|co|uk))'
            OR (
              user_agent_map['browser_family'] = 'Other'
              AND user_agent_map['device_family'] = 'Other'
              AND user_agent_map['os_family'] = 'Other'
              )
            )
        ) OR agent_type = 'spider' THEN 'TRUE'
        ELSE 'FALSE' END;
  
  **I would NOT recommend querying an entire month with 1 query** since it uses 
webrequest data which **should be queried 1 day at a time at most**. Also, the 
query uses non-standard "automata" determination. At the time (those years ago) 
I thought it was clever, but these days I would not use those rules and if I 
had infinite time I would switch to 
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection

TASK DETAIL
  https://phabricator.wikimedia.org/T292152

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mpopov
Cc: SWakiyama, MPhamWMF, dcausse, mpopov, Zbyszko, Aklapper, Lydia_Pintscher, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
_______________________________________________
Wikidata-bugs mailing list -- [email protected]
To unsubscribe send an email to [email protected]

Reply via email to