[Wikidata-bugs] [Maniphest] T360298: [Analytics] Public Superset dashboard pilot

AndrewTavis_WMDE Wed, 27 Mar 2024 09:30:31 -0700

AndrewTavis_WMDE added a comment.


  Following a long discussion about this in the `data-engineering-collab` 
channel on Slack, the general findings are:
  
  - The public Superset instance isn't suitable for this at this time, and 
there's no timetable for when it will be (see above comments)
  - A suggestion of putting this information on Wikistats 
<https://stats.wikimedia.org/#/all-projects> was agreed to be too complex to 
set up and manage
    - We would need to use AQS 2 (Analytics Query Service) to make a 
service/API for this
  - An initial suggestion from WMDE to target Prometheus with the DAG was 
decided against
    - It is possible to push data to Prometheus, but there are many 
complications with this
  - A new suggestion is to leverage Turnilo 
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Turnilo> for this
    - There is a private instance at turnilo.wikimedia.org 
<https://turnilo.wikimedia.org/>
    - There are also public instances of this as seen at 
wiki-search-referrals.wmcloud.org <https://wiki-search-referrals.wmcloud.org/>
      - Wikitech docs for this can be found at 
wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/referrer_daily/Dashboard 
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/referrer_daily/Dashboard>
      - The Turnilo dashboard is hosted on Cloud VPS 
<https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS>
      - The code for the Turnilo instance can be found at 
github.com/wikimedia/research-api-endpoint-template/turnilo-druid 
<https://github.com/wikimedia/research-api-endpoint-template/tree/turnilo-druid>
    - To achieve this, we would make the published datasets 
<https://analytics.wikimedia.org/published/datasets/> folder another target of 
the DAG jobs, and we'd then ingest this data via the Turnilo instance
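  
  As a rough illustration of the "another target of the DAG jobs" step, here 
is a minimal Python sketch of a final task that writes aggregated rows to a 
TSV file in an output directory. The function name `write_published_tsv`, the 
file name, and the column names are hypothetical, and the output directory 
only stands in for wherever the published datasets folder is synced from. The 
actual DAG setup and the Turnilo ingestion are not shown.

```python
import csv
from pathlib import Path


def write_published_tsv(rows, out_dir, filename="metrics.tsv"):
    """Write a list of dicts (all with the same keys) as a TSV file.

    `out_dir` is a stand-in for the directory that ends up in the
    published datasets folder; the real path is an assumption.
    """
    if not rows:
        raise ValueError("rows must contain at least one record")

    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Use the keys of the first row as the header.
    fieldnames = list(rows[0].keys())
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

  A flat, dated file like this keeps the DAG output independent of whichever 
dashboard tool ends up reading it.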
  
  This sounds like a good way forward, but the question of setting up the 
Turnilo instance and maintaining it then comes to mind. A big question is: how 
often are data pipelines supposed to be public, and would putting it all on a 
single Turnilo instance work well for our requirements?

TASK DETAIL
  https://phabricator.wikimedia.org/T360298

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: AndrewTavis_WMDE, Aklapper, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331

_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
