[Wikidata-bugs] [Maniphest] T334558: [Analytics] Unique user-agents accessing Wikidata's REST API
mforns added a comment. @AndrewTavis_WMDE Hi! I think you could go with simply `wmde`. The analytics prefix in product_analytics exists because the team is named like that. In your case, you could use `wmde` I think. BTW this is the task to create the WMDE Airflow instance: T340648 <https://phabricator.wikimedia.org/T340648> TASK DETAIL https://phabricator.wikimedia.org/T334558 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AndrewTavis_WMDE, mforns Cc: mforns, xcollazo, Ottomata, lbowmaker, WMDE-leszek, AndrewTavis_WMDE, Michael, Manuel, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T314131: Some reliability metrics missing since June 20th '22
mforns added a comment. Yes, if we had implemented the DAG differently, re-running would be a task that Airflow users could easily do! However, this particular DAG (and a couple others) follow a pattern that makes it difficult to re-run partially. We plan to change those DAGs to a better structure and add the documentation to our Airflow developer guide <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow/Developer_guide>. TASK DETAIL https://phabricator.wikimedia.org/T314131 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Aklapper, Michael, Astuthiodit_1, EChetty, BTullis, karapayneWMDE, Invadibot, Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T314131: Some reliability metrics missing since June 20th '22
mforns added a comment. I've created a task to specifically tackle the back-filling: https://phabricator.wikimedia.org/T321838 TASK DETAIL https://phabricator.wikimedia.org/T314131 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Aklapper, Michael, Astuthiodit_1, EChetty, BTullis, karapayneWMDE, Invadibot, Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T314131: Some reliability metrics missing since June 20th '22
mforns added a comment. Hi @Michael! Yes, we will back-fill as much as we can. I have to talk to the team tomorrow to see how we want to approach that, since that particular Airflow DAG is not easy to re-run partially... I'll keep you posted! TASK DETAIL https://phabricator.wikimedia.org/T314131 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Aklapper, Michael, Astuthiodit_1, EChetty, BTullis, karapayneWMDE, Invadibot, Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T314131: Some reliability metrics missing since June 20th '22
mforns added a comment.

I've looked into this and I think I found what's happening. The metrics query does gather the data correctly, but the metrics do not reach Graphite. The HiveToGraphite Spark job fails when sending the metrics to Graphite, because the values of the metrics are doubles:

    22/09/07 00:33:14 ERROR HiveToGraphite: java.lang.Double cannot be cast to java.lang.Long. Failed to send message to Graphite.

HiveToGraphite expects the metric values to be longs, not doubles. Although the queries do not explicitly specify the double type, my suspicion is that the `percentile_approx` calculation in some metrics outputs a double:

    percentile_approx(time_firstbyte, 0.5) as metric_count,

After the `UNION` statement this affects all the results: all the metric values become doubles, since they share the same column. But maybe I'm wrong! In any case, we have to modify the query file to make sure the output values are compatible with the type `long`.

---

PS: One open question is why Spark finished with final app status `SUCCEEDED` when it recorded a `HiveToGraphite ERROR`. We in Data Engineering should look into that as well.

TASK DETAIL https://phabricator.wikimedia.org/T314131 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Aklapper, Michael, Astuthiodit_1, EChetty, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
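The suspected type widening in the comment above can be illustrated outside Spark. What follows is only a minimal sketch in plain Python, not the HiveToGraphite code and not the real query: `union_column_type` and `to_long` are hypothetical helpers, the branch names are made up, and the widening rule simply encodes the suspicion stated above (one double-valued branch turns the shared column into doubles).

```python
def union_column_type(branch_types):
    """Model the suspicion above: if any UNION branch yields a double,
    the shared column is a double for every row."""
    return "double" if "double" in branch_types else "bigint"


def to_long(value):
    """Coerce a metric value to an integer, truncating toward zero,
    so that a consumer expecting longs can accept it."""
    return int(value)


# Hypothetical branches: plain counts are integers, the percentile is a double.
branches = {"request_count": "bigint", "median_time_firstbyte": "double"}
print(union_column_type(branches.values()))

# Hypothetical rows as they would look after the widening.
rows = [("request_count", 1200.0), ("median_time_firstbyte", 87.6)]
print([(name, to_long(value)) for name, value in rows])
```

In the query itself, the equivalent step would be an explicit cast of the percentile expression to an integer type. Whether truncating or rounding is the right choice for these metrics is left open here.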
[Wikidata-bugs] [Maniphest] T314131: Some reliability metrics missing since June 20th '22
mforns claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T314131 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Aklapper, Michael, Astuthiodit_1, EChetty, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy
mforns added a comment. TASK DETAIL https://phabricator.wikimedia.org/T290303 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: phuedx, mforns Cc: mforns, phuedx, Manuel, Addshore, Ottomata, awight, Lydia_Pintscher, Aklapper, Michael, Hellket777, LisafBia6531, Astuthiodit_1, ntsako, 786, Biggs657, karapayneWMDE, Invadibot, Universal_Omega, maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, darthmon_wmde, Kent7301, holger.knust, joker88john, CucyNoiD, Nandana, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, GoranSMilovanovic, QZanden, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, Wikidata-bugs, aude, GWicke, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy
mforns added a comment. Thanks a lot @phuedx! TASK DETAIL https://phabricator.wikimedia.org/T290303 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: phuedx, Manuel, Addshore, Ottomata, awight, Lydia_Pintscher, Aklapper, Michael, Hellket777, LisafBia6531, Astuthiodit_1, ntsako, 786, BTullis, Biggs657, karapayneWMDE, Invadibot, Universal_Omega, maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, darthmon_wmde, Kent7301, holger.knust, joker88john, CucyNoiD, Nandana, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, GoranSMilovanovic, QZanden, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, Wikidata-bugs, aude, GWicke, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy
mforns updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T290303 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: phuedx, Manuel, Addshore, Ottomata, awight, Lydia_Pintscher, Aklapper, Michael, Hellket777, Astuthiodit_1, ntsako, 786, BTullis, Biggs657, karapayneWMDE, Invadibot, maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, darthmon_wmde, Kent7301, holger.knust, joker88john, CucyNoiD, Nandana, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, GoranSMilovanovic, QZanden, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, Wikidata-bugs, aude, GWicke, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T299059: Write an Airflow job converting commons structured data dump to Hive
mforns added a project: Airflow. TASK DETAIL https://phabricator.wikimedia.org/T299059 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Snwachukwu, mforns Cc: Cparle, nettrom_WMF, Miriam, Nuria, cchen, AKhatun_WMF, JAllemandou, ntsako, EChetty, toberto, ldelench_wmf, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, Base, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy
mforns added projects: Data-Engineering, Data-Engineering-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T290303 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Manuel, Addshore, Ottomata, awight, Lydia_Pintscher, Aklapper, Michael, EChetty, Invadibot, maantietaja, Akuckartz, 4748kitoko, darthmon_wmde, holger.knust, Nandana, Akovalyov, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, JAllemandou, terrrydactyl, Wikidata-bugs, aude, GWicke, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy
mforns updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T290303 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Manuel, Addshore, Ottomata, awight, Lydia_Pintscher, Aklapper, Michael, Suran38, Biggs657, Invadibot, Lalamarie69, maantietaja, Juan90264, Alter-paule, Beast1978, Un1tY, Akuckartz, 4748kitoko, Hook696, darthmon_wmde, Kent7301, holger.knust, joker88john, CucyNoiD, Nandana, Akovalyov, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, GoranSMilovanovic, QZanden, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, JAllemandou, terrrydactyl, Wikidata-bugs, aude, GWicke, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy
mforns added a comment.

@Michael Hi! I'm going to migrate this schema during the next couple of weeks. I need to ask you a couple of questions about it.

1. Do you need to collect IP or geocode information together with this schema? The legacy EventLogging system collects them by default, but in the new system we only collect them if necessary. Please let me know!
2. Is the instrumentation that generates this data in the front end (JS), or is it in the back end (PHP)?

Cheers!

TASK DETAIL https://phabricator.wikimedia.org/T290303 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Manuel, Addshore, Ottomata, awight, Lydia_Pintscher, Aklapper, Michael, Invadibot, maantietaja, Akuckartz, 4748kitoko, darthmon_wmde, holger.knust, Nandana, Akovalyov, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, JAllemandou, terrrydactyl, Wikidata-bugs, aude, GWicke, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T262942: PoC on anomaly detection with Flink
mforns edited projects, added Analytics-Radar; removed Analytics. TASK DETAIL https://phabricator.wikimedia.org/T262942 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Aklapper, Ottomata, Gehel, dcausse, CDanis, Zbyszko, CBogen, Akuckartz, 4748kitoko, darthmon_wmde, Nandana, Namenlos314, Akovalyov, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, JAllemandou, terrrydactyl, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T236895: ArticlePlaceholder dashboard stopped tracking page views
mforns added a comment. We've deployed the patch, now. It has already started to crunch data starting at 2020-01-01. It will take a couple hours to backfill up to today. TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Ladsgroup, mforns Cc: mforns, Milimetric, Ladsgroup, Nuria, JAllemandou, elukey, Addshore, Aklapper, Lydia_Pintscher, Alter-paule, Hazizibinmahdi, Beast1978, Un1tY, 4748kitoko, Hook696, Daryl-TTMG, RomaAmorRoma, E.S.A-Sheild, Iflorez, darthmon_wmde, alaa_wmde, Meekrab2012, joker88john, CucyNoiD, Nandana, NebulousIris, Akovalyov, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, Gq86, Af420, Darkminds3113, Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, cmadeo, LawExplorer, WSH1906, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, terrrydactyl, Wikidata-bugs, aude, jayvdb, Ricordisamoa, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Unblock] T238878: Data about how many file pages on Commons contain at least one structured data element
mforns closed subtask T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T238878 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Triaged] T239565: Create reportupdater reports that execute SDC requests
mforns triaged this task as "High" priority. TASK DETAIL https://phabricator.wikimedia.org/T239565 WORKBOARD https://phabricator.wikimedia.org/project/board/11/ EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Milimetric, mforns Cc: Abit, Ramsey-WMF, kzimmerman, Addshore, matthiasmullie, gsingers, Mayakp.wiki, Ladsgroup, nettrom_WMF, Cparle, Nuria, Milimetric, mpopov, 4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs
mforns added a comment. @diego Hi! Is there anything additional for us in Analytics here? Thanks! TASK DETAIL https://phabricator.wikimedia.org/T215616 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: mforns, Marostegui, Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, alaa_wmde, Nandana, Akovalyov, Banyek, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, Avner, _jensen, rosalieper, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T127467: Finding items on Wikidata that should be merged
mforns added a comment. @MichaelSchoenitzer_WMDE Oh, cool. Yeah, definitely useful. Thanks! TASK DETAIL https://phabricator.wikimedia.org/T127467 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: mforns, Lahi, MichaelSchoenitzer_WMDE, Ladsgroup, Esc3300, Liuxinyu970226, matej_suchanek, Bugreporter, Ricordisamoa, Aklapper, StudiesWorld, Lydia_Pintscher, samuwmde, Gq86, Vacio, GoranSMilovanovic, QZanden, LawExplorer, Culex, Wikidata-bugs, aude, Alchimista, Mbch331 ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T127467: Finding items on Wikidata that should be merged
mforns added a comment.

Hey! As @Ladsgroup knows, I worked on this task during the BCN Hackathon. It was super-interesting and I learned a lot about Wikidata :] Thanks for the opportunity! Here's a summary of what I did, the issues I had, and next steps:

After a while of reading docs and understanding the basics, I wrote a small bash script that extracts Wikidata items from the dump in /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz, abridges their contents to id, type, labels and sitelinks, and finally splits them into 1M-line files, to be processed in hdfs/hadoop in a distributed way. The script is:

    nice -n19 ionice -c2 -n7 sh -c "
      zcat /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz |
      head -n -1 | tail -n +2 | sed 's/,$//' |
      jq -c 'select(.type == \"item\") | {id, labels: .labels | [keys[] as \$k | [\$k, .[\$k].value]], sitelinks: .sitelinks | [keys[] as \$k | [\$k, .[\$k].title]]}' |
      split -l 1000000 - ~/wikidata_items_abridged_20180514/part_"

Then I compressed each file separately (hadoop can only distribute computation over compressed files if they are compressed separately) and moved them to hdfs: /user/mforns/wikidata_items_abridged_20180514. Actually, I only moved 5 of the 49 files, to avoid computing over the whole data set while developing. The rest are ready in stat1005:/home/mforns/wikidata_items_abridged_20180514 and can be copied over at any time.

I also wrote a spark/scala script that reads the item files in hdfs and processes them to find duplicate candidates. The logic identifies items that have identical labels for at least one language, or identical sitelinks for at least one site. Labels or sitelinks of different languages/sites are not compared.
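To make the matching rule concrete, here is a minimal single-machine sketch in Python. It is an illustration only, not the cluster job: `duplicate_candidates`, its `min_matches` parameter and the toy items are hypothetical, and the item IDs are placeholders rather than references to real Wikidata entities.

```python
from collections import Counter, defaultdict
from itertools import combinations


def duplicate_candidates(items, min_matches=1):
    """Return {(id_a, id_b): matches} for items that share a label in the
    same language or a sitelink on the same site."""
    groups = defaultdict(set)
    for item_id, labels, sitelinks in items:
        # Labels and sitelinks are keyed separately, so a language code can
        # never collide with a site name.
        for lang, text in labels.items():
            groups[("label", lang, text)].add(item_id)
        for site, title in sitelinks.items():
            groups[("sitelink", site, title)].add(item_id)

    matches = Counter()
    for ids in groups.values():
        # Every pair of items inside one group counts as one match.
        for pair in combinations(sorted(ids), 2):
            matches[pair] += 1
    return {pair: n for pair, n in matches.items() if n >= min_matches}


# Toy data: (id, labels by language, sitelinks by site).
items = [
    ("Q1", {"en": "Example", "de": "Beispiel"}, {"enwiki": "Example"}),
    ("Q2", {"en": "Example", "de": "Beispiel"}, {}),
    ("Q3", {"en": "Other", "de": "Example"}, {}),
]
print(duplicate_candidates(items))
```

Here the first two items match on two labels, while the third does not match either of them: its German label equals their English label, and different languages are never compared.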
As this is executed in the cluster using spark RDDs (resilient distributed datasets), the algorithm can compare all Wikidata items against themselves and output a graph, where the vertices are item IDs (Q12345) and an edge means that two vertices have identical labels/sitelinks. The weight of the edge corresponds to the number of matches in labels/sitelinks between both vertices (items). Here's the code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.SparkSession

    // (item id, labels by language, sitelinks by site)
    type Item = (String, Map[String, String], Map[String, String])

    def parseItems(
        sourceDirectory: String,
        spark: SparkSession
    ): RDD[Item] = {
        val schema = StructType(Seq(
            StructField("id", StringType, nullable = false),
            StructField("type", StringType, nullable = false),
            StructField("labels", ArrayType(ArrayType(StringType)), nullable = false),
            StructField("sitelinks", ArrayType(ArrayType(StringType)), nullable = false)
        ))
        val items = spark.read.schema(schema).json(sourceDirectory + "/*").rdd
        items.map(r => (
            r.getString(0),
            r.getSeq(2).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap,
            r.getSeq(3).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap
        ))
    }

    val items = parseItems("/user/mforns/wikidata_items_abridged_20180514", spark)

    // One (language or site, text, item id) triple per label and sitelink;
    // texts of 2 characters or fewer are dropped.
    val expressions = items.flatMap { item =>
        (
            item._2.map(label => (label._1, label._2, item._1)) ++
            item._3.map(sitelink => (sitelink._1, sitelink._2, item._1))
        ).filter(e => e._2.size > 2)
    }

    // Group item ids by (language or site, text); keep groups with more than one item.
    val expressionGroups = (expressions
        .keyBy(e => (e._1, e._2))
        .groupByKey
        .map(g => (g._1, g._2.map(_._3).toSeq.sortBy(id => id)))
        .filter(g => g._2.size > 1))

    // Every pair inside a group is an edge; the weight is the number of groups sharing that pair.
    val explodedEdges = expressionGroups.flatMap(g => g._2.combinations(2))
    val weightedEdges = explodedEdges.keyBy(e => e).groupByKey.map(g => (g._1, g._2.size))
    val edges = weightedEdges.filter(e => e._2 > 1)
    edges.map(e => e._1(0) + "\t" + e._1(1) + "\t" + e._2).saveAsTextFile("/user/mforns/duplicate_candidates")

The output looks like this (you can access it in hdfs under /user/mforns/duplicate_candidates):

    Q7545947 Q7545948 4
    Q2581746 Q3779054 2
    Q32850943 Q32851055 2
    Q32498252 Q804060 2
    Q4451724 Q4451776 5
    ...

Finally, I wrote a python script to read that output on a single machine and calculate the graph's connected components. I haven't tested it, but here it is:

    import networkx as nx
    import sys

    G = nx.Graph()
    with open(sys.argv[1], 'r') as input_file:
        for line in input_file:
            # The Spark job writes tab-separated fields; split() accepts any whitespace.
            v1, v2, w = line.split()
            G.add_edge(v1, v2, weight=int(w))

    for component in nx.connected_components(G):
        print(component)

This should return all groups of items that are likely to be duplicates (same-label/sitelink duplicates, that is).

Issues

If you look at the duplicate_candidates files, you can quickly identify false positives. I found 2 types:

Disambiguation pages: Th
[Wikidata-bugs] [Maniphest] [Updated] T191022: Add Wikidata website extract oozie job
mforns set the point value for this task to "8". TASK DETAIL https://phabricator.wikimedia.org/T191022 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Jonas, mforns Cc: Smalyshev, Nuria, gerritbot, JAllemandou, Jonas, Aklapper, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, Darkminds3113, Bsandipan, Lordiis, GoranSMilovanovic, lisong, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, LawExplorer, Lewizho99, Maathavan, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Triaged] T187296: Increase kafka event retention to 14 or 21 days
mforns triaged this task as "Low" priority. TASK DETAIL https://phabricator.wikimedia.org/T187296 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: mforns, elukey, Ottomata, Aklapper, Nuria, Ladsgroup, Pchelolo, JAllemandou, Smalyshev, Lahi, Gq86, Vacio, Darkminds3113, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Avner, Gehel, Jonas, FloNight, Xmlizer, Nirmos, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, Krenair, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T187296: Increase kafka event retention to 14 or 21 days
mforns added a comment. We'll have this on our radar, until things are stable. TASK DETAIL https://phabricator.wikimedia.org/T187296 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: mforns, elukey, Ottomata, Aklapper, Nuria, Ladsgroup, Pchelolo, JAllemandou, Smalyshev, Lahi, Gq86, Vacio, Darkminds3113, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Avner, Gehel, Jonas, FloNight, Xmlizer, Nirmos, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, Krenair, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T143819: Data request for logs from SparQL interface at query.wikidata.org
mforns added a comment.

@Nuria @Smalyshev

> So probably if we round timestamp and remove sessionId your proposal for dataset #1 is safe to keep long term (cc @mforns for anything I might be missing)

I think it depends highly on how drastically we sanitize the potentially identifying fields (user agent and client IP) and the fields that can indicate user activity/features (query, location). Intuitively, it seems to me that we can keep this data in a private store indefinitely if it is sanitized. But having those 4 sensitive fields in the same data set will make it difficult to publish, even if sanitized.

I don't know how frequent WDQS queries are, but I imagine they are several orders of magnitude less frequent than pageviews. Thus the buckets of this data set are likely to be sparse and small, which increases the threat to user privacy. If we wanted to make this public, I'd go for removing the geographic location field entirely, and probably for daily or monthly resolution instead of hourly (depending on bucket size). Also, splitting the data set into several unrelatable thematic data sets could help: queries by country, queries by user agent, session queries, etc.

Sorry if I'm too pessimistic, I'm not familiar with the kind of information that WDQS queries can give away about users.

TASK DETAIL https://phabricator.wikimedia.org/T143819 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: mforns Cc: mforns, PokestarFan, Nuria, Lydia_Pintscher, mkroetzsch, leila, debt, thiemowmde, Jonas, Smalyshev, AndrewSu, Aklapper, I9606, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Avner, Gehel, FloNight, Xmlizer, JAllemandou, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
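The sanitization ideas in the comment above (coarser time resolution, dropping location and session fields, one thematic data set at a time, attention to bucket size) could be sketched roughly as follows. This is a hedged illustration only: the record field names, the ISO 8601 timestamp format, the `min_bucket_size` threshold and both helper functions are assumptions made for the example, not part of any existing WDQS pipeline. Real user agent strings would likely need coarsening as well.

```python
from collections import Counter


def sanitize(record):
    """Keep only a day-resolution timestamp and the user agent.
    Session id, client IP, query text and location are dropped entirely."""
    # Assumes an ISO 8601 timestamp such as "2017-05-01T13:45:10Z".
    return (record["timestamp"][:10], record["user_agent"])


def daily_buckets(records, min_bucket_size):
    """Count queries per (day, user agent) and suppress buckets that are
    smaller than min_bucket_size, since small buckets are the risky ones."""
    counts = Counter(sanitize(r) for r in records)
    return {key: n for key, n in counts.items() if n >= min_bucket_size}


# Hypothetical records for a "queries by user agent" data set.
records = [
    {"timestamp": "2017-05-01T09:00:00Z", "user_agent": "bot/1.0", "session_id": "a", "country": "ES"},
    {"timestamp": "2017-05-01T13:45:10Z", "user_agent": "bot/1.0", "session_id": "b", "country": "DE"},
    {"timestamp": "2017-05-01T23:59:59Z", "user_agent": "bot/1.0", "session_id": "c", "country": "US"},
    {"timestamp": "2017-05-01T10:30:00Z", "user_agent": "browser/2.0", "session_id": "d", "country": "FR"},
]
print(daily_buckets(records, min_bucket_size=3))
```

With a threshold of 3, the single `browser/2.0` query is suppressed and only the `bot/1.0` bucket survives. The right threshold depends on the actual bucket sizes, which the comment above leaves as an open question.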
[Wikidata-bugs] [Maniphest] [Changed Project Column] T134426: Review shared data namespace (tabular data) implementation
mforns moved this task from Incoming to Radar on the Analytics board. TASK DETAIL https://phabricator.wikimedia.org/T134426 WORKBOARD https://phabricator.wikimedia.org/project/board/11/ EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Yurik, mforns Cc: Danny_B, DannyH, StudiesWorld, Steinsplitter, Aklapper, Lydia_Pintscher, ekkis, Matanya, MarkTraceur, JEumerus, Thryduulf, Milimetric, MZMcBride, Bawolff, -jem-, gerritbot, Pokefan95, TerraCodes, intracer, ThurnerRupert, brion, Jdforrester-WMF, Eloy, TheDJ, Yurik, Zppix, Riley_Huntley, D3r1ck01, Izno, Luke081515, JAllemandou, Wikidata-bugs, aude, El_Grafo, Ricordisamoa, Shizhao, fbstj, Fabrice_Florin, Mbch331, Jay8g, Krenair, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs