[Wikidata-bugs] [Maniphest] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers
JAllemandou added a comment. No objection :) I'd have gone for option 1 as it seems the easiest to maintain, but I agree, it means installing some stuff to the blazegraph machines. TASK DETAIL https://phabricator.wikimedia.org/T349069 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dcausse, JAllemandou Cc: Daniel_Mietchen, JAllemandou, dr0ptp4kt, bking, BTullis, dcausse, Aklapper, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Dringsim, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers
JAllemandou added a comment.

I would suggest using the `hdfs-rsync` tool to do this. It requires some setup in puppet, but it is helpful because it copies only new content from folders (see https://github.com/wikimedia/operations-puppet/blob/1c4d67ff19372832484f7551dc49836be5806024/modules/hdfs_tools/manifests/hdfs_rsync_job.pp and https://github.com/wikimedia/operations-puppet/blob/1c4d67ff19372832484f7551dc49836be5806024/modules/dumps/manifests/web/fetches/stats.pp).

TASK DETAIL https://phabricator.wikimedia.org/T349069
[Wikidata-bugs] [Maniphest] T336361: [Analytics] Identify access from mobile vs. desktop devices
JAllemandou added a comment.

> However, my assumption is that when only filtering for agent_type != 'spider' the population will still include a lot of non-UI hits.

The `agent_type` field can currently take 3 values: `spider`, `automated` and `user`. The `spider` value is used when user-agents identify themselves as bots, the `automated` value is used when we heuristically classify the traffic as automatically generated (big volume), and the rest falls under the `user` value. There is indeed still some non-user traffic being flagged as `user`.

TASK DETAIL https://phabricator.wikimedia.org/T336361
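To make the difference between the two filters concrete, here is a minimal sketch. The sample rows and the helper names `non_spider` and `user_only` are made up for illustration; only the three `agent_type` values come from the comment above.

```python
from collections import Counter

# Hypothetical sample rows, standing in for real request data.
ROWS = [
    {"agent_type": "spider", "uri": "/wiki/A"},
    {"agent_type": "automated", "uri": "/wiki/B"},
    {"agent_type": "user", "uri": "/wiki/C"},
    {"agent_type": "user", "uri": "/wiki/D"},
]


def non_spider(rows):
    """Filter used in the quoted comment: agent_type != 'spider'."""
    return [r for r in rows if r["agent_type"] != "spider"]


def user_only(rows):
    """Stricter filter: keep only traffic flagged as 'user'."""
    return [r for r in rows if r["agent_type"] == "user"]


def count_by_agent_type(rows):
    """Number of rows per agent_type value."""
    return Counter(r["agent_type"] for r in rows)


if __name__ == "__main__":
    print(count_by_agent_type(non_spider(ROWS)))
    print(count_by_agent_type(user_only(ROWS)))
```

Filtering on `agent_type != 'spider'` keeps the `automated` rows, while filtering on `agent_type = 'user'` drops them. As noted above, even the `user` bucket can still contain some non-user traffic.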
[Wikidata-bugs] [Maniphest] T342416: Set data permission on new snapshot generation (discovery.wikibase_rdf)
JAllemandou added a comment.

In T342416#9101868 <https://phabricator.wikimedia.org/T342416#9101868>, @EBernhardson wrote:

> These are both generated by spark. The rdf is being imported by a scala application while the cirrus dump is imported by pyspark, but they should both be using the same underlying implementation. Both applications use `df.write.insertInto(table_name)` to instruct spark to do the actual output. I'm a bit surprised they end up generating different sets of permissions.
>
> I suppose it's not super important why the cirrus dump is world readable, it's fine to be readable, it just hints to me that there is something I don't understand about hdfs/spark/permissions happening here.

Mwarf, wrong guess :) Interesting nonetheless. Let me know if you would like to pair on this.

TASK DETAIL https://phabricator.wikimedia.org/T342416
[Wikidata-bugs] [Maniphest] T342416: Set data permission on new snapshot generation (discovery.wikibase_rdf)
JAllemandou added a comment.

In T342416#9091146 <https://phabricator.wikimedia.org/T342416#9091146>, @EBernhardson wrote:

> I looked into these, the attached patch should fix it but it leaves an open question (@JAllemandou):
>
> The `core-site.xml`, along with puppet which writes it out, has the default umask of 027 since at least 2021, which prevents world readability. So why do we have the following permissions for historical dumps:
>
> drwxr-xr-x /wmf/data/discovery/wikidata/rdf/date=20230710
> drwxr-xr-x /wmf/data/discovery/wikidata/rdf/date=20230716
> drwxr-xr-x /wmf/data/discovery/wikidata/rdf/date=20230717
> drwxr-x--- /wmf/data/discovery/wikidata/rdf/date=20230723
> drwxr-x--- /wmf/data/discovery/wikidata/rdf/date=20230724
> drwxr-x--- /wmf/data/discovery/wikidata/rdf/date=20230730
> drwxr-x--- /wmf/data/discovery/wikidata/rdf/date=20230731
> drwxr-x--- /wmf/data/discovery/wikidata/rdf/date=20230806

The world-readable changes were made manually by me to unblock @AndrewTavis_WMDE. I logged my change in the analytics IRC chan but didn't ping on the search IRC chan. I should have, please excuse me on this :)

> Similarly we have other jobs that still run today and emit world readable dumps without explicitly setting the umask, what is causing the difference?
>
> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806

My guess about those is that they are still generated by a Hive job. Hive and spark behave differently with regard to permissions when generating files: spark uses the configured umask, while hive reproduces the parent-dir pattern.

I'd be interested to know whether my guess is correct :)

TASK DETAIL https://phabricator.wikimedia.org/T342416
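For readers less familiar with umasks, the permission strings in the listings above can be reproduced with a bit of arithmetic. This is a minimal sketch of POSIX-style umask masking only; it assumes a requested directory mode of 777 and says nothing about how Hive or spark pick their modes internally. The helper name `effective_dir_mode` is made up for the example.

```python
import stat


def effective_dir_mode(umask: int, requested: int = 0o777) -> str:
    """Return the ls-style permission string of a directory created
    with `requested` mode under the given umask.

    The umask clears bits: effective = requested AND NOT umask.
    """
    mode = requested & ~umask
    return stat.filemode(stat.S_IFDIR | mode)


if __name__ == "__main__":
    for umask in (0o027, 0o022, 0o002):
        print(f"umask {umask:03o} -> {effective_dir_mode(umask)}")
```

Under these assumptions a umask of 027 yields `drwxr-x---`, which matches the recent `rdf` snapshots, while `drwxr-xr-x` and `drwxrwxr-x` would correspond to umasks of 022 and 002 respectively (or to permissions set by other means, such as the manual change mentioned above).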
[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3
JAllemandou added a comment.

We met this morning with @AndrewTavis_WMDE and @Manuel. Thank you folks for the great meeting. The detailed meeting notes are here: https://docs.google.com/document/d/1REsolXnZf2KqApL0p-DE8X4eWXI_zxHgrCe3k1hcZnw

From the job list in the previous comment:
 - 4 don't run spark and are kept as-is: `WMDE_BannerImpressions`, `Wiktionary_CognateDashboard`, `2021_WMDE_Mitmachen_Bereich_2021_Campaign`, `WDCM_Sqoop_Clients`
 - 3 are stopped (crontab commented): `Qurator_CuriousFacts`, `WDCM_EngineBiases`, `WD_PageviewsPerType`
 - 3 have been updated to run spark2 in fixed-resource mode, so they should not fail after the migration to the spark3-shuffler: `WD_UsageCoverage`, `WD_languagesLandscape`, `NewEditors_comprehensive_report`

With those changes, this task no longer blocks the migration to the spark3-shuffler :)

TASK DETAIL https://phabricator.wikimedia.org/T334951
[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3
JAllemandou added a comment.

Hi @AndrewTavis_WMDE, I've done some investigation, and here is what I have: Goran has 10 CRON jobs running from various hosts on our system (1 on `stat1004`, 2 on `stat1007`, 7 on `stat1008`).
 - `WDCM_Sqoop_Clients` runs on `stat1004` weekly - It doesn't run spark (but Sqoop)
 - `2021_WMDE_Mitmachen_Bereich_2021_Campaign` runs on `stat1007` daily - It doesn't run spark (but Hive)
 - `WD_PageviewsPerType` runs on `stat1007` daily but has been failing since February 17th - It runs a spark job
 - `WD_UsageCoverage` runs on `stat1008` daily - It runs a spark job
 - `WD_languagesLandscape` runs on `stat1008` monthly (30th of the month) - It runs a spark job
 - `Wiktionary_CognateDashboard` runs on `stat1008` daily - It doesn't run spark
 - `WDCM_EngineBiases` runs on `stat1008` weekly - It runs a spark job
 - `Qurator_CuriousFacts` runs on `stat1008` monthly (10th of the month) - It runs a spark job
 - `WMDE_BannerImpressions` runs on `stat1008` hourly - It doesn't run spark (but Hive)
 - `NewEditors_comprehensive_report` runs on `stat1008` daily - It runs a spark job

We need to meet and talk about your usage of the data generated by those scripts, and see which ones you want us to try to keep working and which ones can be stopped. I'm booking some time on your calendar next Monday :)

TASK DETAIL https://phabricator.wikimedia.org/T334951
[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3
JAllemandou added a comment.

In T334951#8952583 <https://phabricator.wikimedia.org/T334951#8952583>, @AndrewTavis_WMDE wrote:

> - If the answer to the above question of permanently losing some data that's being produced by Concepts Monitor and other WMDE jobs is no, then we're ok with option one above of stopping the job.

Unfortunately I am not knowledgeable at all about the data generated by the job, which prevents me from assessing whether it generates data that we would not be able to regenerate. Also, I have not been told about intermediary data stored on the cluster, which makes me think that all the data generated by the job is small enough to be saved for the reports only. But as stated before, those are uninformed ideas :(

> - Aside from this we'd prefer option two of configuring it to use fixed-resource.

We can test that :)

TASK DETAIL https://phabricator.wikimedia.org/T334951
[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3
JAllemandou added a comment. In T334951#8946790 <https://phabricator.wikimedia.org/T334951#8946790>, @AndrewTavis_WMDE wrote: > I'll async with him now and see if we can come to a decision sooner than that, but you all will have the answer by Wednesday at the latest 😊 Awesome, thank you :) TASK DETAIL https://phabricator.wikimedia.org/T334951
[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3
JAllemandou added a comment.

Hi Folks - What is the status on this one? I'd like Data-Engineering to announce the deprecation of Spark2 for the end of this month, but not without knowing how we plan on tackling your job :)

Here are the possible solutions I can think of:
 - Stopping the job while it is revamped to spark3 (knowing that the dashboard is broken, is it a possible solution?)
 - Configuring the job to use fixed-resource instead of DynamicAllocation, making the job work in spark2 despite spark2 being deprecated, but using more cluster resources than really needed
 - Postponing the deprecation of spark2 (I'd be super happy if we could avoid that :))

Let me know your thoughts :)

TASK DETAIL https://phabricator.wikimedia.org/T334951
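As an illustration of the fixed-resource option, here is a minimal sketch of the Spark properties involved. The property names are standard Spark configuration keys; the helper names and the resource values (executor count, cores, memory) are placeholders chosen for the example, not the settings used for the WMDE jobs.

```python
def fixed_resource_conf(executors: int, cores: int, memory: str) -> dict:
    """Spark properties that disable dynamic allocation and pin resources.

    The values passed in are illustrative; real jobs need sizing.
    """
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": memory,
    }


def to_submit_args(conf: dict) -> list:
    """Turn a property dict into `--conf key=value` command-line arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = fixed_resource_conf(executors=8, cores=2, memory="4g")
    print(" ".join(to_submit_args(conf)))
```

The trade-off is the one described above: with a fixed number of executors the job holds its resources for its whole lifetime, even in phases where it would need fewer.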
[Wikidata-bugs] [Maniphest] T303831: Productionize Wikidata subgraph analysis
JAllemandou added a comment. In T303831#8237323 <https://phabricator.wikimedia.org/T303831#8237323>, @EBernhardson wrote: > data cleanup looks to now have run successfully Thanks a lot @EBernhardson for finalizing on this :) TASK DETAIL https://phabricator.wikimedia.org/T303831
[Wikidata-bugs] [Maniphest] T303831: Productionize Wikidata subgraph analysis
JAllemandou added a comment.

In T303831#8175252 <https://phabricator.wikimedia.org/T303831#8175252>, @EBernhardson wrote:

> @JAllemandou The one remaining piece of this ticket is cleaning up the historical data, per T303831#8081172 <https://phabricator.wikimedia.org/T303831#8081172>. Any suggestions on how we should manage dropping old data from tables partitioned by a snapshot column?

The way we currently do this is with this script: https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-mediawiki-snapshots

It works differently from the generic `refinery-drop-older-than` script, in that it lists all the datasets to clean and then applies the deletion. It's possible to add the datasets you need to delete in there; it shouldn't be complicated.

TASK DETAIL https://phabricator.wikimedia.org/T303831
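The "list everything first, then apply the deletion" approach can be sketched as below. This is not the code of `refinery-drop-mediawiki-snapshots`: the retention rule (keep the N most recent snapshots), the table names and the snapshot values are all assumptions made for the example, and the deletion step is replaced by a print.

```python
def snapshots_to_drop(snapshots, keep):
    """Return the snapshots to delete, keeping the `keep` most recent ones.

    Assumes snapshot names sort chronologically as strings
    (for instance zero-padded dates).
    """
    ordered = sorted(snapshots)
    if keep <= 0:
        return ordered
    return ordered[:-keep]


def build_plan(datasets, keep):
    """First pass: list what would be deleted for every dataset."""
    return {name: snapshots_to_drop(snaps, keep) for name, snaps in datasets.items()}


def apply_plan(plan):
    """Second pass: apply the deletion (printed here instead of executed)."""
    for name, snaps in plan.items():
        for snap in snaps:
            print(f"would drop {name} snapshot={snap}")


if __name__ == "__main__":
    # Hypothetical datasets and snapshot values.
    datasets = {
        "example_db.table_a": ["20220801", "20220808", "20220815", "20220822"],
        "example_db.table_b": ["20220815", "20220822"],
    }
    apply_plan(build_plan(datasets, keep=2))
```

Separating the plan from its application makes it easy to review the full list of partitions before anything is removed.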
[Wikidata-bugs] [Maniphest] T303831: Productionize Wikidata subgraph analysis
JAllemandou added a comment. Thanks a lot @EBernhardson for the help on finishing this! TASK DETAIL https://phabricator.wikimedia.org/T303831
[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake
JAllemandou closed subtask T299059: Write an Airflow job converting commons structured data dump to Hive as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T258834
[Wikidata-bugs] [Maniphest] T299059: Write an Airflow job converting commons structured data dump to Hive
JAllemandou closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T299059
[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake
JAllemandou closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T258834
[Wikidata-bugs] [Maniphest] T252443: Create dashboard to show growth of structured data on Commons over time
JAllemandou closed subtask T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T252443
[Wikidata-bugs] [Maniphest] T300240: Missing Wikidata RDF (ttl and nt) dumps for 20220117
JAllemandou added a comment. Thank you for letting us know :) TASK DETAIL https://phabricator.wikimedia.org/T300240
[Wikidata-bugs] [Maniphest] T299059: Write an Airflow job converting commons structured data dump to Hive
JAllemandou created this task. JAllemandou added projects: Product-Analytics, Structured-Data-Backlog, Wikidata-Query-Service, Wikidata, Data-Engineering, Discovery-Search (Current work), Patch-For-Review, Data-Engineering-Kanban. Restricted Application removed a project: Patch-For-Review.

TASK DESCRIPTION

The airflow job should
 - be run weekly on Mondays.
 - wait for source data to be available:
   - source folder is of form `hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json/MMDD`
   - source folder contains a file named `_IMPORTED` when the source data has been successfully imported in the folder
 - run a spark job reading the source data and writing it to hive
   - the spark job is in the `refinery-job.jar` archive, we need to have it as a dependency for the job
   - the spark job class is `org.wikimedia.analytics.refinery.job.structureddata.jsonparse.JsonDumpConverter`
   - main parameters of the job are the input folder, the output hive table and the snapshot (time partition) being created. The output hive table will be `structured_data.commons_entity` and the `snapshot` will be in the form `-MM-DD`. See the class for the detailed list of parameters :)

TASK DETAIL https://phabricator.wikimedia.org/T299059
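The "run on Mondays, but only once the `_IMPORTED` flag is present" logic can be sketched as follows. This is plain Python rather than an Airflow DAG, and it uses a local directory as a stand-in for the HDFS source folder; the function names are made up for the example, and the folder is passed in as-is so that no date format is assumed.

```python
import datetime
import tempfile
from pathlib import Path

FLAG_NAME = "_IMPORTED"  # flag file name taken from the task description


def source_is_ready(source_folder: Path) -> bool:
    """True once the import flag file exists in the source folder."""
    return (source_folder / FLAG_NAME).is_file()


def should_run(run_date: datetime.date, source_folder: Path) -> bool:
    """The job runs on Mondays, and only when the source data is complete."""
    is_monday = run_date.weekday() == 0
    return is_monday and source_is_ready(source_folder)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        monday = datetime.date(2022, 1, 17)
        print(should_run(monday, folder))  # flag not written yet
        (folder / FLAG_NAME).touch()
        print(should_run(monday, folder))  # flag present
```

In an actual Airflow job the flag check would typically be done by a sensor polling HDFS, with the spark job as the downstream task.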
[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake
JAllemandou added a comment.

Code is ready:
 - Import `commons-mediainfo` json dumps to HDFS (https://gerrit.wikimedia.org/r/738874)
 - Update spark transformation job to work with both wikidata and commons dumps (https://gerrit.wikimedia.org/r/739129)
 - Update `wikidata_entity` table creation script and oozie job for the new fields added by the patch above (https://gerrit.wikimedia.org/r/c/analytics/refinery/+/740589)
 - Add `commons_entity` table creation script (https://gerrit.wikimedia.org/r/c/analytics/refinery/+/740590)
 - Update spark transformation job to write directly to a hive table instead of to files (https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/747508/)

What we need after having merged/deployed the above is:
 - A new airflow job for the `commons_entity` data generation
 - A migration of the `wikidata_entity` oozie job to Airflow

TASK DETAIL https://phabricator.wikimedia.org/T258834
[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake
JAllemandou added a project: Data-Engineering-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T258834
[Wikidata-bugs] [Maniphest] T291205: Analysis: Property usage by items' P31
JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T291205
[Wikidata-bugs] [Maniphest] T291205: Analysis: Property usage by items' P31
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION

It would be useful to understand how properties are used by different content subgraphs (for instance humans, scholarly articles etc). It would allow us to better understand how properties used in a certain query context can be affected performance-wise by their usage in other contexts. For instance, the `main-topic` property when used for books could suffer from the property being widely used for scholarly-articles (a huge subgraph).

This analysis would use the P31 <https://phabricator.wikimedia.org/P31> values of items to try to cluster items into groups (maybe we could do even better by using P279 <https://phabricator.wikimedia.org/P279>?), and we would count property usage by group to do further analysis.

TASK DETAIL https://phabricator.wikimedia.org/T291205
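A minimal sketch of the counting step described above: group items by their P31 values, then count how often each property is used within each group. The triples are a tiny in-memory sample with illustrative identifiers (the item IDs and the use of `P921` for the topic property are assumptions for the example); a real analysis would run over the full triples dataset, most likely in spark.

```python
from collections import Counter, defaultdict


def property_usage_by_group(triples, instance_of="P31"):
    """Count property usage per P31 group.

    `triples` is an iterable of (subject, property, object) tuples.
    An item with several P31 values contributes to every one of its groups.
    """
    triples = list(triples)

    # First pass: collect the groups (P31 values) of every item.
    groups = defaultdict(set)
    for subject, prop, obj in triples:
        if prop == instance_of:
            groups[subject].add(obj)

    # Second pass: count each non-P31 property once per group of its subject.
    usage = defaultdict(Counter)
    for subject, prop, obj in triples:
        if prop == instance_of:
            continue
        for group in groups.get(subject, ()):
            usage[group][prop] += 1
    return usage


# Illustrative sample: two scholarly articles and one book.
SAMPLE = [
    ("Q_article_1", "P31", "Q13442814"),
    ("Q_article_2", "P31", "Q13442814"),
    ("Q_book_1", "P31", "Q571"),
    ("Q_article_1", "P921", "Q_topic_a"),
    ("Q_article_2", "P921", "Q_topic_b"),
    ("Q_book_1", "P921", "Q_topic_a"),
]

if __name__ == "__main__":
    for group, counts in property_usage_by_group(SAMPLE).items():
        print(group, dict(counts))
```

Comparing the per-group counts for the same property shows how unevenly it is spread across subgraphs, which is the kind of imbalance the task description is concerned about.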
[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries
JAllemandou added a comment. Why not add other prefixes, if it's as simple as adding the prefix to the AQS list? I think there'll be more gotchas. Let's try @AKhatun_WMF :) TASK DETAIL https://phabricator.wikimedia.org/T285465 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Gehel, MPhamWMF, Lucas_Werkmeister_WMDE, Esc3300, dcausse, Aklapper, AKhatun_WMF, JAllemandou, Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282139: Provide a quantitative description of the Wikidata-triples dataset
JAllemandou closed this task as "Resolved". JAllemandou added a comment. The analysis is documented here: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Basic_Analysis. Thanks @AKhatun_WMF :) TASK DETAIL https://phabricator.wikimedia.org/T282139 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Esc3300, GoranSMilovanovic, CBogen, AKhatun_WMF, Aklapper, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries
JAllemandou added subscribers: MPhamWMF, Gehel. JAllemandou added a comment. Thanks @AKhatun_WMF for the analysis. @dcausse, @Gehel and @MPhamWMF: do you think it's worth trying to make our parser able to process queries with the 'mwapi' prefix (it represents 10% of all requests)? Otherwise this task can be closed. TASK DETAIL https://phabricator.wikimedia.org/T285465 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Gehel, MPhamWMF, Lucas_Werkmeister_WMDE, Esc3300, dcausse, Aklapper, AKhatun_WMF, JAllemandou, Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou added a subtask: T285465: Document and analyze the number of parsing errors for parsed WDQS queries. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse, Gehel, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T285465 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, AKhatun_WMF, JAllemandou, MPhamWMF, CBogen, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
For the month of June 2021, we wish to:
- Report the number of parsing errors when generating parsed queries information
- Provide information about why parsing errors happen
TASK DETAIL https://phabricator.wikimedia.org/T285465 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, AKhatun_WMF, JAllemandou, MPhamWMF, CBogen, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
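The two deliverables in the task description above (how many parsing errors, and why they happen) amount to a count plus a breakdown by reason. A minimal sketch of that aggregation in plain Python follows; the `(query_id, error_message_or_None)` input layout and the function name `summarize_parsing_errors` are assumptions made for the example, not the shape of the real parsed-query data.

```python
from collections import Counter


def summarize_parsing_errors(results):
    """Summarize parser outcomes.

    `results` is an iterable of (query_id, error) pairs, where `error`
    is None when parsing succeeded and a short reason string otherwise
    (a hypothetical layout).
    """
    results = list(results)
    reasons = Counter(err for _, err in results if err is not None)
    total = len(results)
    failed = sum(reasons.values())
    return {
        "total": total,
        "failed": failed,
        "error_rate": failed / total if total else 0.0,
        "reasons": reasons.most_common(),  # most frequent reason first
    }


# Made-up outcomes, for illustration only.
summary = summarize_parsing_errors([
    ("q1", None),
    ("q2", "unknown prefix"),
    ("q3", "unknown prefix"),
    ("q4", "syntax error"),
])
```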
[Wikidata-bugs] [Maniphest] T283256: Extract operator/nodes/triples/paths/exprs list from queries
JAllemandou added a comment. The problem I see with using a generic class in the `QueryElem` object is the conversion to parquet. I don't think it'll work out of the box, leading to having to devise our own conversion. Let's brainstorm on ideas on this, possibly in meeting to make it faster :) TASK DETAIL https://phabricator.wikimedia.org/T283256 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Gehel, dcausse, CBogen, Aklapper, AKhatun_WMF, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T283258: Provide a job regularly deleting wdqs processed query after 90 days
JAllemandou created this task. JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review, Discovery-Search (Current work). Restricted Application removed a project: Patch-For-Review. TASK DESCRIPTION This task is related to T273854 <https://phabricator.wikimedia.org/T273854>. When the job generating hourly query-info is launched, we should make sure we also delete the data after 90 days to be within our data-retention policy. TASK DETAIL https://phabricator.wikimedia.org/T283258 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Gehel, dcausse, CBogen, Aklapper, AKhatun_WMF, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
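The 90-day retention rule described above comes down to comparing each hourly partition against a cutoff date. The sketch below shows only that selection step, under the assumption that partitions are identified by `(year, month, day, hour)` tuples; the function name `expired_partitions` is hypothetical, and the actual deletion mechanism is not shown.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)


def expired_partitions(partitions, now):
    """Return the hourly partitions that are older than the retention period.

    `partitions` is an iterable of (year, month, day, hour) tuples
    (an assumed layout, mirroring hourly partitioning).
    """
    cutoff = now - RETENTION
    return [p for p in partitions if datetime(*p) < cutoff]


now = datetime(2021, 6, 1, 0)
expired = expired_partitions(
    [(2021, 3, 2, 23), (2021, 3, 3, 0), (2021, 5, 31, 12)],
    now,
)
```

Taking `now` as a parameter rather than calling `datetime.now()` inside the function keeps the selection deterministic and easy to unit-test.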
[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction
JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: dcausse, Aklapper, JAllemandou, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz, 4748kitoko, Nandana, Namenlos314, Akovalyov, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, terrrydactyl, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: dcausse, Aklapper, JAllemandou, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz, 4748kitoko, Nandana, Namenlos314, Akovalyov, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, terrrydactyl, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction
JAllemandou removed JAllemandou as the assignee of this task. JAllemandou added a subscriber: dcausse. JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: dcausse, Aklapper, JAllemandou, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz, 4748kitoko, Nandana, Namenlos314, Akovalyov, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, terrrydactyl, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou added a subtask: T273854: Automate regular WDQS query parsing and data-extraction. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse, Gehel, JAllemandou, Invadibot, Lalamarie69, MPhamWMF, maantietaja, Alter-paule, Beast1978, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T283256: Extract operator/nodes/triples/paths/exprs list from queries
JAllemandou created this task. JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review, Discovery-Search (Current work). Restricted Application removed a project: Patch-For-Review. TASK DESCRIPTION Augment query-analysis QueryInfo with a list of operators+nodes+paths(+exprs?) that will be populated in order of AST-visit (and saved in Parquet). One complexity of this task is to find a common representation suitable for parquet for the various different items. TASK DETAIL https://phabricator.wikimedia.org/T283256 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Gehel, dcausse, CBogen, Aklapper, AKhatun_WMF, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
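One possible shape for the "common representation" mentioned above is a flat record with the same few fields for every kind of element, emitted in AST-visit order. The sketch below is only an illustration of that idea: the `FlatElem` class, its field names, and the toy tuple-based AST are all hypothetical and are not taken from the actual query-analysis code.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FlatElem:
    """One row per visited element; every field is a primitive type."""
    position: int               # index in AST-visit order
    kind: str                   # "operator", "node", "path" or "expr"
    name: str                   # detailed name of the element
    text: Optional[str] = None  # printed subtree, for paths and expressions


def flatten(node, out=None):
    """Pre-order visit of a toy AST given as (kind, name, text, children)."""
    if out is None:
        out = []
    kind, name, text, children = node
    out.append(FlatElem(len(out), kind, name, text))
    for child in children:
        flatten(child, out)
    return out


# Toy AST, for illustration only.
ast = ("operator", "Project", None, [
    ("operator", "Join", None, [
        ("node", "?item", None, []),
        ("path", "PathSequence", "wdt:P31/wdt:P279*", []),
    ]),
])
rows = [asdict(elem) for elem in flatten(ast)]
```

Because every row has the same primitive-typed fields, a list like `rows` maps onto a single columnar schema without per-type handling, which is the property the task description is looking for.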
[Wikidata-bugs] [Maniphest] T283255: Create CLI job extracting info from wdqs queries
JAllemandou created this task. JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review, Discovery-Search (Current work). Restricted Application removed a project: Patch-For-Review. TASK DESCRIPTION The job should process data hourly. The expected parameters are `year`, `month`, `day`, `hour`, `input_table`, `output_table`, and an optional `num_partitions` that sets the number of output files (defaults to 1). TASK DETAIL https://phabricator.wikimedia.org/T283255 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Gehel, dcausse, CBogen, Aklapper, AKhatun_WMF, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
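The parameter list in the task description above maps directly onto a command-line interface. The sketch below shows one way to declare it with Python's `argparse`; the parameter names come from the task description, while the `--name` flag style, the description string, and the table names in the usage line are assumptions made for the example (the real job may well be written in another language).

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Extract query information for one hour of WDQS queries."
    )
    # The hour to process is identified by four integer parameters.
    for name in ("year", "month", "day", "hour"):
        parser.add_argument(f"--{name}", type=int, required=True)
    parser.add_argument("--input_table", required=True)
    parser.add_argument("--output_table", required=True)
    # Optional: number of output files, defaulting to 1.
    parser.add_argument("--num_partitions", type=int, default=1)
    return parser


# Hypothetical table names, for illustration only.
args = build_parser().parse_args([
    "--year", "2021", "--month", "6", "--day", "1", "--hour", "5",
    "--input_table", "example_db.input_queries",
    "--output_table", "example_db.processed_queries",
])
```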
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou closed subtask T282129: Test triple-analysis functions over a large dataset with Spark as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse, Gehel, JAllemandou, Invadibot, Lalamarie69, MPhamWMF, maantietaja, Alter-paule, Beast1978, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark
JAllemandou closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark
JAllemandou added a comment. Closing this task :) Thanks for the great work @AKhatun_WMF TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format
JAllemandou closed this task as "Resolved". JAllemandou added a comment. Great ! Thanks for that :) Closing the ticket. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou closed subtask T282130: Provide a way to save extracted query-information in parquet format as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse, Gehel, JAllemandou, Invadibot, Lalamarie69, MPhamWMF, maantietaja, Alter-paule, Beast1978, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format
JAllemandou added a comment. @AKhatun_WMF That's great! could you please provide some info on expected data-size in parquet (for daily data for instance)? Many thanks. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282139: Provide a quantitative description of the Wikidata-triples dataset
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
As a way to get familiar with the data, please provide quantitative information about the dataset using Spark in a notebook (probably using Python, as it makes charting easier).
The data can be found in: hdfs://analytics-hadoop/wmf/data/discovery/wikidata/rdf/date=20210419/wiki=wikidata
There are multiple snapshot dates available, as well as multiple wikis (`wikidata` and `commons`). Just pick one date with `wikidata` data :)
To list the available partitions, in hive or spark-sql:
use discovery;
show partitions wikibase_rdf;
TASK DETAIL https://phabricator.wikimedia.org/T282139 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, MPhamWMF, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou added a subtask: T282130: Provide a way to save extracted query-information in parquet format. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse, Gehel, JAllemandou, Invadibot, Lalamarie69, MPhamWMF, maantietaja, Alter-paule, Beast1978, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Being able to save the information in Parquet will be very useful, as it allows us to automatically process the queries as they flow in (hourly or daily, for instance), facilitating regular analysis. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou added a subtask: T282129: Test triple-analysis functions over a large dataset with Spark. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse, Gehel, JAllemandou, Invadibot, Lalamarie69, MPhamWMF, maantietaja, Alter-paule, Beast1978, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, MPhamWMF, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Once ready locally with unit-tests, apply the triple-analysis method to bigger data in spark (a day). TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, MPhamWMF, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou added a subtask: T282127: Add unit-tests to WDQS analysis toolkit. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse, Gehel, JAllemandou, Invadibot, Lalamarie69, MPhamWMF, maantietaja, Alter-paule, Beast1978, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282127: Add unit-tests to WDQS analysis toolkit
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Extract a set of queries (10 queries) from the events to be used as unit tests. This should help make sure the code is doing what we expect before running it on the cluster. TASK DETAIL https://phabricator.wikimedia.org/T282127 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T282127: Add unit-tests to WDQS analysis toolkit
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282127 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T281808: Wikidata all-json dumps not available from 2021-04-26
JAllemandou created this task. JAllemandou added projects: Wikidata, Dumps-Generation, Analytics. Restricted Application added a project: wdwb-tech. TASK DESCRIPTION Analytics load wikidata all-json dumps weekly on the hadoop cluster, and we have received an alert for dumps not being available from 2021-04-26 onward. TASK DETAIL https://phabricator.wikimedia.org/T281808 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: JAllemandou, Invadibot, maantietaja, jannee_e, Akuckartz, 4748kitoko, Nandana, Akovalyov, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, gnosygnu, terrrydactyl, Wikidata-bugs, aude, Addshore, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION
The current analysis parses queries and extracts:
- Operators (list, and map with number of usages)
- Nodes (variables, URIs, literals, blank nodes): map with number of usages
- Prefixes (map with number of usages)
- Services (map with number of usages)
- Wikidata names (URIs with main value matching regex `"^[QP]\\d+$"`)
- Expressions
- Paths
The values used to identify operators, expressions, paths or nodes are strings: either the detailed name (for operators or nodes, for instance), or the full print of the subtree portion (for paths or expressions, for instance).
One thing we badly miss for our analysis is triple-pattern-matching information: when a triple pattern is met, which form is it in ( , for instance), and what are the defined values it embeds (URIs, literals, etc.). With that information we should be able to be more precise in terms of triple-pattern usage in queries, possibly also getting a better feel for which subgraphs are heavily used.
TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, dcausse, Gehel, tanny411, JAllemandou, Invadibot, MPhamWMF, maantietaja, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
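Two pieces of the task description above lend themselves to a small example: the Wikidata-name regex, and the missing triple-pattern information (which positions are variables, and which defined values the pattern embeds). The sketch below is an illustration in plain Python. The regex is the one quoted in the task (written here as a Python raw string), but the function names, the "var"/"bound" encoding of the pattern form, and the convention that variables start with `?` are assumptions made for the example, not the representation chosen in the actual work.

```python
import re

# Same pattern as quoted in the task description, as a Python raw string.
WIKIDATA_NAME = re.compile(r"^[QP]\d+$")


def is_wikidata_name(local_name):
    """True for names such as 'Q42' or 'P31'."""
    return WIKIDATA_NAME.match(local_name) is not None


def pattern_form(subject, predicate, obj):
    """Describe a triple pattern.

    Returns (form, bound_values): `form` marks each position as "var"
    or "bound", and `bound_values` lists the defined terms in order.
    Terms starting with '?' are treated as variables (an assumption).
    """
    terms = (subject, predicate, obj)
    form = " ".join("var" if t.startswith("?") else "bound" for t in terms)
    bound_values = [t for t in terms if not t.startswith("?")]
    return form, bound_values


form, bound_values = pattern_form("?item", "wdt:P31", "wd:Q5")
```

Counting queries by `form`, and by the bound predicate within each form, would give the kind of triple-pattern usage figures the description asks for.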
[Wikidata-bugs] [Maniphest] T94019: Generate RDF from JSON
JAllemandou added a subscriber: dcausse. JAllemandou added a comment. Info: There already is a job in the cluster doing `TTL -> RDF` conversion. The TTL dumps are imported weekly, and converted to blazegraph RDF once available. The job is maintained by the Search Platform team (ping @dcausse :). TASK DETAIL https://phabricator.wikimedia.org/T94019 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: dcausse, Addshore, toan, Tonina_Zhelyazkova_WMDE, JAllemandou, Pintoch, Smalyshev, hoo, Liuxinyu970226, mkroetzsch, Aklapper, daniel, Invadibot, maantietaja, Alter-paule, Beast1978, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, GoranSMilovanovic, QZanden, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331 ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction
JAllemandou claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, JAllemandou, MPhamWMF, CBogen, Akuckartz, 4748kitoko, Nandana, Namenlos314, Akovalyov, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, terrrydactyl, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction
JAllemandou created this task. JAllemandou added projects: Analytics, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION This task is about running regular query-parsing jobs for WDQS and storing the result in a dedicated table on HDFS. TASK DETAIL https://phabricator.wikimedia.org/T273854
[Wikidata-bugs] [Maniphest] T266022: Programmatically categorize WDQS queries by potential alternative solution
JAllemandou added a comment. Ah! I realize I have not updated that task. The analysis can be found here: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis @CBogen: I'll let you handle the definition of done, and whether this task should be closed or not :) TASK DETAIL https://phabricator.wikimedia.org/T266022
[Wikidata-bugs] [Maniphest] T266022: Programmatically categorize WDQS queries by potential alternative solution
JAllemandou added a comment. The planned deadline was the end of last month. I ran into various issues that prevented me from meeting it. I started the actual work today (I had given it thought but hadn't coded anything) and wish to present results before the end of the month. TASK DETAIL https://phabricator.wikimedia.org/T266022
[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)
JAllemandou added a comment. Some more info on this aspect: I did a quick analysis of September queries today and found that my assumption that long queries were made by users from the UI is wrong.

First, total number of requests and sum of query time, split by whether queries took more than 1s or not:

+-------+---------+-----------+
|more_1s|requests |query_time |
+-------+---------+-----------+
|false  |160185762|11285161245|
|true   |2757758  |22233005459|
+-------+---------+-----------+

The proportions of the number of queries per time class are the same whether a referer is present (expected UI) or not (expected bot).

+-----------+----------------+---------+-----------+
|has_referer|query_time_class|count    |query_time |
+-----------+----------------+---------+-----------+
|false      |1_less_10ms     |8613461  |43244699   |
|false      |2_10ms_to_100ms |118036102|3382186064 |
|false      |3_100ms_to_1s   |28377288 |7058252741 |
|false      |4_1s_to_10s     |1815394  |6081683264 |
|false      |5_more_10s      |591957   |14313410554|
|true       |1_less_10ms     |24329    |133314     |
|true       |2_10ms_to_100ms |3123534  |140796917  |
|true       |3_100ms_to_1s   |2011048  |660547510  |
|true       |4_1s_to_10s     |310037   |800937814  |
|true       |5_more_10s      |40370    |1036973827 |
+-----------+----------------+---------+-----------+

Below is some information on the top-100 user-agents/referers making the most requests with a duration greater than 1s:

|user_agent|referer|requests_more_1s|query_time_more_1s|requests_less_1s|query_time_less_1s|
|ChemAxon-Marvin/20.15.0|null|209930|1816992218|0|0|
|SAP/1.0|null|143198|248447172|5100562|1834622022|
|okhttp/4.0.0-alpha02|null|128509|552415825|2967|679104|
|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML\, like Gecko) Chrome/50.0.2661.102 Safari/537.36|null|114446|467381034|5899288|441228641|
|commonscat_copy_from_P373 Pywikibot/3.1.dev0 (g6) requests/2.22.0 Python/2.7.13.final.0|null|99342|404639751|5907|4823011|
|sparqlwrapper 1.8.2 (rdflib.github.io/sparqlwrapper)|null|86830|2843969537|289618|45768268|
|Apache-HttpClient/4.5.10 (Java/1.8.0_242)|null|70949|1331327131|3127|2242093|
|bbw-bot|null|68936|292089223|1957742|234481876|
|MyCoolTool/0.1 dlworb1...@yonsei.ac.kr|null|52715|149532780|275170|34374373|
|python-requests/2.24.0|null|49917|458021897|364755|33714974|
|Drupal
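For readers who want to recompute the per-class shares from the `has_referer` / `query_time_class` table in the comment above, here is a minimal sketch. The counts are copied from that table; the helper names `class_shares` and `requests_over_1s` are hypothetical, and the script only prints the shares without drawing any conclusion from them.

```python
# (has_referer, query_time_class) -> request count, copied from the table in the comment.
COUNTS = {
    (False, "1_less_10ms"): 8613461,
    (False, "2_10ms_to_100ms"): 118036102,
    (False, "3_100ms_to_1s"): 28377288,
    (False, "4_1s_to_10s"): 1815394,
    (False, "5_more_10s"): 591957,
    (True, "1_less_10ms"): 24329,
    (True, "2_10ms_to_100ms"): 3123534,
    (True, "3_100ms_to_1s"): 2011048,
    (True, "4_1s_to_10s"): 310037,
    (True, "5_more_10s"): 40370,
}

SLOW_CLASSES = {"4_1s_to_10s", "5_more_10s"}


def class_shares(counts, has_referer):
    """Share of requests per time class, for one value of has_referer."""
    subset = {cls: n for (ref, cls), n in counts.items() if ref == has_referer}
    total = sum(subset.values())
    return {cls: n / total for cls, n in subset.items()}


def requests_over_1s(counts):
    """Total number of requests in the two classes above one second."""
    return sum(n for (_, cls), n in counts.items() if cls in SLOW_CLASSES)


if __name__ == "__main__":
    for has_referer in (False, True):
        for cls, share in sorted(class_shares(COUNTS, has_referer).items()):
            print(f"has_referer={has_referer} {cls}: {share:.2%}")
    print("requests over 1s:", requests_over_1s(COUNTS))
```

The per-class counts add up to the totals of the `more_1s` table, which is a quick way to check that the two tables describe the same set of requests.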
[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)
JAllemandou added a comment. I continued my analysis today, looking at the top-100 parsed user-agents from both the queries-with-referer subset and the queries-without-referer subset, over the month of September. See https://phabricator.wikimedia.org/P12933
- The queries-with-referer have a defined user-agent, meaning that the user-agent parser we use to extract structured information from the user-agent line provides values for a lot of its fields. By looking at the top-100 user-agents we actually cover more than 90% of requests made with a referer.
- The queries-without-referer have either an undefined or `Spider` user-agent, meaning that the user-agent line is either not parseable or is parsed as a bot. I manually inspected the user-agent lines and confirm that most of them look like bots (particularly the ones making the most requests). By looking at the top-100 user-agents we also cover more than 90% of requests made without a referer.

This confirms that, despite being few, the requests providing a referer seem trustworthy. There is therefore nothing more to do for this task; the data is already available.
TASK DETAIL https://phabricator.wikimedia.org/T261841
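The "top-100 user-agents cover more than 90% of requests" check in the comment above can be expressed as a small coverage function. This is a sketch only: `top_n_coverage` is a hypothetical helper, and the sample counts below are made up for illustration, not taken from the September data.

```python
from collections import Counter


def top_n_coverage(request_counts, n=100):
    """Fraction of all requests made by the n most active user-agents."""
    total = sum(request_counts.values())
    if total == 0:
        return 0.0
    top = Counter(request_counts).most_common(n)
    return sum(count for _, count in top) / total


if __name__ == "__main__":
    # Made-up request counts per user-agent, for illustration only.
    sample = {"agent-a": 900, "agent-b": 80, "agent-c": 15, "agent-d": 5}
    print(f"top-2 coverage: {top_n_coverage(sample, n=2):.1%}")
```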
[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)
JAllemandou added a comment. Heya - I'm sorry I completely missed the ping :S

Quick analysis:

    spark.sql("""
      SELECT (http.request_headers['referer'] IS NOT NULL) as defined_referer, count(1) as c
      from event.wdqs_external_sparql_query
      where year = 2020 and month = 9
      group by (http.request_headers['referer'] IS NOT NULL)
      limit 100
    """).show(100, false)

+---------------+---------+
|defined_referer|c        |
+---------------+---------+
|false          |165201676|
|true           |5613278  |
+---------------+---------+

--> 3.3% of requests have a referer defined for September.

Among those 3.3%, here is the top 10:

    spark.sql("""
      SELECT http.request_headers['referer'] as referer, count(1) as c
      from event.wdqs_external_sparql_query
      where year = 2020 and month = 9 and http.request_headers['referer'] IS NOT NULL
      group by http.request_headers['referer']
      order by c desc
      limit 10
    """).show(10, false)

+-------------------------------------------------+-------+
|referer                                          |c      |
+-------------------------------------------------+-------+
|https://query.wikidata.org/                      |2730003|
|https://labs.minutelabs.io/Tree-of-Life-Explorer/|307426 |
|https://www.wikidata.org/                        |212431 |
|https://labs.minutelabs.io/                      |138757 |
|https://ru.wikipedia.org/                        |107558 |
|https://query.wikidata.org/embed.html            |102165 |
|https://wlmuk.toolforge.org/                     |96946  |
|https://maps.wikilovesmonuments.org/             |89894  |
|https://wikishootme.toolforge.org/               |87632  |
|https://en.wikipedia.org/                        |62147  |
+-------------------------------------------------+-------+

--> Using headers over a month, https://query.wikidata.org/ queries represent 1.6% of queries.

Having 3.3% of requests with a referer seems small. If someone with a better gut feeling for that could chime in, that'd be great; otherwise I'm going to try a more advanced user-agent analysis on the data and judge whether it feels organic or not.
TASK DETAIL https://phabricator.wikimedia.org/T261841
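The two percentages quoted in the comment above follow directly from the counts in its result tables. A minimal sketch of the arithmetic (the variable names and the `share` helper are hypothetical; the numbers are copied from the tables):

```python
def share(part, total):
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


# Counts copied from the two result tables in the comment.
with_referer = 5613278
without_referer = 165201676
query_ui = 2730003  # requests whose referer is https://query.wikidata.org/

total = with_referer + without_referer

print(f"{share(with_referer, total):.1f}% of requests have a referer defined")
print(f"{share(query_ui, total):.1f}% of requests have referer https://query.wikidata.org/")
```

Note that both percentages are taken over all requests, not over the referer-defined subset.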
[Wikidata-bugs] [Maniphest] T258269: Add query result to the current WDQS event logging
JAllemandou added a comment. In terms of logging size, it probably depends on the result type: for descriptions or other text-heavy fields, this could get big if a high `LIMIT` or no `LIMIT` is set on the number of returned rows. We should set a limit :) TASK DETAIL https://phabricator.wikimedia.org/T258269
[Wikidata-bugs] [Maniphest] T261937: Add CPU load and query concurrency as context to event logging from WDQS
JAllemandou added a comment. Will make it a lot easier to analyze than to have to build the 'in-flight' view of queries! TASK DETAIL https://phabricator.wikimedia.org/T261937
[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS
JAllemandou added a comment. @GoranSMilovanovic I have indeed done some analysis using the Apache Jena parser to extract the algebraic representation of queries. Not yet at the level of completion I'd like though. I'll be on holidays until August 15th starting tonight - let's discuss when I come back? TASK DETAIL https://phabricator.wikimedia.org/T248308
[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS
JAllemandou added a comment. @GoranSMilovanovic I finally published a wiki page with most of the results I found: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Traffic_Analysis Sorry for the delay ... TASK DETAIL https://phabricator.wikimedia.org/T248308
[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS
JAllemandou added a comment.

    SELECT
      http.request_headers['user-agent'],
      user_agent_map,
      count(1) as c
    FROM event.wdqs_external_sparql_query
    WHERE year = 2020 and month = 5 and day = 1
    GROUP BY http.request_headers['user-agent'], user_agent_map
    ORDER BY c DESC
    LIMIT 100;

TASK DETAIL https://phabricator.wikimedia.org/T248308
[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS
JAllemandou added a comment.
> First step: analyze the frequency distribution of the user_agent field (string) from wmf.webrequest where queries are SPARQL.
I suggest you use events instead of webrequest: `event.wdqs_internal_sparql_query` and `event.wdqs_external_sparql_query`. I have done some work encompassing user-agent frequency analysis and I'm in the process of writing up the findings by the end of this week.
TASK DETAIL https://phabricator.wikimedia.org/T248308
[Wikidata-bugs] [Maniphest] [Closed] T249319: Remove wb_terms from sqoop
JAllemandou closed this task as "Resolved". JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T249319
[Wikidata-bugs] [Maniphest] [Commented On] T253753: Increase retention for mediawiki.revision-create on the kafka jumbo cluster
JAllemandou added a comment. An idea: How about sending the update stream back to kafka and making THAT one's retention higher? Moving retention to 30 days for revision-create will keep a lot of data that isn't necessary (about half of the data), while keeping only the updates should be enough. Just an idea :) TASK DETAIL https://phabricator.wikimedia.org/T253753
[Wikidata-bugs] [Maniphest] [Commented On] T236895: ArticlePlaceholder dashboard stopped tracking page views
JAllemandou added a comment. Patch needs to be deployed before the dashboard shows data. TASK DETAIL https://phabricator.wikimedia.org/T236895
[Wikidata-bugs] [Maniphest] [Commented On] T246237: Extract some statistics on the use of the isBlank() function in wdqs query logs
JAllemandou added a comment. Events using `isBlank` since the beginning of the year are now stored here: `/user/joal/wdqs_queries/2020_use_isBlank/wdqs_use_is_blank_202002.json`. There are ~56k events, stored in JSON format in a single file to facilitate analysis. TASK DETAIL https://phabricator.wikimedia.org/T246237
[Wikidata-bugs] [Maniphest] [Commented On] T246237: Extract some statistics on the use of the isBlank() function in wdqs query logs
JAllemandou added a comment. As I was working on getting a better idea of the queries, I got some results relatively easily. Since the beginning of the year:
- Internal cluster: no request using `isBlank()`, 481202298 requests total
- External cluster: 54669 requests using `isBlank()`, 202695416 requests total (0.03%)

I can provide more details as needed :)
TASK DETAIL https://phabricator.wikimedia.org/T246237
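One simple way to flag queries that call `isBlank()` is a case-insensitive text match, sketched below. This is an assumption about how such a count could be done, not the method used for the numbers above: the helper names are hypothetical, and a plain regex will also match occurrences inside string literals or comments, which a parser-based approach would avoid.

```python
import re

# SPARQL keywords and built-in function names are case-insensitive, so match accordingly.
IS_BLANK_CALL = re.compile(r"\bisBlank\s*\(", re.IGNORECASE)


def uses_is_blank(query):
    """True if the query text contains something that looks like an isBlank(...) call."""
    return IS_BLANK_CALL.search(query) is not None


def is_blank_share(queries):
    """Percentage of queries that appear to call isBlank()."""
    queries = list(queries)
    if not queries:
        return 0.0
    return 100.0 * sum(uses_is_blank(q) for q in queries) / len(queries)


if __name__ == "__main__":
    sample = [
        "SELECT ?s WHERE { ?s ?p ?o FILTER(isBlank(?o)) }",
        "SELECT ?s WHERE { ?s ?p ?o }",
    ]
    print(f"{is_blank_share(sample):.1f}% of sample queries use isBlank()")
    # Share reported for the external cluster, recomputed from the counts in the comment.
    print(f"{100.0 * 54669 / 202695416:.2f}%")
```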
[Wikidata-bugs] [Maniphest] [Retitled] T209655: Copy Wikidata dumps to HDFS + parquet
JAllemandou renamed this task from "Copy Wikidata dumps to HDFS" to "Copy Wikidata dumps to HDFS + parquet". TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Updated] T209655: Copy Wikidata dumps to HDFS
JAllemandou added a subtask: T243832: Fix hdfs-rsync `prune-empty-dirs` feature. TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Claimed] T209655: Copy Wikidata dumps to HDFS
JAllemandou claimed this task. JAllemandou added a project: Analytics-Kanban. JAllemandou set the point value for this task to "5". TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Updated] T236895: ArticlePlaceholder dashboard stopped tracking page views
JAllemandou added a project: Analytics-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T236895
[Wikidata-bugs] [Maniphest] [Commented On] T236895: ArticlePlaceholder dashboard stopped tracking page views
JAllemandou added a comment. The patch merged by @Nuria had a bug. I commented on the already-merged patch with a solution. For the moment the job is not started. TASK DETAIL https://phabricator.wikimedia.org/T236895
[Wikidata-bugs] [Maniphest] [Commented On] T239898: Investigate triple counts difference between dumps and what blazegraph reports
JAllemandou added a comment. Chiming in: I suggest using Spark for investigations - Given the size of the dataset, parallel computation should help. This means another hop for the data: --> stat1004 --> HDFS. Please ping if you want/need help :) TASK DETAIL https://phabricator.wikimedia.org/T239898
[Wikidata-bugs] [Maniphest] [Changed Subscribers] T209655: Copy Wikidata dumps to HDFS
JAllemandou added a subscriber: Groceryheist. JAllemandou added a comment. New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items per page.

    hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1
    drwxr-xr-x - analytics joal 0 2019-12-04 18:31 /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202

    hdfs dfs -ls /user/joal/wmf/data/wmf/wikidata/item_page_link/ | tail -1
    drwxr-xr-x - joal joal 0 2019-12-04 18:50 /user/joal/wmf/data/wmf/wikidata/item_page_link/20191202

TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Updated] T239471: Sqoop wikidata terms tables into hadoop
JAllemandou added a project: Analytics-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T239471
[Wikidata-bugs] [Maniphest] [Commented On] T101013: Log Wikidata Query Service queries to the event gate infrastructure
JAllemandou added a comment. Does this being closed mean we can access data on kafka? TASK DETAIL https://phabricator.wikimedia.org/T101013
[Wikidata-bugs] [Maniphest] [Updated] T236895: ArticlePlaceholder dashboard stopped tracking page views
JAllemandou added a comment. I think this problem could be related to T226730 (preventing most `Special:XXX` pages from being flagged as pageviews). TASK DETAIL https://phabricator.wikimedia.org/T236895
[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFs
JAllemandou added a comment. This is done @GoranSMilovanovic. Raw data is here: `/user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902` and parquet data is here: `/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902` TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFs
JAllemandou added a comment. @GoranSMilovanovic: You're welcome :) At some point I'll manage to have that productionized ;) TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Updated] T220977: Investigate surprising rise in mobile page views for wikidata
JAllemandou added a comment. A lot trickier :) We have the `wmf_raw.mediawiki_private_cu_changes` table in hive, allowing us to compute geo-editors (editors-by-country, aggregated). This table only contains 3 months of data, for PII removal reasons. It's probably not enough for what you're after, but I have nothing better (see https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_monthly_data.hql for an example). I've just created T223444 <https://phabricator.wikimedia.org/T223444> to submit the general idea of having geo-editors stats split by desktop/mobile. TASK DETAIL https://phabricator.wikimedia.org/T220977
[Wikidata-bugs] [Maniphest] [Commented On] T220977: Investigate surprising rise in mobile page views for wikidata
JAllemandou added a comment. Hi @Lea_WMDE and @GoranSMilovanovic - I think the answer to your problem is available in this month's snapshot, through the `revision_tags` field of mediawiki_history:

  spark.sql("""
    SELECT
      substr(event_timestamp, 0, 4) as year,
      array_contains(revision_tags, 'mobile edit') as mobile,
      array_contains(revision_tags, 'mobile app edit') as mobile_app,
      count(1) as c
    FROM wmf.mediawiki_history
    WHERE snapshot = '2019-04'
      AND wiki_db = 'wikidatawiki'
      AND event_entity = 'revision'
    GROUP BY
      substr(event_timestamp, 0, 4),
      array_contains(revision_tags, 'mobile edit'),
      array_contains(revision_tags, 'mobile app edit')
    ORDER BY year, mobile, mobile_app desc
  """).show(100, false)

  +----+------+----------+---------+
  |year|mobile|mobile_app|c        |
  +----+------+----------+---------+
  |2004|null  |null      |146      |
  |2005|null  |null      |495      |
  |2006|null  |null      |1838     |
  |2007|null  |null      |2814     |
  |2008|null  |null      |2384     |
  |2009|null  |null      |2175     |
  |2010|null  |null      |1650     |
  |2011|null  |null      |1354     |
  |2012|null  |null      |2912961  |
  |2012|false |false     |4        |
  |2013|null  |null      |94142292 |
  |2013|false |false     |181133   |
  |2014|null  |null      |69236941 |
  |2014|false |true      |2        |
  |2014|false |false     |18174243 |
  |2014|true  |false     |51       |
  |2015|null  |null      |76088107 |
  |2015|false |true      |586      |
  |2015|false |false     |26269493 |
  |2015|true  |false     |4058     |
  |2016|null  |null      |82178134 |
  |2016|false |false     |53308675 |
  |2016|true  |true      |618      |
  |2016|true  |false     |24248    |
  |2017|null  |null      |109041593|
  |2017|false |false     |83147234 |
  |2017|true  |true      |114906   |
  |2017|true  |false     |49836    |
  |2018|null  |null      |141536855|
  |2018|false |false     |67149958 |
  |2018|true  |true      |186065   |
  |2018|true  |false     |71822    |
  |2019|null  |null      |55814156 |
  |2019|false |false     |49994060 |
  |2019|true  |true      |85968    |
  |2019|true  |false     |23867    |
  +----+------+----------+---------+

TASK DETAIL https://phabricator.wikimedia.org/T220977
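To read the result above as a proportion, here is a minimal plain-Scala sketch (no Spark needed) that computes the share of revisions tagged 'mobile edit' for one year. The `Row` case class and the `mobileShare` helper are hypothetical names introduced for illustration. The rows with `null` flags are revisions whose `revision_tags` array is NULL (Spark's `array_contains` returns null for a null array); counting them as non-mobile is an assumption of this sketch, not something the query states.

```scala
object MobileShare {
  // One aggregated result row: year, the two tag flags
  // (None when revision_tags is NULL), and the revision count.
  final case class Row(year: Int, mobile: Option[Boolean], mobileApp: Option[Boolean], count: Long)

  // The four 2018 rows, copied from the query output.
  val rows2018: Seq[Row] = Seq(
    Row(2018, None, None, 141536855L),
    Row(2018, Some(false), Some(false), 67149958L),
    Row(2018, Some(true), Some(true), 186065L),
    Row(2018, Some(true), Some(false), 71822L)
  )

  // Share of revisions tagged 'mobile edit' among all revisions of a year.
  // ASSUMPTION: rows with NULL tags are counted as non-mobile.
  def mobileShare(rows: Seq[Row], year: Int): Double = {
    val ofYear = rows.filter(_.year == year)
    val total = ofYear.map(_.count).sum
    val mobile = ofYear.filter(_.mobile.contains(true)).map(_.count).sum
    if (total == 0L) 0.0 else mobile.toDouble / total
  }

  def main(args: Array[String]): Unit =
    println(f"2018 mobile share: ${mobileShare(rows2018, 2018) * 100}%.3f%%")
}
```

Restricting the denominator to rows with non-null flags would give a different share, so which population is the right one depends on the question being asked.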
[Wikidata-bugs] [Maniphest] [Commented On] T94019: Generate RDF from JSON
JAllemandou added a comment. The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization. TASK DETAIL https://phabricator.wikimedia.org/T94019
[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday
JAllemandou added a comment. Community has spoken, we'll find workarounds - Thanks a lot @ArielGlenn for helping drive this :) TASK DETAIL https://phabricator.wikimedia.org/T216160
[Wikidata-bugs] [Maniphest] [Commented On] T218901: Track number of Wikidata edits by namespace
JAllemandou added a comment. Some queries are computed using hadoop for wikidata (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent-changes works for you, that's great :) TASK DETAIL https://phabricator.wikimedia.org/T218901
[Wikidata-bugs] [Maniphest] [Commented On] T218901: Track number of Wikidata edits by namespace
JAllemandou added a comment. Reading about this - Would delayed data be interesting? This information is accessible in hadoop :) TASK DETAIL https://phabricator.wikimedia.org/T218901
[Wikidata-bugs] [Maniphest] [Updated] T209655: Copy Wikidata dumps to HDFs
JAllemandou added a comment. Most of the complicated things already exist for this to work (an equivalent of rsync for HDFS, and a spark job converting wikidata json dumps to parquet). I wanted T216160 <https://phabricator.wikimedia.org/T216160> to be settled before moving into productionization (having the same date for the various dumps we handle simplifies things quite a bit), and it takes time. TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Commented On] T214897: data for analyzing and visualizing the identifier landscape of Wikidata
JAllemandou added a comment. Hey @GoranSMilovanovic - I don't have a good understanding of what you're after, but having read pairs and contingency table above, maybe this Spark function could be helpful: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/DataFrameStatFunctions.html TASK DETAIL https://phabricator.wikimedia.org/T214897
[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday
JAllemandou added a comment. In T216160#5020236 <https://phabricator.wikimedia.org/T216160#5020236>, @ArielGlenn wrote: > By Friday I'll have done that; by next Wednesday let's make a decision, barring any huge obstacles. Awesome, thanks @ArielGlenn :) TASK DETAIL https://phabricator.wikimedia.org/T216160
[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday
JAllemandou added a comment. Following up on this: another viable solution to get monthly coherence between dumps is to force a dump on the 1st of the month ... I'm not sure the idea is better. @ArielGlenn - How do we proceed to try moving forward (in either direction)? TASK DETAIL https://phabricator.wikimedia.org/T216160
[Wikidata-bugs] [Maniphest] [Commented On] T217821: Investigate duplication of strings in wb_terms table for wikidatawiki
JAllemandou added a comment. Exact analysis, run on 2018-12-06:

  val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001")
  val base_rdd = df.select("labels", "descriptions", "aliases").rdd

  // Flatten every label, description and alias into a single collection of strings
  val strings = base_rdd.flatMap(r => {
    r.getMap[String,String](0).values ++
    r.getMap[String,String](1).values ++
    r.getMap[String,Seq[String]](2).values.flatMap(l => l)
  })

  // Count the occurrences of each distinct string
  val grouped_strings = strings.map(s => (s, 1)).reduceByKey(_+_)

  val total_bytes = grouped_strings.map(t => t._1.getBytes.length * t._2).sum()
  val duplicate_bytes = grouped_strings.map(t => t._1.getBytes.length * (t._2 - 1)).sum()

  println(f"Total bytes for strings: $total_bytes%15.0f")
  println(f"Total duplicate bytes for strings: $duplicate_bytes%15.0f")
  println(f"Useful bytes for strings: ${total_bytes - duplicate_bytes}%15.0f")
  // Total bytes for strings:           45,724,033,674
  // Total duplicate bytes for strings: 41,630,588,801
  // Useful bytes for strings:           4,093,444,873
  // Useful is 1 order of magnitude less than used

  // Triple check useful bytes for strings:
  grouped_strings.map(_._1.getBytes.length).sum() == (total_bytes - duplicate_bytes)
  // true

  // How many unique strings?
  grouped_strings.count()
  // 98,524,732

  // How many strings with 1 instance?
  grouped_strings.filter(t => t._2 == 1).count()
  // 72,584,179
  // Leaving 25,940,553 unique strings having multiple instances

  // --> If we go for table-indirection, we'll need ~100M integer ids (4 bytes each)
  // --> 400,000,000 bytes - 1 order of magnitude less than unique string size

TASK DETAIL https://phabricator.wikimedia.org/T217821
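The same bookkeeping can be tried on a handful of strings without a cluster. The following is a minimal plain-Scala sketch, not the job that was run: `byteStats` and `idBytes` are hypothetical helper names, `groupBy` stands in for `reduceByKey`, and bytes are counted as UTF-8 (the analysis above uses the platform default charset).

```scala
object StringDuplication {
  // Returns (totalBytes, duplicateBytes) for a collection of strings.
  // duplicateBytes counts every occurrence of a string beyond its first.
  def byteStats(strings: Seq[String]): (Long, Long) = {
    // Occurrence count per distinct string (local stand-in for reduceByKey).
    val counts: Map[String, Int] =
      strings.groupBy(identity).map { case (s, occ) => (s, occ.size) }
    // toSeq first, so that two strings with equal (length, count) are both kept.
    val sized = counts.toSeq.map { case (s, n) => (s.getBytes("UTF-8").length.toLong, n.toLong) }
    val total = sized.map { case (len, n) => len * n }.sum
    val duplicate = sized.map { case (len, n) => len * (n - 1) }.sum
    (total, duplicate)
  }

  // Size of an id column for table indirection.
  // ASSUMPTION: one fixed-width id per unique string, 4 bytes by default.
  def idBytes(uniqueStrings: Long, bytesPerId: Int = 4): Long = uniqueStrings * bytesPerId

  def main(args: Array[String]): Unit = {
    val sample = Seq("Berlin", "Berlin", "Berlin", "Paris", "city", "city")
    val (total, duplicate) = byteStats(sample)
    println(s"total=$total duplicate=$duplicate useful=${total - duplicate}")
    println(s"id bytes for 98,524,732 unique strings: ${idBytes(98524732L)}")
  }
}
```

On the sample, "Berlin" contributes 18 bytes of which 12 are duplicates, and "city" contributes 8 bytes of which 4 are duplicates, so 15 of the 31 bytes are useful.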
[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs
JAllemandou added a comment. Hi @Isaac, sorry for the issue. I corrected the query above (last query, join criteria: `AND ws.sitelink.title = title_namespace_localized` --> `AND REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized`). We were not joining correctly on title, as mediawiki-history encodes titles with underscores while the wikidata dump uses spaces. Problem solved, data regenerated at the same place as before, and a double check on enwiki numbers looks good: 5.96M pages have an item in namespace 0 (7.95M for all namespaces). TASK DETAIL https://phabricator.wikimedia.org/T215616
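The effect of the corrected join criteria can be shown with a minimal plain-Scala sketch. The sample pairs, `normalizeTitle` and `matched` are hypothetical and only illustrate the space/underscore mismatch; they are not the query or the data used for the task.

```scala
object TitleJoin {
  // Wikidata sitelink titles use spaces, mediawiki_history titles use
  // underscores: mirror REPLACE(title, ' ', '_') before comparing.
  def normalizeTitle(sitelinkTitle: String): String = sitelinkTitle.replace(' ', '_')

  // Illustrative sample: (item id, sitelink title as found in the wikidata dump).
  val sitelinks: Seq[(String, String)] = Seq(
    "Q64" -> "Berlin",
    "Q90" -> "Paris",
    "Q60" -> "New York City"
  )

  // Illustrative sample of titles as encoded on the mediawiki_history side.
  val historyTitles: Set[String] = Set("Berlin", "Paris", "New_York_City")

  // Ids of the items whose sitelink title finds a match, with or without normalization.
  def matched(normalize: Boolean): Seq[String] =
    sitelinks.collect {
      case (id, title)
          if historyTitles.contains(if (normalize) normalizeTitle(title) else title) => id
    }

  def main(args: Array[String]): Unit = {
    println(s"raw join:        ${matched(normalize = false)}")
    println(s"normalized join: ${matched(normalize = true)}")
  }
}
```

Single-word titles match either way, which is why a join on raw titles can look plausible while silently dropping every multi-word title.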
[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs
JAllemandou added a comment. We're on the same page @diego :) I can precompute the table described in ii) if needed, and will surely do it once we have the wikidata-dump productionized - Let me know if you need it before then. TASK DETAIL https://phabricator.wikimedia.org/T215616
[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday
JAllemandou added a comment. I can't speak about failures and restarts as I don't know much about the dumps-generation process. @ArielGlenn would be the person to know best. As for the dates, the main reason we ask for the change is date consistency by month, mimicking what exists for xml dumps. TASK DETAIL https://phabricator.wikimedia.org/T216160
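To make the date-consistency argument concrete, here is a minimal Scala sketch using `java.time`. It compares the first dump of each month under a fixed-weekday schedule with a fixed day-of-month schedule. Monday as the dump weekday and the 1st as the fixed day are assumptions chosen for illustration, and both helper names are hypothetical.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

object DumpDates {
  // First dump of a month under a weekly, fixed-weekday schedule:
  // the first occurrence of that weekday in the month.
  def firstWeekdayDump(year: Int, month: Int, day: DayOfWeek): LocalDate =
    LocalDate.of(year, month, 1).`with`(TemporalAdjusters.firstInMonth(day))

  // Dump date under a fixed day-of-month schedule (ASSUMPTION: the 1st by default).
  def fixedDayDump(year: Int, month: Int, dayOfMonth: Int = 1): LocalDate =
    LocalDate.of(year, month, dayOfMonth)

  def main(args: Array[String]): Unit =
    (1 to 6).foreach { m =>
      val weekly = firstWeekdayDump(2019, m, DayOfWeek.MONDAY)
      val fixed = fixedDayDump(2019, m)
      println(s"weekday schedule: $weekly   fixed-day schedule: $fixed")
    }
}
```

Under the weekday schedule the day-of-month of the first dump drifts from month to month, while the fixed-day schedule always yields the same day, which is what lets dumps of different kinds share one date per month.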