[Wikidata-bugs] [Maniphest] T336361: [Analytics] Identify access from mobile vs. desktop devices

2023-09-07 Thread JAllemandou
JAllemandou added a comment. > However, my assumption is that when only filtering for agent_type != 'spider' the population will still include a lot of non-UI hits. The `agent_type` field currently can take 3 values: `spider`, `automated` and `user`. The `spider` one is used when u
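To illustrate the point about the three `agent_type` values, here is a minimal sketch in plain Python (not the actual Hive/Spark query). The sample rows and the helper name `count_by_agent_type` are hypothetical, introduced only for illustration. It shows that excluding only `spider` still leaves `automated` rows in the population.

```python
from collections import Counter

# Hypothetical sample rows; real data would come from the relevant traffic table.
SAMPLE_ROWS = [
    {"agent_type": "user"},
    {"agent_type": "automated"},
    {"agent_type": "spider"},
    {"agent_type": "user"},
]


def count_by_agent_type(rows, exclude=("spider",)):
    """Count rows per agent_type after dropping the excluded values."""
    return Counter(r["agent_type"] for r in rows if r["agent_type"] not in exclude)


# Equivalent of filtering on agent_type != 'spider': 'automated' rows remain.
not_spider = count_by_agent_type(SAMPLE_ROWS)
# Equivalent of filtering on agent_type = 'user'.
user_only = count_by_agent_type(SAMPLE_ROWS, exclude=("spider", "automated"))

print(dict(not_spider))
print(dict(user_only))
```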

[Wikidata-bugs] [Maniphest] T342416: Set data permission on new snapshot generation (discovery.wikibase_rdf)

2023-08-18 Thread JAllemandou
JAllemandou added a comment. In T342416#9101868 <https://phabricator.wikimedia.org/T342416#9101868>, @EBernhardson wrote: > These are both generated by spark. The rdf is being imported by a scala application while the cirrus dump is imported by pyspark, but they should both

[Wikidata-bugs] [Maniphest] T342416: Set data permission on new snapshot generation (discovery.wikibase_rdf)

2023-08-18 Thread JAllemandou
JAllemandou added a comment. In T342416#9091146 <https://phabricator.wikimedia.org/T342416#9091146>, @EBernhardson wrote: > I looked into these, the attached patch should fix it but it leaves an open question (@JAllemandou): > > The `core-site.xml`, along with pupp

[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3

2023-07-03 Thread JAllemandou
JAllemandou added a comment. We met this morning with @AndrewTavis_WMDE and @Manuel - Thank you folks for the great meeting. The detailed Meeting notes are here: https://docs.google.com/document/d/1REsolXnZf2KqApL0p-DE8X4eWXI_zxHgrCe3k1hcZnw From the job list in previous comment

[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3

2023-06-30 Thread JAllemandou
JAllemandou added a comment. Hi @AndrewTavis_WMDE, I've done some investigation, and here is what I have: Goran has 11 CRON jobs running from various hosts on our system (1 on `stat1004`, 2 on `stat1007`, 7 on `stat1008`). - `WDCM_Sqoop_Clients` runs on `stat1004` weekly - It doesn't

[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3

2023-06-22 Thread JAllemandou
JAllemandou added a comment. In T334951#8952583 <https://phabricator.wikimedia.org/T334951#8952583>, @AndrewTavis_WMDE wrote: > - If the answer to the above question of permanently losing some data that's being produced by Concepts Monitor and other WMDE jobs is no, then

[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3

2023-06-19 Thread JAllemandou
JAllemandou added a comment. In T334951#8946790 <https://phabricator.wikimedia.org/T334951#8946790>, @AndrewTavis_WMDE wrote: > I'll async with him now and see if we can come to a decision sooner than that, but you all will have the answer by Wednesday at the latest 

[Wikidata-bugs] [Maniphest] T334951: Wikidata Concepts Monitor ETL Migration to Spark3

2023-06-19 Thread JAllemandou
JAllemandou added a comment. Hi Folks - What is the status on this one? I'd like Data-Engineering to announce the deprecation of Spark2 for this end of month, but not without knowing how we plan on tackling your job :) Here are the 2 possible solutions I can think of: - Stopping

[Wikidata-bugs] [Maniphest] T303831: Productionize Wikidata subgraph analysis

2022-09-15 Thread JAllemandou
JAllemandou added a comment. In T303831#8237323 <https://phabricator.wikimedia.org/T303831#8237323>, @EBernhardson wrote: > data cleanup looks to now have run successfully Thanks a lot @EBernhardson for finalizing on this :) TASK DETAIL https://phabricator.wikimedia.or

[Wikidata-bugs] [Maniphest] T303831: Productionize Wikidata subgraph analysis

2022-08-26 Thread JAllemandou
JAllemandou added a comment. In T303831#8175252 <https://phabricator.wikimedia.org/T303831#8175252>, @EBernhardson wrote: > @JAllemandou The one remaining piece of this ticket is cleaning up the historical data, per T303831#8081172 <https://phabricator.wikimedia.org/T30

[Wikidata-bugs] [Maniphest] T303831: Productionize Wikidata subgraph analysis

2022-07-13 Thread JAllemandou
JAllemandou added a comment. Thanks a lot @EBernhardson for the help on finishing this! TASK DETAIL https://phabricator.wikimedia.org/T303831 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: EBernhardson, dcausse

[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake

2022-04-08 Thread JAllemandou
JAllemandou closed subtask T299059: Write an Airflow job converting commons structured data dump to Hive as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T258834 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: AKhatun_WMF

[Wikidata-bugs] [Maniphest] T299059: Write an Airflow job converting commons structured data dump to Hive

2022-04-08 Thread JAllemandou
JAllemandou closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T299059 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Snwachukwu, JAllemandou Cc: Cparle, nettrom_WMF, Miriam, Nuria, cchen, AKhatun_WMF, J

[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake

2022-03-30 Thread JAllemandou
JAllemandou closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T258834 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: AKhatun_WMF, JAllemandou, cchen, Nuria, Miriam, nettrom_WMF, Fernandob

[Wikidata-bugs] [Maniphest] T252443: Create dashboard to show growth of structured data on Commons over time

2022-03-30 Thread JAllemandou
JAllemandou closed subtask T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T252443 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: cchen, JAllemandou Cc

[Wikidata-bugs] [Maniphest] T300240: Missing Wikidata RDF (ttl and nt) dumps for 20220117

2022-03-07 Thread JAllemandou
JAllemandou added a comment. Thank you for letting us know :) TASK DETAIL https://phabricator.wikimedia.org/T300240 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: ArielGlenn, Aklapper, JAllemandou, AKhatun_WMF, dcausse

[Wikidata-bugs] [Maniphest] T299059: Write an Airflow job converting commons structured data dump to Hive

2022-01-12 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Product-Analytics, Structured-Data-Backlog, Wikidata-Query-Service, Wikidata, Data-Engineering, Discovery-Search (Current work), Patch-For-Review, Data-Engineering-Kanban. Restricted Application removed a project: Patch-For-Review. TASK

[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake

2021-12-16 Thread JAllemandou
JAllemandou added a comment. Code is ready: - Import `commons-mediainfo` json dumps to HDFS (https://gerrit.wikimedia.org/r/738874) - Update spark transformation job to work with both wikidata and commons dumps (https://gerrit.wikimedia.org/r/739129) - Update `wikidata_entity` table

[Wikidata-bugs] [Maniphest] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake

2021-12-15 Thread JAllemandou
JAllemandou added a project: Data-Engineering-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T258834 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: AKhatun_WMF, JAllemandou, cchen, Nuria, Miriam, nettrom_WMF, 786, EChetty

[Wikidata-bugs] [Maniphest] T291205: Analysis: Property usage by items' P31

2021-09-16 Thread JAllemandou
JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T291205 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, Jmixter87, JAllemandou, MPhamWMF, CBogen, Namenlos314, Gq86

[Wikidata-bugs] [Maniphest] T291205: Analysis: Property usage by items' P31

2021-09-16 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION It is interesting to understand how properties are used by different content subgraphs (for instance humans, scholarly articles etc

[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries

2021-07-19 Thread JAllemandou
JAllemandou added a comment. Why not add other prefixes, if it's as simple as adding the prefix to the AQS list? I think there'll be more gotchas. Let's try @AKhatun_WMF :) TASK DETAIL https://phabricator.wikimedia.org/T285465 EMAIL PREFERENCES https://phabricator.wikimedia.org

[Wikidata-bugs] [Maniphest] T282139: Provide a quantitative description of the Wikidata-triples dataset

2021-07-19 Thread JAllemandou
JAllemandou closed this task as "Resolved". JAllemandou added a comment. The analysis is documented here: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Basic_Analysis. Thanks @AKhatun_WMF :) TASK DETAIL https://phabricator.wikimedia.org/T282139 EMAIL PREFERENC

[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries

2021-07-19 Thread JAllemandou
JAllemandou added subscribers: MPhamWMF, Gehel. JAllemandou added a comment. Thanks @AKhatun_WMF for the analysis. @dcausse, @Gehel and @MPhamWMF - Do you think it's worth trying to make our parser able to process queries with the 'mwapi' prefix (it represents 10% of all requests

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-06-24 Thread JAllemandou
JAllemandou added a subtask: T285465: Document and analyze the number of parsing errors for parsed WDQS queries. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc

[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries

2021-06-24 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T285465 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, AKhatun_WMF, JAllemandou, MPhamWMF, CBogen

[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries

2021-06-24 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION We wish, for the month of June 2021: - Report the number of parsing errors when generating parsed queries information - Provide

[Wikidata-bugs] [Maniphest] T283256: Extract operator/nodes/triples/paths/exprs list from queries

2021-05-25 Thread JAllemandou
JAllemandou added a comment. The problem I see with using a generic class in the `QueryElem` object is the conversion to parquet. I don't think it'll work out of the box, which would mean devising our own conversion. Let's brainstorm ideas on this, possibly in a meeting to make it faster

[Wikidata-bugs] [Maniphest] T283258: Provide a job regularly deleting wdqs processed query after 90 days

2021-05-20 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review, Discovery-Search (Current work). Restricted Application removed a project: Patch-For-Review. TASK DESCRIPTION This task is related to T273854 <https://phabricator.wikimedia.

[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction

2021-05-20 Thread JAllemandou
JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: dcausse, Aklapper, JAllemandou, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz

[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction

2021-05-20 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: dcausse, Aklapper, JAllemandou, Invadibot, MPhamWMF

[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction

2021-05-20 Thread JAllemandou
JAllemandou removed JAllemandou as the assignee of this task. JAllemandou added a subscriber: dcausse. JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-20 Thread JAllemandou
JAllemandou added a subtask: T273854: Automate regular WDQS query parsing and data-extraction. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper

[Wikidata-bugs] [Maniphest] T283256: Extract operator/nodes/triples/paths/exprs list from queries

2021-05-20 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review, Discovery-Search (Current work). Restricted Application removed a project: Patch-For-Review. TASK DESCRIPTION Augment query-analysis QueryInfo with a list of operators+nodes+paths

[Wikidata-bugs] [Maniphest] T283255: Create CLI job extracting info from wdqs queries

2021-05-20 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review, Discovery-Search (Current work). Restricted Application removed a project: Patch-For-Review. TASK DESCRIPTION The job should process data hourly. Expected parameters to be passed

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-20 Thread JAllemandou
JAllemandou closed subtask T282129: Test triple-analysis functions over a large dataset with Spark as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-20 Thread JAllemandou
JAllemandou closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, Invadibot

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-20 Thread JAllemandou
JAllemandou added a comment. Closing this task :) Thanks for the great work @AKhatun_WMF TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: CBogen, AKhatun_WMF

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-20 Thread JAllemandou
JAllemandou closed this task as "Resolved". JAllemandou added a comment. Great ! Thanks for that :) Closing the ticket. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, J

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-20 Thread JAllemandou
JAllemandou closed subtask T282130: Provide a way to save extracted query-information in parquet format as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-20 Thread JAllemandou
JAllemandou added a comment. @AKhatun_WMF That's great! could you please provide some info on expected data-size in parquet (for daily data for instance)? Many thanks. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings

[Wikidata-bugs] [Maniphest] T282139: Provide a quantitative description of the Wikidata-triples dataset

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION As a way to get familiar with the data, please provide quantitative information over the dataset using spark in a notebook (probably using

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-06 Thread JAllemandou
JAllemandou added a subtask: T282130: Provide a way to save extracted query-information in parquet format. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-06 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Being able to save the information in Parquet will be very useful as it allows us to automatically process the queries as they flow in (hourly

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-06 Thread JAllemandou
JAllemandou added a subtask: T282129: Test triple-analysis functions over a large dataset with Spark. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-06 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, MPhamWMF

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Once ready locally with unit-tests, apply the triple-analysis method to bigger data in spark (a day). TASK DETAIL https

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-06 Thread JAllemandou
JAllemandou added a subtask: T282127: Add unit-tests to WDQS analysis toolkit. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse

[Wikidata-bugs] [Maniphest] T282127: Add unit-tests to WDQS analysis toolkit

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Extract a set of queries to be used as unit-tests (10 queries) from the events. This should facilitate making sure the code is doing what

[Wikidata-bugs] [Maniphest] T282127: Add unit-tests to WDQS analysis toolkit

2021-05-06 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282127 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF

[Wikidata-bugs] [Maniphest] T281808: Wikidata all-json dumps not available from 2021-04-26

2021-05-04 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Wikidata, Dumps-Generation, Analytics. Restricted Application added a project: wdwb-tech. TASK DESCRIPTION Analytics load wikidata all-json dumps weekly on the hadoop cluster, and we have received an alert for dumps not being available

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-04-20 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION The current analysis parses queries and extracts: - Operators (list, and map

[Wikidata-bugs] [Maniphest] T94019: Generate RDF from JSON

2021-04-19 Thread JAllemandou
JAllemandou added a subscriber: dcausse. JAllemandou added a comment. Info: There is already a job in the cluster doing `TTL -> RDF` conversion. The TTL dumps are imported weekly, and converted to blazegraph RDF once available. The job is maintained by the Search Platform team (p

[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction

2021-02-04 Thread JAllemandou
JAllemandou claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, JAllemandou, MPhamWMF, CBogen, Akuckartz, 4748kitoko, Nandana, Namenlos314, Akovalyov

[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction

2021-02-04 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Analytics, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION This task is about running regular query-parsing jobs for WDQS and storing

[Wikidata-bugs] [Maniphest] T266022: Programmatically categorize WDQS queries by potential alternative solution

2021-02-03 Thread JAllemandou
JAllemandou added a comment. Ah! I realize I have not updated that task. The analysis can be found here: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis @CBogen : I let you handle the definition of done, and whether this task should be closed or not :) TASK DETAIL

[Wikidata-bugs] [Maniphest] T266022: Programmatically categorize WDQS queries by potential alternative solution

2021-01-04 Thread JAllemandou
JAllemandou added a comment. The planned deadline was the end of last month. I ran into various issues that prevented me from meeting it. I have started the actual work today (I gave it thought but didn't code) and wish to present results before the end of the month. TASK DETAIL https

[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)

2020-10-16 Thread JAllemandou
JAllemandou added a comment. Some more info on this aspect: I did a quick analysis of the September queries today and found that my assumption that long queries were made by users from the UI is wrong. First, total numbers of requests and sum of query-time split by queries taking more

[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)

2020-10-06 Thread JAllemandou
JAllemandou added a comment. I continued my analysis today looking at top-100 parsed user-agents from both queries-with-referer subset, and queries-without-referer subset, over the month of September. See https://phabricator.wikimedia.org/P12933 - The queries-with-referer have

[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)

2020-10-02 Thread JAllemandou
JAllemandou added a comment. Heya - I'm sorry I completely missed the ping :S Quick analysis: spark.sql("SELECT (http.request_headers['referer'] IS NOT NULL) as defined_referer, count(1) as c from event.wdqs_external_sparql_query where year = 2020 and month = 9
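The truncated query above groups events by whether a `referer` request header is present. A minimal plain-Python sketch of that grouping idea follows, using hypothetical in-memory events rather than the `event.wdqs_external_sparql_query` table; the sample data and the function name `count_defined_referer` are assumptions made for illustration, not part of the original analysis.

```python
from collections import Counter

# Hypothetical events mimicking the http.request_headers map of the event table.
SAMPLE_EVENTS = [
    {"http": {"request_headers": {"referer": "https://query.wikidata.org/",
                                  "user-agent": "Mozilla/5.0"}}},
    {"http": {"request_headers": {"user-agent": "my-bot/1.0"}}},
    {"http": {"request_headers": {"user-agent": "curl/7.68.0"}}},
]


def count_defined_referer(events):
    """Return counts keyed by True/False: is a referer header present?"""
    return Counter(
        e["http"]["request_headers"].get("referer") is not None for e in events
    )


counts = count_defined_referer(SAMPLE_EVENTS)
print(counts[True], counts[False])
```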

[Wikidata-bugs] [Maniphest] T258269: Add query result to the current WDQS event logging

2020-09-07 Thread JAllemandou
JAllemandou added a comment. In terms of logging size, it probably depends on the result type: in the case of descriptions or other text-heavy fields, this could get big if a high `LIMIT`, or no `LIMIT`, is set on the number of returned rows. We should set a limit :) TASK DETAIL https

[Wikidata-bugs] [Maniphest] T261937: Add CPU load and query concurrency as context to event logging from WDQS

2020-09-07 Thread JAllemandou
JAllemandou added a comment. Will make it a lot easier to analyze than to have to build the 'in-flight' view of queries! TASK DETAIL https://phabricator.wikimedia.org/T261937 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-22 Thread JAllemandou
JAllemandou added a comment. @GoranSMilovanovic I have indeed done some analysis using Apache Jena parser to extract algebraic representation of queries. Not yet to the level of completion I like though. I'll be on holidays until August 15th starting tonight - let's discuss when I come back

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-22 Thread JAllemandou
JAllemandou added a comment. @GoranSMilovanovic I finally published a wiki page with most of the results I found: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Traffic_Analysis Sorry for the delay ... TASK DETAIL https://phabricator.wikimedia.org/T248308 EMAIL PREFERENCES https

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-15 Thread JAllemandou
JAllemandou added a comment. SELECT http.request_headers['user-agent'], user_agent_map, count(1) as c FROM event.wdqs_external_sparql_query WHERE year = 2020 and month = 5 and day = 1 GROUP BY http.request_headers['user-agent

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-15 Thread JAllemandou
JAllemandou added a comment. > First step: analyze the frequency distribution of the user_agent field (string) from wmf.webrequest where queries are SPARQL. I suggest you use events instead of webrequest: `event.wdqs_internal_sparql_query` and `event.wdqs_external_sparql_query`.

[Wikidata-bugs] [Maniphest] [Closed] T249319: Remove wb_terms from sqoop

2020-06-02 Thread JAllemandou
JAllemandou closed this task as "Resolved". JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T249319 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Milimetric, Aklapper, Addshore,

[Wikidata-bugs] [Maniphest] [Commented On] T253753: Increase retention for mediawiki.revision-create on the kafka jumbo cluster

2020-05-27 Thread JAllemandou
JAllemandou added a comment. An idea: How about sending the update stream back to kafka and giving THAT one a higher retention? Moving retention to 30 days for revision-create will keep a lot of data that isn't necessary (about half of the data), while keeping only the updates

[Wikidata-bugs] [Maniphest] [Commented On] T236895: ArticlePlaceholder dashboard stopped tracking page views

2020-03-13 Thread JAllemandou
JAllemandou added a comment. Patch needs to be deployed before the dashboard shows data. TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Ladsgroup, JAllemandou Cc: Milimetric, Ladsgroup, Nuria

[Wikidata-bugs] [Maniphest] [Commented On] T246237: Extract some statistics on the use of the isBlank() function in wdqs query logs

2020-02-27 Thread JAllemandou
JAllemandou added a comment. Events using `isBlank` since the beginning of year are now stored here: `/user/joal/wdqs_queries/2020_use_isBlank/wdqs_use_is_blank_202002.json`. There are ~56k events stored in json format in a single file to facilitate analysis. TASK DETAIL https

[Wikidata-bugs] [Maniphest] [Commented On] T246237: Extract some statistics on the use of the isBlank() function in wdqs query logs

2020-02-26 Thread JAllemandou
JAllemandou added a comment. As I was working on getting a better idea of the queries, I got some results relatively easily: Since beginning of year: - Internal cluster: No request using `isBlank()`, 481202298 requests total - External cluster: 54669 requests using `isBlank

[Wikidata-bugs] [Maniphest] [Retitled] T209655: Copy Wikidata dumps to HDFS + parquet

2020-02-18 Thread JAllemandou
JAllemandou renamed this task from "Copy Wikidata dumps to HDFS" to "Copy Wikidata dumps to HDFS + parquet". TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc:

[Wikidata-bugs] [Maniphest] [Updated] T209655: Copy Wikidata dumps to HDFS

2020-01-28 Thread JAllemandou
JAllemandou added a subtask: T243832: Fix hdfs-rsync `prune-empty-dirs` feature. TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Isaac, Groceryheist, MGerlach, WMDE-leszek, abian

[Wikidata-bugs] [Maniphest] [Claimed] T209655: Copy Wikidata dumps to HDFS

2020-01-28 Thread JAllemandou
JAllemandou claimed this task. JAllemandou added a project: Analytics-Kanban. JAllemandou set the point value for this task to "5". TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAll

[Wikidata-bugs] [Maniphest] [Updated] T236895: ArticlePlaceholder dashboard stopped tracking page views

2020-01-08 Thread JAllemandou
JAllemandou added a project: Analytics-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Ladsgroup, JAllemandou Cc: Ladsgroup, Nuria, JAllemandou, elukey, Addshore, Aklapper, Lydia_Pintscher

[Wikidata-bugs] [Maniphest] [Commented On] T236895: ArticlePlaceholder dashboard stopped tracking page views

2020-01-08 Thread JAllemandou
JAllemandou added a comment. The patch merged by @Nuria had a bug. I commented on the already merged patch with a solution. For the moment the job is not started. TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel

[Wikidata-bugs] [Maniphest] [Commented On] T239898: Investigate triple counts difference between dumps and what blazegraph reports

2019-12-09 Thread JAllemandou
JAllemandou added a comment. Chiming in: I suggest using Spark for investigations - Given the size of the dataset, parallel computation should help. This means another hop for the data: --> stat1004 --> HDFS. Please ping if you want/need help :) TASK DETAIL

[Wikidata-bugs] [Maniphest] [Changed Subscribers] T209655: Copy Wikidata dumps to HDFS

2019-12-04 Thread JAllemandou
JAllemandou added a subscriber: Groceryheist. JAllemandou added a comment. New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items per page. hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1 drwxr-xr-x - analytics

[Wikidata-bugs] [Maniphest] [Updated] T239471: Sqoop wikidata terms tables into hadoop

2019-11-29 Thread JAllemandou
JAllemandou added a project: Analytics-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T239471 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Addshore, JAllemandou Cc: JAllemandou, Addshore, Aklapper, 4748kitoko, Hook696, Daryl-TTMG

[Wikidata-bugs] [Maniphest] [Commented On] T101013: Log Wikidata Query Service queries to the event gate infrastructure

2019-11-27 Thread JAllemandou
JAllemandou added a comment. Does this being closed mean we can access data on kafka? TASK DETAIL https://phabricator.wikimedia.org/T101013 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dcausse, JAllemandou Cc: Igorkim78, JAllemandou, Ottomata

[Wikidata-bugs] [Maniphest] [Updated] T236895: ArticlePlaceholder dashboard stopped tracking page views

2019-10-30 Thread JAllemandou
JAllemandou added a comment. I think this problem could be related to T226730 (preventing most `Special:XXX` pages from being flagged as pageviews). TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFS

2019-10-03 Thread JAllemandou
JAllemandou added a comment. this is done @GoranSMilovanovic. Raw data is here `/user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902` and parquet data is here `/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902` TASK DETAIL https://phabricator.wikimedia.org/T209655

[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFS

2019-06-08 Thread JAllemandou
JAllemandou added a comment. @GoranSMilovanovic: You're welcome :) At some point I'll manage to have that productionized ;) TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc

[Wikidata-bugs] [Maniphest] [Updated] T220977: Investigate surprising rise in mobile page views for wikidata

2019-05-16 Thread JAllemandou
JAllemandou added a comment. A lot trickier :) We have the `wmf_raw.mediawiki_private_cu_changes` table in hive, allowing us to compute geo-editors (editors-by-country, aggregated). This table only contains 3 months of data for PII removal reasons. It's probably not enough for what you're

[Wikidata-bugs] [Maniphest] [Commented On] T220977: Investigate surprising rise in mobile page views for wikidata

2019-05-14 Thread JAllemandou
JAllemandou added a comment. Hi @Lea_WMDE and @GoranSMilovanovic - I think your problem is solved in this month's snapshot by the `revision_tags` field of mediawiki_history: spark.sql(""" SELECT substr(event_timestamp, 0, 4) as year,

[Wikidata-bugs] [Maniphest] [Commented On] T94019: Generate RDF from JSON

2019-04-23 Thread JAllemandou
JAllemandou added a comment. The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization. TASK DETAIL https://phabricator.wikimedia.org/T94019 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-04-23 Thread JAllemandou
JAllemandou added a comment. Community has spoken, we'll find workarounds - Thanks a lot @ArielGlenn for helping drive this :) TASK DETAIL https://phabricator.wikimedia.org/T216160 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc

[Wikidata-bugs] [Maniphest] [Commented On] T218901: Track number of Wikidata edits by namespace

2019-04-08 Thread JAllemandou
JAllemandou added a comment. Some queries are computed using hadoop for wikidata (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent-changes works for you, that's great :) TASK DETAIL https://phabricator.wikimedia.org/T218901 EMAIL PREFERENCES

[Wikidata-bugs] [Maniphest] [Commented On] T218901: Track number of Wikidata edits by namespace

2019-04-04 Thread JAllemandou
JAllemandou added a comment. Reading about this - Would delayed data be interesting? This information is accessible in hadoop :) TASK DETAIL https://phabricator.wikimedia.org/T218901 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Updated] T209655: Copy Wikidata dumps to HDFS

2019-03-26 Thread JAllemandou
JAllemandou added a comment. Most of the complicated things already exist for this to work (equivalent of rsync for HDFS, spark job converting wikidata json dumps to parquet). I wanted T216160 <https://phabricator.wikimedia.org/T216160> to be settled before

[Wikidata-bugs] [Maniphest] [Commented On] T214897: data for analyzing and visualizing the identifier landscape of Wikidata

2019-03-15 Thread JAllemandou
JAllemandou added a comment. Hey @GoranSMilovanovic - I don't have a good understanding of what you're after, but having read about pairs and contingency tables above, maybe this Spark function could be helpful: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-03-14 Thread JAllemandou
JAllemandou added a comment. In T216160#5020236 <https://phabricator.wikimedia.org/T216160#5020236>, @ArielGlenn wrote: > By Friday I'll have done that; by next Wednesday let's make a decision, barring any huge obstacles. Awesome, thanks @ArielGlenn :) TASK DETAI

[Wikidata-bugs] [Maniphest] [Commented On] T217821: Investigate duplication of strings in wb_terms table for wikidatawiki

2019-03-07 Thread JAllemandou
JAllemandou added a comment. Exact analysis ran on 2018-12-06: val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001") val base_rdd = df.select("labels", "descriptions", "aliases").rdd val strings
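
The snippet above is cut off before the analysis itself, so the following is only a rough sketch of what counting duplicated strings over the `labels`, `descriptions` and `aliases` columns could look like. It is plain Python rather than the Spark/Scala job from the comment, and the record layout (maps from language code to string, with aliases as lists) and the names `iter_term_strings` and `duplication_stats` are assumptions made for illustration.

```python
from collections import Counter

def iter_term_strings(entity):
    """Yield every label, description and alias string of one entity.

    Assumed layout (hypothetical, not confirmed by the source):
    labels/descriptions map language -> string, aliases map language -> list.
    """
    yield from entity.get("labels", {}).values()
    yield from entity.get("descriptions", {}).values()
    for alias_list in entity.get("aliases", {}).values():
        yield from alias_list

def duplication_stats(entities):
    """Return (total strings, distinct strings, redundant copies)."""
    counts = Counter(s for e in entities for s in iter_term_strings(e))
    total = sum(counts.values())
    distinct = len(counts)
    return total, distinct, total - distinct

# Tiny made-up sample standing in for rows of the parquet dataset.
sample = [
    {
        "labels": {"en": "Berlin", "de": "Berlin"},
        "descriptions": {"en": "capital of Germany"},
        "aliases": {"en": ["Berlin, Germany"]},
    },
    {"labels": {"en": "Berlin"}, "descriptions": {}, "aliases": {}},
]

total, distinct, redundant = duplication_stats(sample)
```

On the real dataset the same counting would be done in parallel (for example as a group-by over the flattened strings), since a single-process `Counter` would not scale to the full dump.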

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-26 Thread JAllemandou
JAllemandou added a comment. Hi @Isaac, sorry for the issue. I corrected the query above (last query, join criterion: `AND ws.sitelink.title = title_namespace_localized` --> `AND REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized`). We were not joining correctly on ti
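
To illustrate why the corrected join criterion matters, here is a minimal plain-Python sketch. It assumes, as the fix implies, that sitelink titles use spaces while the page-side titles use underscores. The sample rows, page ids and the helper names `normalize_sitelink_title` and `join_sitelinks_to_pages` are hypothetical and not taken from the original query.

```python
def normalize_sitelink_title(title):
    """Same effect as REPLACE(title, ' ', '_') in the corrected criterion."""
    return title.replace(" ", "_")

def join_sitelinks_to_pages(sitelinks, pages, normalize=True):
    """Inner-join (item_id, sitelink title) pairs to a title -> page_id map."""
    joined = []
    for item_id, title in sitelinks:
        key = normalize_sitelink_title(title) if normalize else title
        if key in pages:
            joined.append((item_id, key, pages[key]))
    return joined

# Hypothetical sample data: sitelinks with spaces, page titles with underscores.
sitelinks = [("Q42", "Douglas Adams"), ("Q64", "Berlin")]
pages = {"Douglas_Adams": 1001, "Berlin": 1002}

before_fix = join_sitelinks_to_pages(sitelinks, pages, normalize=False)
after_fix = join_sitelinks_to_pages(sitelinks, pages, normalize=True)
```

Without the normalization only single-word titles match, so every multi-word title is silently dropped from the join.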

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-21 Thread JAllemandou
JAllemandou added a comment. We're on the same page @diego :) I can precompute the table described in ii) if needed, and will surely do it once we have the wikidata-dump productionized - Let me know if you need it before TASK DETAIL https://phabricator.wikimedia.org/T215616 EMAIL

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-20 Thread JAllemandou
JAllemandou added a comment. I can't speak about failures and restarts as I don't know much about the dumps-generation process. @ArielGlenn would be the person to know best. As for the dates, the main reason we ask for the change is for date consistency by month, mimicking what exists for xml

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread JAllemandou
JAllemandou added a comment. Thanks @Isaac for reformulating the question I tried to explain above :) @diego: Can you confirm there is value for you in having revisions tied to wikidata-items regardless of when the link happened? TASK DETAIL https://phabricator.wikimedia.org/T215616 EMAIL

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-18 Thread JAllemandou
JAllemandou added a comment. Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query: spark.sql("SET spark.sql.shuffle.partitions=128") val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/w

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-16 Thread JAllemandou
JAllemandou added a comment. Many thanks @ArielGlenn :) TASK DETAIL https://phabricator.wikimedia.org/T216160 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: hoo, Smalyshev, Addshore, ArielGlenn, JAllemandou, Nandana, Akovalyov, Lahi, Gq86
