[Wikidata-bugs] [Maniphest] [Commented On] T64874: [Story] Statistics for Wikidata exports

2015-11-17 Thread JAllemandou
JAllemandou added a subscriber: JAllemandou. JAllemandou added a comment. Hi, quick questions on that: Is the need regular, or would one-shots do? Also, what level of aggregation? Is daily good? Below is a Hive request that does daily aggregation over (so I thought) interesting dimension

[Wikidata-bugs] [Maniphest] [Updated] T64874: [Story] Statistics for Special:EntityData usage

2015-11-19 Thread JAllemandou
JAllemandou added a project: Analytics-Backlog. TASK DETAIL https://phabricator.wikimedia.org/T64874 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: JAllemandou, Halfak, hoo, Addshore, Ricordisamoa, Aklapper, drdee, Tnegrin, QChris

[Wikidata-bugs] [Maniphest] [Retitled] T119054: Investigate wikidata pageview sipke on 2015-11-14

2015-11-19 Thread JAllemandou
JAllemandou changed the title from "Remove query.wikidata.org from pageview definition (for wikidata)" to "Investigate wikidata pageview sipke on 2015-11-14". JAllemandou set Security to None. TASK DETAIL https://phabricator.wikimedia.org/T119054 EMAIL

[Wikidata-bugs] [Maniphest] [Commented On] T119054: Fix '.*http.*' not being tagged as spiders in webrequest

2015-11-19 Thread JAllemandou
JAllemandou added a comment. I messed up a deploy about a month ago, preventing the change merged here: https://gerrit.wikimedia.org/r/#/c/244465/ from actually being applied. I will: - bump refinery-core and refinery-hive (> 0.0.19) and update refine oozie job - deploy refinery with these
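To make the task title concrete, here is a minimal sketch of the tagging rule it names: a user agent matching `.*http.*` is treated as a spider. The function name `agent_type`, the `'spider'` / `'user'` return values, and the case-insensitive match are assumptions for illustration; the actual refinery code is not shown in this thread and may differ.

```python
import re

# Pattern taken from the task title. Case-insensitivity is an assumption.
HTTP_IN_UA = re.compile(r".*http.*", re.IGNORECASE)


def agent_type(user_agent):
    """Return 'spider' when the user agent contains 'http', else 'user'.

    Hypothetical helper: names and return values are illustrative only.
    """
    if HTTP_IN_UA.match(user_agent or ""):
        return "spider"
    return "user"
```

A user agent that advertises a contact URL, which is common for bots, would be tagged as a spider under this rule, while a plain browser string would not.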

[Wikidata-bugs] [Maniphest] [Changed Project Column] T119054: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}

2015-11-19 Thread JAllemandou
JAllemandou moved this task to Ready to Deploy on the Analytics-Kanban workboard. TASK DETAIL https://phabricator.wikimedia.org/T119054 WORKBOARD https://phabricator.wikimedia.org/project/board/1030/ EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Retitled] T119054: Fix '.*http.*' not being tagged as spiders in webrequest

2015-11-19 Thread JAllemandou
JAllemandou changed the title from "Investigate wikidata pageview sipke on 2015-11-14" to "Fix '.*http.*' not being tagged as spiders in webrequest". JAllemandou triaged this task as "Unbreak Now!" priority. JAllemandou claimed this task. JAllemandou edited pr

[Wikidata-bugs] [Maniphest] [Changed Project Column] T119054: Fix '.*http.*' not being tagged as spiders in webrequest

2015-11-19 Thread JAllemandou
JAllemandou moved this task to In Progress on the Analytics-Kanban workboard. TASK DETAIL https://phabricator.wikimedia.org/T119054 WORKBOARD https://phabricator.wikimedia.org/project/board/1030/ EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Retitled] T119054: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}

2015-11-19 Thread JAllemandou
JAllemandou changed the title from "Fix '.*http.*' not being tagged as spiders in webrequest" to "Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}". TASK DETAIL https://phabricator.wikimedia.org/T119054 EMAIL PREFERENCES https://phabricator.wi

[Wikidata-bugs] [Maniphest] [Commented On] T119054: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}

2015-11-23 Thread JAllemandou
JAllemandou added a comment. @Addshore: Not feasible since original user_agent is not present in pageview_hourly. TASK DETAIL https://phabricator.wikimedia.org/T119054 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Tbayer

[Wikidata-bugs] [Maniphest] [Commented On] T119054: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}

2015-11-23 Thread JAllemandou
JAllemandou added a comment. Notes are to the dashiki page, but I think you can modify the existing ones if you wish :) TASK DETAIL https://phabricator.wikimedia.org/T119054 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Tbayer

[Wikidata-bugs] [Maniphest] [Commented On] T119054: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}

2015-11-23 Thread JAllemandou
JAllemandou added a comment. Thanks :) TASK DETAIL https://phabricator.wikimedia.org/T119054 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Tbayer, gerritbot, Lydia_Pintscher, Aklapper, Addshore, StudiesWorld, Wikidata-bugs, aude

[Wikidata-bugs] [Maniphest] [Commented On] T119054: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}

2015-11-23 Thread JAllemandou
JAllemandou added a comment. @Addshore: The A are notes (there is a card if you place your mouse over it), and there is a note at deploy when the drop occurs. Is there a necessity to add another? If you think so, notes are created using wiki: https://meta.wikimedia.org/wiki

[Wikidata-bugs] [Maniphest] [Changed Project Column] T119054: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk}

2015-11-19 Thread JAllemandou
JAllemandou moved this task to Done on the Analytics-Kanban workboard. TASK DETAIL https://phabricator.wikimedia.org/T119054 WORKBOARD https://phabricator.wikimedia.org/project/board/1030/ EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Commented On] T135164: Pageview API not reporting spiders correctly

2016-05-12 Thread JAllemandou
JAllemandou added a comment. Results look correct to me with that query: SELECT agent_type, count(1) as count FROM webrequest WHERE year = 2016 AND month = 5 AND day = 10 AND uri_host LIKE "%wikidata.org" AND i

[Wikidata-bugs] [Maniphest] [Commented On] T135164: "egranary digital library system" UA should be listed as a spider

2016-05-13 Thread JAllemandou
JAllemandou added a comment. @Tbayer : I suggested that @Addshore query webrequest for a specific hour for detailed user_agent analysis. For this check @Addshore, I would really have gone for ONE HOUR of data, making the volume of data to work with much smaller (data is partitioned up to hour

[Wikidata-bugs] [Maniphest] [Commented On] T160825: Grafana: "wikidata-api" doesn't update anymore

2017-03-20 Thread JAllemandou
JAllemandou added a comment. Just had a quick look at oozie jobs, and they seem successful. Let's troubleshoot that with @Addshore. TASK DETAIL https://phabricator.wikimedia.org/T160825 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: JAllemandou

[Wikidata-bugs] [Maniphest] [Commented On] T177257: ArticlePlaceholder hit counts from bnwiki seem bogus

2017-10-30 Thread JAllemandou
JAllemandou added a comment. Hi folks, Not a bug for me: SELECT access_method, count(1) from wmf.webrequest WHERE is_pageview AND pageview_info['project'] = 'bn.wikipedia' AND year = 2017 AND month = 9 AND day = 30 AND webrequest_source = 'text' AND x_analytics_map['ns'] = '-1

[Wikidata-bugs] [Maniphest] [Commented On] T143819: Data request for logs from SparQL interface at query.wikidata.org

2018-01-05 Thread JAllemandou
JAllemandou added a comment. @Nuria , @Smalyshev : Given all wikidata-query tagged rows belong in misc, which is super small, I have no objection running jobs either hourly or daily. TASK DETAIL https://phabricator.wikimedia.org/T143819 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings

[Wikidata-bugs] [Maniphest] [Commented On] T201168: [Trailblaze] Use apache mahout item recommender for property suggestions

2018-08-24 Thread JAllemandou
JAllemandou added a comment. Hi @Jonas - A quick comment as per a quick chat with @Addshore on IRC. If you want to implement recommendation based on collaborative filtering for instance, I suggest you go for Spark MLLib (Spark Machine Learning LIBrary). It has all the classical ML algorithms

[Wikidata-bugs] [Maniphest] [Commented On] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2018-09-25 Thread JAllemandou
JAllemandou added a comment. Jobs have been successful for the past months. However rerunning the jobs manually made the data-points appear. This is very bizarre. Let's keep this open and monitor next month. TASK DETAIL https://phabricator.wikimedia.org/T193641 EMAIL PREFERENCES https

[Wikidata-bugs] [Maniphest] [Commented On] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2018-12-20 Thread JAllemandou
JAllemandou added a comment. Same exact problem as last month: job has run, but no data is present :( More investigations needed, probably early next year. TASK DETAIL https://phabricator.wikimedia.org/T193641 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Commented On] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2018-11-30 Thread JAllemandou
JAllemandou added a comment. Info backfilled since beginning of time: https://grafana.wikimedia.org/dashboard/db/wikidata-co-editors?orgId=1=now-8y=now Will keep an eye on next month's run. TASK DETAIL https://phabricator.wikimedia.org/T193641 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings

[Wikidata-bugs] [Maniphest] [Commented On] T208569: Get Wikidata clickstream

2018-11-23 Thread JAllemandou
JAllemandou added a comment. Hi @GoranSMilovanovic, the code we use to generate monthly data is here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala As per the clickstream database in Hive

[Wikidata-bugs] [Maniphest] [Commented On] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2019-01-07 Thread JAllemandou
JAllemandou added a comment. Hi @WMDE-leszek - core data has not been computed yet (usually done around the 9th of the following month). I'll be sure to keep an eye on data showing up for month 12 and rerun the job if needed. TASK DETAIL https://phabricator.wikimedia.org/T193641 EMAIL PREFERENCES https

[Wikidata-bugs] [Maniphest] [Commented On] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2019-01-07 Thread JAllemandou
JAllemandou added a comment. Bug found and corrected (patches above). Data is available now and the rerun problem should be solved. TASK DETAIL https://phabricator.wikimedia.org/T193641 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: WMDE-leszek

[Wikidata-bugs] [Maniphest] [Updated] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2018-11-30 Thread JAllemandou
JAllemandou added a project: Analytics. TASK DETAIL https://phabricator.wikimedia.org/T193641 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Jonas, JAllemandou Cc: WMDE-leszek, Tbayer, Aklapper, GerritBot, JAllemandou, Jonas, RazShuty, Ladsgroup, Addshore

[Wikidata-bugs] [Maniphest] [Commented On] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2018-11-30 Thread JAllemandou
JAllemandou added a comment. Thanks for raising the issue. This is very bizarre. The job for October was showing as successful on our side. I reran it, and data showed up :( I have the feeling this is not the first time this happens, something must be wrong somewhere. I am also going to run

[Wikidata-bugs] [Maniphest] [Updated] T209655: Copy Wikidata dumps to HDFs

2019-03-26 Thread JAllemandou
JAllemandou added a comment. Most of the complicated things already exist for this to work (equivalent of rsync for HDFS, spark job converting wikidata json dumps to parquet). I wanted T216160 <https://phabricator.wikimedia.org/T216160> to be settled before

[Wikidata-bugs] [Maniphest] [Commented On] T218901: Track number of Wikidata edits by namespace

2019-04-04 Thread JAllemandou
JAllemandou added a comment. Reading about this - Would delayed data be interesting? This information is accessible in hadoop :) TASK DETAIL https://phabricator.wikimedia.org/T218901 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Commented On] T218901: Track number of Wikidata edits by namespace

2019-04-08 Thread JAllemandou
JAllemandou added a comment. Some queries are computed using hadoop for wikidata (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent-changes works for you, that's great :) TASK DETAIL https://phabricator.wikimedia.org/T218901 EMAIL PREFERENCES

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-20 Thread JAllemandou
JAllemandou added a comment. I can't speak about failures and restarts as I don't know much about the dumps-generation process. @ArielGlenn would be the person to know best. As for the dates, the main reason we ask for the change is date consistency by month, mimicking what exists for xml

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-21 Thread JAllemandou
JAllemandou added a comment. We're on the same page @diego :) I can precompute the table described in ii) if needed, and will surely do it once we have the wikidata-dump productionized - Let me know if you need it before TASK DETAIL https://phabricator.wikimedia.org/T215616 EMAIL

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-26 Thread JAllemandou
JAllemandou added a comment. Hi @Isaac Sorry for the issue. I corrected the query above (last query, join criteria: `AND ws.sitelink.title = title_namespace_localized` --> `AND REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized`). We were not joining correctly on ti
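The corrected join criterion above can be illustrated outside Spark. The sketch below mirrors `REPLACE(ws.sitelink.title, ' ', '_')` in plain Python and shows why a title containing a space fails to match before the fix. The sample titles and the helper name `normalize_title` are hypothetical, chosen only for illustration.

```python
def normalize_title(sitelink_title):
    """Mirror REPLACE(title, ' ', '_') from the corrected join criterion."""
    return sitelink_title.replace(" ", "_")


# Hypothetical sample rows, for illustration only.
sitelink_titles = ["Douglas Adams", "Paris"]   # sitelink titles, with spaces
page_titles = {"Douglas_Adams", "Paris"}       # page titles, with underscores

# Original criterion: exact equality, so titles containing spaces are dropped.
matched_before = [t for t in sitelink_titles if t in page_titles]

# Corrected criterion: normalize spaces to underscores before comparing.
matched_after = [t for t in sitelink_titles if normalize_title(t) in page_titles]
```

With these sample rows, only single-word titles match under the original criterion, while both rows match once spaces are replaced.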

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-03-14 Thread JAllemandou
JAllemandou added a comment. In T216160#5020236 <https://phabricator.wikimedia.org/T216160#5020236>, @ArielGlenn wrote: > By Friday I'll have done that; by next Wednesday let's make a decision, barring any huge obstacles. Awesome, thanks @ArielGlenn :) TASK DETAI

[Wikidata-bugs] [Maniphest] [Commented On] T217821: Investigate duplication of strings in wb_terms table for wikidatawiki

2019-03-07 Thread JAllemandou
JAllemandou added a comment. Exact analysis ran on 2018-12-06: val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001") val base_rdd = df.select("labels", "descriptions", "aliases").rdd val strings

[Wikidata-bugs] [Maniphest] [Commented On] T214897: data for analyzing and visualizing the identifier landscape of Wikidata

2019-03-15 Thread JAllemandou
JAllemandou added a comment. Hey @GoranSMilovanovic - I don't have a good understanding of what you're after, but having read pairs and contingency table above, maybe this Spark function could be helpful: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql

[Wikidata-bugs] [Maniphest] [Commented On] T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time

2019-02-07 Thread JAllemandou
JAllemandou added a comment. I confirm the fix :) Closing this task. TASK DETAIL https://phabricator.wikimedia.org/T193641 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: WMDE-leszek, Tbayer, Aklapper, GerritBot, JAllemandou, Jonas, RazShuty

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-15 Thread JAllemandou
JAllemandou added a comment. Works for me :) I assume the system would work similarly to the existing XML dumps, meaning that dumps would be generated in the same date folder (1st, 8th, 15th, 22nd of every month for instance), one after the other, providing information on availability in a json

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-15 Thread JAllemandou
JAllemandou added a comment. @ArielGlenn : Could we decide on regular day-in-month patterns for the various entity-dumps that need to be generated? Here is my suggestion:
Entities | Formats | Current frequency | New suggested frequency
all | json / nt / ttl | Every Monday | 1st, 8th, 15th, 22nd of every month
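To see why a fixed day-of-month pattern gives more consistent dates than a weekday pattern, here is a small sketch comparing the two. The function names are hypothetical, and the fixed days are the 1st, 8th, 15th and 22nd from the suggestion above; this is only an illustration, not the dumps scheduler.

```python
import calendar
from datetime import date

# Days of the month proposed in the suggestion above.
FIXED_DAYS = (1, 8, 15, 22)


def fixed_day_dumps(year, month):
    """Dump dates under the proposed fixed day-of-month pattern."""
    return [date(year, month, d) for d in FIXED_DAYS]


def monday_dumps(year, month):
    """Dump dates under the current 'every Monday' pattern."""
    last_day = calendar.monthrange(year, month)[1]
    days = (date(year, month, d) for d in range(1, last_day + 1))
    return [d for d in days if d.weekday() == 0]  # weekday() == 0 is Monday
```

The fixed pattern always yields the same four dates per month, whereas the Monday pattern shifts from month to month and yields either four or five dumps (for example, March 2019 has four Mondays and April 2019 has five).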

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-16 Thread JAllemandou
JAllemandou added a comment. Many thanks @ArielGlenn :) TASK DETAIL https://phabricator.wikimedia.org/T216160 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: hoo, Smalyshev, Addshore, ArielGlenn, JAllemandou, Nandana, Akovalyov, Lahi, Gq86

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-18 Thread JAllemandou
JAllemandou added a comment. Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query: spark.sql("SET spark.sql.shuffle.partitions=128") val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/w

[Wikidata-bugs] [Maniphest] [Updated] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-14 Thread JAllemandou
JAllemandou added a project: Analytics. TASK DETAIL https://phabricator.wikimedia.org/T216160 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: hoo, Smalyshev, Addshore, ArielGlenn, JAllemandou, Nandana, Akovalyov, Lahi, Gq86, GoranSMilovanovic

[Wikidata-bugs] [Maniphest] [Created] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-02-14 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Wikidata, Dumps-Generation. TASK DESCRIPTION Currently wikidata-entities dumps are generated on a fixed weekday basis (Monday every two weeks for instance). It would be easier for some use-cases to get a fixed day-of-month basis (1st day

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread JAllemandou
JAllemandou added a comment. @diego : This has worked for me (takes some time to compute and needs a bunch of resources). I hope it's close enough to what you want :) : spark.sql("SET spark.sql.shuffle.partitions=512") val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/w

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread JAllemandou
JAllemandou added a comment. Thanks @Isaac for reformulating the question I tried to explain above :) @diego: Can you confirm there is value for you in having revisions tied to wikidata-items regardless of when the link happened? TASK DETAIL https://phabricator.wikimedia.org/T215616 EMAIL

[Wikidata-bugs] [Maniphest] [Commented On] T214897: data for analyzing and visualizing the identifier landscape of Wikidata

2019-02-04 Thread JAllemandou
JAllemandou added a comment. Hi folks - Sorry for the late answer, I was at WMF all-hands last week and did not check tasks. I have started work on having the wikidata-json dumps imported on the cluster, and while some data is available for ad-hoc analysis (see hdfs:///user/joal/wmf/data/wmf

[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFs

2019-06-08 Thread JAllemandou
JAllemandou added a comment. @GoranSMilovanovic : You're welcome :) At some point I'll manage to have that productionized ;) TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc

[Wikidata-bugs] [Maniphest] [Updated] T220977: Investigate surprising rise in mobile page views for wikidata

2019-05-16 Thread JAllemandou
JAllemandou added a comment. A lot trickier :) We have the `wmf_raw.mediawiki_private_cu_changes` table in hive, allowing us to compute geo-editors (editors-by-country, aggregated). This table only contains 3 months of data for PII removal reasons. It's probably not enough for what you're

[Wikidata-bugs] [Maniphest] [Commented On] T220977: Investigate surprising rise in mobile page views for wikidata

2019-05-14 Thread JAllemandou
JAllemandou added a comment. Hi @Lea_WMDE and @GoranSMilovanovic - I think your problem is solved in this month's snapshot with the `revision_tags` field of mediawiki_history: spark.sql(""" SELECT substr(event_timestamp, 0, 4) as year,

[Wikidata-bugs] [Maniphest] [Commented On] T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday

2019-04-23 Thread JAllemandou
JAllemandou added a comment. Community has spoken, we'll find workarounds - Thanks a lot @ArielGlenn for helping drive this :) TASK DETAIL https://phabricator.wikimedia.org/T216160 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc

[Wikidata-bugs] [Maniphest] [Commented On] T94019: Generate RDF from JSON

2019-04-23 Thread JAllemandou
JAllemandou added a comment. The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization. TASK DETAIL https://phabricator.wikimedia.org/T94019 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Updated] T236895: ArticlePlaceholder dashboard stopped tracking page views

2019-10-30 Thread JAllemandou
JAllemandou added a comment. I think this problem could be related to T226730 (preventing most `Special:XXX` pages from being flagged as pageviews). TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] [Changed Subscribers] T209655: Copy Wikidata dumps to HDFS

2019-12-04 Thread JAllemandou
JAllemandou added a subscriber: Groceryheist. JAllemandou added a comment. New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items per page. hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1 drwxr-xr-x - analytics

[Wikidata-bugs] [Maniphest] [Commented On] T239898: Investigate triple counts difference between dumps and what blazegraph reports

2019-12-09 Thread JAllemandou
JAllemandou added a comment. Chiming in: I suggest using Spark for investigations - Given the size of the dataset, parallel computation should help. This means another hop for the data: --> stat1004 --> HDFS. Please ping if you want/need help :) TASK DETAIL

[Wikidata-bugs] [Maniphest] [Commented On] T101013: Log Wikidata Query Service queries to the event gate infrastructure

2019-11-27 Thread JAllemandou
JAllemandou added a comment. Does this being closed mean we can access data on kafka? TASK DETAIL https://phabricator.wikimedia.org/T101013 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dcausse, JAllemandou Cc: Igorkim78, JAllemandou, Ottomata

[Wikidata-bugs] [Maniphest] [Updated] T239471: Sqoop wikidata terms tables into hadoop

2019-11-29 Thread JAllemandou
JAllemandou added a project: Analytics-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T239471 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Addshore, JAllemandou Cc: JAllemandou, Addshore, Aklapper, 4748kitoko, Hook696, Daryl-TTMG

[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFs

2019-10-03 Thread JAllemandou
JAllemandou added a comment. this is done @GoranSMilovanovic. Raw data is here `/user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902` and parquet data is here `/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902` TASK DETAIL https://phabricator.wikimedia.org/T209655

[Wikidata-bugs] [Maniphest] [Commented On] T246237: Extract some statistics on the use of the isBlank() function in wdqs query logs

2020-02-26 Thread JAllemandou
JAllemandou added a comment. As I was working on getting a better idea of the queries, I got some results relatively easily: Since beginning of year: - Internal cluster: No request using `isBlank()`, 481202298 requests total - External cluster: 54669 requests using `isBlank

[Wikidata-bugs] [Maniphest] [Commented On] T246237: Extract some statistics on the use of the isBlank() function in wdqs query logs

2020-02-27 Thread JAllemandou
JAllemandou added a comment. Events using `isBlank` since the beginning of year are now stored here: `/user/joal/wdqs_queries/2020_use_isBlank/wdqs_use_is_blank_202002.json`. There are ~56k events stored in json format in a single file to facilitate analysis. TASK DETAIL https

[Wikidata-bugs] [Maniphest] [Updated] T209655: Copy Wikidata dumps to HDFS

2020-01-28 Thread JAllemandou
JAllemandou added a subtask: T243832: Fix hdfs-rsync `prune-empty-dirs` feature. TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Isaac, Groceryheist, MGerlach, WMDE-leszek, abian

[Wikidata-bugs] [Maniphest] [Claimed] T209655: Copy Wikidata dumps to HDFS

2020-01-28 Thread JAllemandou
JAllemandou claimed this task. JAllemandou added a project: Analytics-Kanban. JAllemandou set the point value for this task to "5". TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAll

[Wikidata-bugs] [Maniphest] [Retitled] T209655: Copy Wikidata dumps to HDFS + parquet

2020-02-18 Thread JAllemandou
JAllemandou renamed this task from "Copy Wikidata dumps to HDFS" to "Copy Wikidata dumps to HDFS + parquet". TASK DETAIL https://phabricator.wikimedia.org/T209655 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc:

[Wikidata-bugs] [Maniphest] [Updated] T236895: ArticlePlaceholder dashboard stopped tracking page views

2020-01-08 Thread JAllemandou
JAllemandou added a project: Analytics-Kanban. TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Ladsgroup, JAllemandou Cc: Ladsgroup, Nuria, JAllemandou, elukey, Addshore, Aklapper, Lydia_Pintscher

[Wikidata-bugs] [Maniphest] [Commented On] T236895: ArticlePlaceholder dashboard stopped tracking page views

2020-01-08 Thread JAllemandou
JAllemandou added a comment. The patch merged by @Nuria had a bug. I commented on the already merged patch with a solution. For the moment the job is not started. TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel

[Wikidata-bugs] [Maniphest] [Commented On] T236895: ArticlePlaceholder dashboard stopped tracking page views

2020-03-13 Thread JAllemandou
JAllemandou added a comment. Patch needs to be deployed before the dashboard shows data. TASK DETAIL https://phabricator.wikimedia.org/T236895 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Ladsgroup, JAllemandou Cc: Milimetric, Ladsgroup, Nuria

[Wikidata-bugs] [Maniphest] T261937: Add CPU load and query concurrency as context to event logging from WDQS

2020-09-07 Thread JAllemandou
JAllemandou added a comment. Will make it a lot easier to analyze than to have to build the 'in-flight' view of queries! TASK DETAIL https://phabricator.wikimedia.org/T261937 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc

[Wikidata-bugs] [Maniphest] T258269: Add query result to the current WDQS event logging

2020-09-07 Thread JAllemandou
JAllemandou added a comment. In terms of logging size, it probably depends on the result type: in case of descriptions or other text-heavy fields, this could get bigger if a high or no `LIMIT` is set on the number of returned rows. We should set a limit :) TASK DETAIL https

[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)

2020-10-06 Thread JAllemandou
JAllemandou added a comment. I continued my analysis today looking at top-100 parsed user-agents from both queries-with-referer subset, and queries-without-referer subset, over the month of September. See https://phabricator.wikimedia.org/P12933 - The queries-with-referer have

[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)

2020-10-16 Thread JAllemandou
JAllemandou added a comment. Some more info on this aspect: I have done a quick analysis over September queries today and found that my assumption that long queries were made by users from UI is wrong. First, total numbers of requests and sum of query-time split by queries taking more

[Wikidata-bugs] [Maniphest] T261841: Tag WDQS query log with the source of the query (UI vs direct access)

2020-10-02 Thread JAllemandou
JAllemandou added a comment. Heya - I'm sorry I completely missed the ping :S Quick analysis: spark.sql("SELECT (http.request_headers['referer'] IS NOT NULL) as defined_referer, count(1) as c from event.wdqs_external_sparql_query where year = 2020 and month = 9

[Wikidata-bugs] [Maniphest] [Commented On] T253753: Increase retention for mediawiki.revision-create on the kafka jumbo cluster

2020-05-27 Thread JAllemandou
JAllemandou added a comment. An idea: How about sending the update stream back to kafka and making THAT one's retention higher? Moving retention to 30 days for revision-create will make a lot of data stay that wouldn't be necessary (about half of the data), while keeping only the updates

[Wikidata-bugs] [Maniphest] [Closed] T249319: Remove wb_terms from sqoop

2020-06-02 Thread JAllemandou
JAllemandou closed this task as "Resolved". JAllemandou updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T249319 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Milimetric, Aklapper, Addshore,

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-22 Thread JAllemandou
JAllemandou added a comment. @GoranSMilovanovic I finally published a wiki page with most of the results I found: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Traffic_Analysis Sorry for the delay ... TASK DETAIL https://phabricator.wikimedia.org/T248308 EMAIL PREFERENCES https

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-22 Thread JAllemandou
JAllemandou added a comment. @GoranSMilovanovic I have indeed done some analysis using Apache Jena parser to extract algebraic representation of queries. Not yet to the level of completion I like though. I'll be on holidays until August 15th starting tonight - let's discuss when I come back

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-15 Thread JAllemandou
JAllemandou added a comment. > First step: analyze the frequency distribution of the user_agent field (string) from wmf.webrequest where queries are SPARQL. I suggest you use events instead of webrequest: `event.wdqs_internal_sparql_query` and `event.wdqs_external_sparql_query`.

[Wikidata-bugs] [Maniphest] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-07-15 Thread JAllemandou
JAllemandou added a comment. SELECT http.request_headers['user-agent'], user_agent_map, count(1) as c FROM event.wdqs_external_sparql_query WHERE year = 2020 and month = 5 and day = 1 GROUP BY http.request_headers['user-agent
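
The grouping idea behind the query above (count events per user agent) can be illustrated outside Hive with a minimal Python sketch. Everything here is an assumption made for illustration: the sample rows, the `user_agent` key, and the function name `user_agent_frequencies` are hypothetical and do not come from the thread, and the most-frequent-first ordering is an addition, since the original query is truncated before any ordering clause.

```python
from collections import Counter

# Hypothetical sample rows; real rows would come from the
# event.wdqs_external_sparql_query table mentioned in the thread.
sample_events = [
    {"user_agent": "example-bot/1.0"},
    {"user_agent": "Mozilla/5.0"},
    {"user_agent": "example-bot/1.0"},
    {"user_agent": "example-bot/1.0"},
    {"user_agent": "Mozilla/5.0"},
    {"user_agent": "curl/7.64.0"},
]


def user_agent_frequencies(events):
    """Count events per user agent, most frequent first.

    This mirrors a GROUP BY user agent with count(1); the sort order
    is an assumption added for readability.
    """
    counts = Counter(event["user_agent"] for event in events)
    return counts.most_common()


if __name__ == "__main__":
    for agent, count in user_agent_frequencies(sample_events):
        print(f"{count}\t{agent}")
```

On real data the same aggregation would stay in Hive or Spark; the sketch only shows the shape of the result, one (user agent, count) pair per distinct agent.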

[Wikidata-bugs] [Maniphest] T266022: Programmatically categorize WDQS queries by potential alternative solution

2021-01-04 Thread JAllemandou
JAllemandou added a comment. The planned deadline was the end of last month. I ran into various issues that prevented me from meeting it. I have started the actual work today (I had given it thought but hadn't coded anything) and hope to present results before the end of the month. TASK DETAIL https

[Wikidata-bugs] [Maniphest] T266022: Programmatically categorize WDQS queries by potential alternative solution

2021-02-03 Thread JAllemandou
JAllemandou added a comment. Ah! I realize I have not updated this task. The analysis can be found here: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis @CBogen: I'll let you handle the definition of done, and whether this task should be closed or not :) TASK DETAIL

[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction

2021-02-04 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Analytics, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION This task is about running regular query-parsing jobs for WDQS and storing

[Wikidata-bugs] [Maniphest] T273854: Automate regular WDQS query parsing and data-extraction

2021-02-04 Thread JAllemandou
JAllemandou claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T273854 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, JAllemandou, MPhamWMF, CBogen, Akuckartz, 4748kitoko, Nandana, Namenlos314, Akovalyov

[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries

2021-06-24 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION For the month of June 2021, we wish to: - Report the number of parsing errors when generating parsed queries information - Provide

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-06-24 Thread JAllemandou
JAllemandou added a subtask: T285465: Document and analyze the number of parsing errors for parsed WDQS queries. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc

[Wikidata-bugs] [Maniphest] T285465: Document and analyze the number of parsing errors for parsed WDQS queries

2021-06-24 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T285465 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, AKhatun_WMF, JAllemandou, MPhamWMF, CBogen

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-06 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, MPhamWMF

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Once ready locally with unit-tests, apply the triple-analysis method to bigger data in Spark (a day). TASK DETAIL https

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-06 Thread JAllemandou
JAllemandou added a subtask: T282129: Test triple-analysis functions over a large dataset with Spark. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF

[Wikidata-bugs] [Maniphest] T282139: Provide a quantitative description of the Wikidata-triples dataset

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION As a way to get familiar with the data, please provide quantitative information over the dataset using spark in a notebook (probably using

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-06 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Being able to save the information in Parquet will be very useful, as it allows the queries to be processed automatically as they flow in (hourly

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-06 Thread JAllemandou
JAllemandou added a subtask: T282130: Provide a way to save extracted query-information in parquet format. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF

[Wikidata-bugs] [Maniphest] T282127: Add unit-tests to WDQS analysis toolkit

2021-05-06 Thread JAllemandou
JAllemandou created this task. JAllemandou added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Extract a set of queries to be used as unit-tests (10 queries) from the events. This should facilitate making sure the code is doing what

[Wikidata-bugs] [Maniphest] T282127: Add unit-tests to WDQS analysis toolkit

2021-05-06 Thread JAllemandou
JAllemandou added a parent task: T280640: Refine WDQS queries analysis. TASK DETAIL https://phabricator.wikimedia.org/T282127 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: JAllemandou Cc: Aklapper, CBogen, AKhatun_WMF, JAllemandou, MPhamWMF

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-06 Thread JAllemandou
JAllemandou added a subtask: T282127: Add unit-tests to WDQS analysis toolkit. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: AKhatun_WMF, Aklapper, CBogen, dcausse

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-20 Thread JAllemandou
JAllemandou added a comment. @AKhatun_WMF That's great! Could you please provide some info on the expected data size in Parquet (for daily data, for instance)? Many thanks. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-20 Thread JAllemandou
JAllemandou closed subtask T282130: Provide a way to save extracted query-information in parquet format as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc

[Wikidata-bugs] [Maniphest] T282130: Provide a way to save extracted query-information in parquet format

2021-05-20 Thread JAllemandou
JAllemandou closed this task as "Resolved". JAllemandou added a comment. Great ! Thanks for that :) Closing the ticket. TASK DETAIL https://phabricator.wikimedia.org/T282130 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, J

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-20 Thread JAllemandou
JAllemandou closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: CBogen, AKhatun_WMF, Aklapper, JAllemandou, Invadibot

[Wikidata-bugs] [Maniphest] T282129: Test triple-analysis functions over a large dataset with Spark

2021-05-20 Thread JAllemandou
JAllemandou added a comment. Closing this task :) Thanks for the great work @AKhatun_WMF TASK DETAIL https://phabricator.wikimedia.org/T282129 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc: CBogen, AKhatun_WMF

[Wikidata-bugs] [Maniphest] T280640: Refine WDQS queries analysis

2021-05-20 Thread JAllemandou
JAllemandou closed subtask T282129: Test triple-analysis functions over a large dataset with Spark as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T280640 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: AKhatun_WMF, JAllemandou Cc

[Wikidata-bugs] [Maniphest] T283255: Create CLI job extracting info from wdqs queries

2021-05-20 Thread JAllemandou
JAllemandou created this task. JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review, Discovery-Search (Current work). Restricted Application removed a project: Patch-For-Review. TASK DESCRIPTION The job should process data hourly. Expected parameters to be passed
