JAllemandou added a subscriber: JAllemandou.
JAllemandou added a comment.
Hi, quick questions on that:
Is the need recurring, or would a one-shot be enough?
Also, what level of aggregation? Is daily good?
Below is a Hive query that computes daily aggregation over (what I thought were)
interesting dimensions.
JAllemandou added a project: Analytics-Backlog.
TASK DETAIL
https://phabricator.wikimedia.org/T64874
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: JAllemandou
Cc: JAllemandou, Halfak, hoo, Addshore, Ricordisamoa, Aklapper, drdee, Tnegrin,
QChris
JAllemandou changed the title from "Remove query.wikidata.org from pageview
definition (for wikidata)" to "Investigate wikidata pageview spike on
2015-11-14".
JAllemandou set Security to None.
TASK DETAIL
https://phabricator.wikimedia.org/T119054
JAllemandou added a comment.
I messed up a deploy about a month ago, which prevented the change merged here
(https://gerrit.wikimedia.org/r/#/c/244465/) from actually being applied.
I will:
- bump refinery-core and refinery-hive (> 0.0.19) and update refine oozie job
- deploy refinery with these
JAllemandou moved this task to Ready to Deploy on the Analytics-Kanban
workboard.
TASK DETAIL
https://phabricator.wikimedia.org/T119054
WORKBOARD
https://phabricator.wikimedia.org/project/board/1030/
JAllemandou changed the title from "Investigate wikidata pageview spike on
2015-11-14" to "Fix '.*http.*' not being tagged as spiders in webrequest".
JAllemandou triaged this task as "Unbreak Now!" priority.
JAllemandou claimed this task.
JAllemandou moved this task to In Progress on the Analytics-Kanban workboard.
TASK DETAIL
https://phabricator.wikimedia.org/T119054
JAllemandou changed the title from "Fix '.*http.*' not being tagged as spiders
in webrequest" to "Fix '.*http.*' not being tagged as spiders in webrequest [5
pts] {hawk}".
TASK DETAIL
https://phabricator.wikimedia.org/T119054
JAllemandou added a comment.
@Addshore: Not feasible, since the original user_agent is not present in
pageview_hourly.
TASK DETAIL
https://phabricator.wikimedia.org/T119054
JAllemandou added a comment.
Notes are tied to the dashiki page, but I think you can modify the existing ones
if you wish :)
TASK DETAIL
https://phabricator.wikimedia.org/T119054
JAllemandou added a comment.
Thanks :)
TASK DETAIL
https://phabricator.wikimedia.org/T119054
JAllemandou added a comment.
@Addshore: The 'A's are notes (a card shows up if you hover your mouse over
them), and there is a note at the deploy when the drop occurs.
Is it necessary to add another? If you think so, notes are created using the
wiki: https://meta.wikimedia.org/wiki
JAllemandou moved this task to Done on the Analytics-Kanban workboard.
TASK DETAIL
https://phabricator.wikimedia.org/T119054
JAllemandou added a comment.
Results look correct to me with that query:
SELECT
  agent_type,
  COUNT(1) AS count
FROM webrequest
WHERE year = 2016
  AND month = 5
  AND day = 10
  AND uri_host LIKE '%wikidata.org'
  AND i
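For context, a completed query of this shape might look as follows; the original is truncated after `AND i`, so the final filter and the GROUP BY below are my assumptions, not the author's original:

```sql
SELECT
  agent_type,
  COUNT(1) AS count
FROM webrequest
WHERE year = 2016
  AND month = 5
  AND day = 10
  AND uri_host LIKE '%wikidata.org'
  AND is_pageview       -- assumed: the original text breaks off at "AND i"
GROUP BY agent_type
```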
JAllemandou added a comment.
@Tbayer: I suggested that @Addshore query webrequest for a specific hour for
detailed user_agent analysis.
For this check @Addshore, I would really have gone for ONE HOUR of data,
making the volume of data to work with much smaller (data is partitioned down to
the hour
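As a sketch of that hour-restricted approach (the table and partition columns match those used elsewhere in this thread; the specific hour, host filter, and LIMIT are illustrative assumptions):

```sql
-- Restricting the query to a single hour partition keeps the scan small,
-- since webrequest is partitioned by year/month/day/hour.
SELECT
  user_agent,
  COUNT(1) AS c
FROM wmf.webrequest
WHERE year = 2016 AND month = 5 AND day = 10 AND hour = 14  -- one hour only
  AND uri_host LIKE '%wikidata.org'
GROUP BY user_agent
ORDER BY c DESC
LIMIT 100
```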
JAllemandou added a comment.
Just had a quick look at the oozie jobs, and they seem successful.
Let's troubleshoot that with @Addshore.
TASK DETAIL
https://phabricator.wikimedia.org/T160825
JAllemandou added a comment.
Hi folks,
Not a bug for me:
SELECT access_method, count(1)
FROM wmf.webrequest
WHERE is_pageview
  AND pageview_info['project'] = 'bn.wikipedia'
  AND year = 2017 AND month = 9 AND day = 30
  AND webrequest_source = 'text'
  AND x_analytics_map['ns'] = '-1
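For what it's worth, a hypothetical completed form of that query (the closing quote, alias, and GROUP BY are my additions; the original text is truncated):

```sql
SELECT
  access_method,
  COUNT(1) AS c
FROM wmf.webrequest
WHERE is_pageview
  AND pageview_info['project'] = 'bn.wikipedia'
  AND year = 2017 AND month = 9 AND day = 30
  AND webrequest_source = 'text'
  AND x_analytics_map['ns'] = '-1'   -- closing quote assumed
GROUP BY access_method
```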
JAllemandou added a comment.
@Nuria, @Smalyshev: Given that all wikidata-query tagged rows belong in misc, which is super small, I have no objection to running jobs either hourly or daily.
TASK DETAIL
https://phabricator.wikimedia.org/T143819
JAllemandou added a comment.
Hi @Jonas - A quick comment as per a quick chat with @Addshore on IRC. If you want to implement recommendations based on collaborative filtering, for instance, I suggest you go for Spark MLlib (Spark Machine Learning LIBrary). It has all the classical ML algorithms
JAllemandou added a comment.
Jobs have been successful for the past months. However rerunning the jobs manually made the data-points appear. This is very bizarre.
Let's keep this open and monitor next month.
TASK DETAIL
https://phabricator.wikimedia.org/T193641
JAllemandou added a comment.
Same exact problem as last month: the job has run, but no data is present :(
More investigation needed, probably early next year.
TASK DETAIL
https://phabricator.wikimedia.org/T193641
JAllemandou added a comment.
Info backfilled since the beginning of time: https://grafana.wikimedia.org/dashboard/db/wikidata-co-editors?orgId=1=now-8y=now
Will keep an eye on next month's run.
TASK DETAIL
https://phabricator.wikimedia.org/T193641
JAllemandou added a comment.
Hi @GoranSMilovanovic, the code we use to generate monthly data is here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
As per the clickstream database in Hive
JAllemandou added a comment.
Hi @WMDE-leszek - core data has not been computed yet (usually done around the 9th of the following month).
I'll be sure to keep an eye on data showing up for month 12 and rerun the job if needed.
TASK DETAIL
https://phabricator.wikimedia.org/T193641
JAllemandou added a comment.
Bug found and corrected (patches above).
Data is available now and the rerun problem should be solved.
TASK DETAIL
https://phabricator.wikimedia.org/T193641
JAllemandou added a project: Analytics.
TASK DETAIL
https://phabricator.wikimedia.org/T193641
JAllemandou added a comment.
Thanks for raising the issue. This is very bizarre.
The job for October was showing as successful on our side. I reran it, and data showed up :(
I have the feeling this is not the first time this has happened; something must be wrong somewhere.
I'm also going to run
JAllemandou added a comment.
Most of the complicated things needed for this to work already exist (an
equivalent of rsync for HDFS, a Spark job converting wikidata json dumps to
parquet).
I wanted T216160 <https://phabricator.wikimedia.org/T216160> to be
settled before
JAllemandou added a comment.
Reading about this - Would delayed data be interesting? This information is
accessible in hadoop :)
TASK DETAIL
https://phabricator.wikimedia.org/T218901
JAllemandou added a comment.
Some queries are computed using hadoop for wikidata (see
https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If
SQL over recent-changes works for you, that's great :)
TASK DETAIL
https://phabricator.wikimedia.org/T218901
JAllemandou added a comment.
I can't speak to failures and restarts, as I don't know much about the dumps-generation process. @ArielGlenn would be the person who knows best.
As for the dates, the main reason we ask for the change is date consistency by month, mimicking what exists for xml
JAllemandou added a comment.
We're on the same page @diego :)
I can precompute the table described in ii) if needed, and will surely do it
once we have the wikidata-dump productionized - let me know if you need it
before
TASK DETAIL
https://phabricator.wikimedia.org/T215616
JAllemandou added a comment.
Hi @Isaac
Sorry for the issue. I corrected the query above (last query, join criterion:
`AND ws.sitelink.title = title_namespace_localized` --> `AND
REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized`).
We were not joining correctly on ti
JAllemandou added a comment.
In T216160#5020236 <https://phabricator.wikimedia.org/T216160#5020236>,
@ArielGlenn wrote:
> By Friday I'll have done that; by next Wednesday let's make a decision,
barring any huge obstacles.
Awesome, thanks @ArielGlenn :)
JAllemandou added a comment.
Exact analysis ran on 2018-12-06:
val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001")
val base_rdd = df.select("labels", "descriptions", "aliases").rdd
val strings
JAllemandou added a comment.
Hey @GoranSMilovanovic - I don't have a good understanding of what you're
after, but having read pairs and contingency table above, maybe this Spark
function could be helpful:
https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql
JAllemandou added a comment.
I confirm the fix :) Closing this task.
TASK DETAIL
https://phabricator.wikimedia.org/T193641
JAllemandou added a comment.
Works for me :) I assume the system would work similarly to the existing XML dumps, meaning that dumps would be generated in the same date folder (1st, 8th, 15th, 22nd of every month for instance), one after the other, providing information on availability in a json
JAllemandou added a comment.
@ArielGlenn : Could we decide on regular day-in-month patterns for the various entity-dumps that need to be generated?
Here is my suggestion:
Entities | Formats         | Current frequency | New suggested frequency
all      | json / nt / ttl | Every Monday      | 1st, 8th, 15th, 22nd of every month
JAllemandou added a comment.
Many thanks @ArielGlenn :)
TASK DETAIL
https://phabricator.wikimedia.org/T216160
JAllemandou added a comment.
Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query:
spark.sql("SET spark.sql.shuffle.partitions=128")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/w
JAllemandou added a project: Analytics.
TASK DETAIL
https://phabricator.wikimedia.org/T216160
JAllemandou created this task.
JAllemandou added projects: Wikidata, Dumps-Generation.
TASK DESCRIPTION
Currently wikidata-entities dumps are generated on a fixed weekday basis (Monday every two weeks, for instance). It would be easier for some use-cases to get a fixed day-of-month basis (1st day
JAllemandou added a comment.
@diego :
This has worked for me (it takes some time to compute and needs a bunch of resources). I hope it's close enough to what you want :)
spark.sql("SET spark.sql.shuffle.partitions=512")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/w
JAllemandou added a comment.
Thanks @Isaac for reformulating the question I tried to explain above :)
@diego: Can you confirm there is value for you in having revisions tied to wikidata-items regardless of when the link happened?
TASK DETAIL
https://phabricator.wikimedia.org/T215616
JAllemandou added a comment.
Hi folks - Sorry for the late answer, I was at the WMF all-hands last week and did not check tasks.
I have started work on having the wikidata-json dumps imported to the cluster, and while some data is available for ad-hoc analysis (see hdfs:///user/joal/wmf/data/wmf
JAllemandou added a comment.
@GoranSMilovanovic: You're welcome :) At some point I'll manage to have that
productionized ;)
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou added a comment.
A lot trickier :)
We have the `wmf_raw.mediawiki_private_cu_changes` table in hive, allowing us
to compute geo-editors (editors-by-country, aggregated). This table only
contains 3 months of data for PII-removal reasons. It's probably not enough for
what you're
JAllemandou added a comment.
Hi @Lea_WMDE and @GoranSMilovanovic - I think the answer to your problem is
available in this month's snapshot, with the `revision_tags` field of
mediawiki_history:
spark.sql("""
SELECT
substr(event_timestamp, 0, 4) as year,
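The snippet above is cut off; a hypothetical completion of a query of that shape could look like the following (the snapshot value, tag filter, and grouping are assumptions for illustration, not the original query):

```sql
SELECT
  SUBSTR(event_timestamp, 0, 4) AS year,
  COUNT(1) AS revisions
FROM wmf.mediawiki_history
WHERE snapshot = '2019-01'                         -- assumed snapshot partition
  AND event_entity = 'revision'
  AND ARRAY_CONTAINS(revision_tags, 'example-tag') -- hypothetical tag value
GROUP BY SUBSTR(event_timestamp, 0, 4)
```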
JAllemandou added a comment.
Community has spoken, we'll find workarounds - thanks a lot @ArielGlenn for
helping drive this :)
TASK DETAIL
https://phabricator.wikimedia.org/T216160
JAllemandou added a comment.
The analytics hadoop cluster could also be of use here: the task can easily
take advantage of parallelization.
TASK DETAIL
https://phabricator.wikimedia.org/T94019
JAllemandou added a comment.
I think this problem could be related to T226730 (preventing most
`Special:XXX` pages from being flagged as pageviews).
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a subscriber: Groceryheist.
JAllemandou added a comment.
New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also
generated the items per page.
hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1
drwxr-xr-x - analytics
JAllemandou added a comment.
Chiming in: I suggest using Spark for the investigation - given the size of the
dataset, parallel computation should help. This means another hop for the data:
--> stat1004 --> HDFS. Please ping me if you want/need help :)
JAllemandou added a comment.
Does this being closed mean we can access data on kafka?
TASK DETAIL
https://phabricator.wikimedia.org/T101013
JAllemandou added a project: Analytics-Kanban.
TASK DETAIL
https://phabricator.wikimedia.org/T239471
JAllemandou added a comment.
This is done @GoranSMilovanovic.
Raw data is here
`/user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902` and parquet
data is here `/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902`
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou added a comment.
As I was working on getting a better idea of the queries, I got some results
relatively easily.
Since the beginning of the year:
- Internal cluster: no requests using `isBlank()`, 481202298 requests total
- External cluster: 54669 requests using `isBlank
JAllemandou added a comment.
Events using `isBlank` since the beginning of the year are now stored here:
`/user/joal/wdqs_queries/2020_use_isBlank/wdqs_use_is_blank_202002.json`.
There are ~56k events stored in json format in a single file to facilitate
analysis.
JAllemandou added a subtask: T243832: Fix hdfs-rsync `prune-empty-dirs` feature.
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou claimed this task.
JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to "5".
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou renamed this task from "Copy Wikidata dumps to HDFS" to "Copy
Wikidata dumps to HDFS + parquet".
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou added a project: Analytics-Kanban.
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a comment.
The patch merged by @Nuria had a bug. I commented on the already-merged patch
with a solution. For the moment the job is not started.
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a comment.
Patch needs to be deployed before the dashboard shows data.
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a comment.
This will make it a lot easier to analyze than having to build the 'in-flight'
view of queries!
TASK DETAIL
https://phabricator.wikimedia.org/T261937
JAllemandou added a comment.
In terms of logging size, it probably depends on the result type: for
descriptions or other text-heavy fields, this could get big if a high `LIMIT`,
or no `LIMIT` at all, is set on the number of returned rows. We should set a
limit :)
JAllemandou added a comment.
I continued my analysis today, looking at the top-100 parsed user-agents from
both the queries-with-referer subset and the queries-without-referer subset,
over the month of September.
See https://phabricator.wikimedia.org/P12933
- The queries-with-referer have
JAllemandou added a comment.
Some more info on this aspect: I did a quick analysis over September queries
today and found that my assumption that long queries were made by users from
the UI is wrong.
First, total numbers of requests and sum of query-time, split by queries taking
more
JAllemandou added a comment.
Heya - I'm sorry, I completely missed the ping :S
Quick analysis:
spark.sql("SELECT (http.request_headers['referer'] IS NOT NULL) as defined_referer, count(1) as c from event.wdqs_external_sparql_query where year = 2020 and month = 9
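A hypothetical completion of that truncated statement, written as plain SQL (the GROUP BY and closing of the query are my assumptions):

```sql
SELECT
  (http.request_headers['referer'] IS NOT NULL) AS defined_referer,
  COUNT(1) AS c
FROM event.wdqs_external_sparql_query
WHERE year = 2020 AND month = 9
GROUP BY (http.request_headers['referer'] IS NOT NULL)
```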
JAllemandou added a comment.
An idea: how about sending the update stream back to kafka and raising the
retention on THAT topic?
Moving retention to 30 days for revision-create would keep a lot of data that
wouldn't be necessary (about half of the data), while keeping only the
updates
JAllemandou closed this task as "Resolved".
JAllemandou updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T249319
JAllemandou added a comment.
@GoranSMilovanovic I finally published a wiki page with most of the results I
found: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Traffic_Analysis
Sorry for the delay ...
TASK DETAIL
https://phabricator.wikimedia.org/T248308
JAllemandou added a comment.
@GoranSMilovanovic I have indeed done some analysis using the Apache Jena
parser to extract the algebraic representation of queries. It's not yet at the
level of completion I'd like, though. I'll be on holidays from tonight until
August 15th - let's discuss when I come back
JAllemandou added a comment.
> First step: analyze the frequency distribution of the user_agent field
(string) from wmf.webrequest where queries are SPARQL.
I suggest you use events instead of webrequest:
`event.wdqs_internal_sparql_query` and `event.wdqs_external_sparql_query`.
JAllemandou added a comment.
SELECT
http.request_headers['user-agent'],
user_agent_map,
count(1) as c
FROM event.wdqs_external_sparql_query
WHERE year = 2020 and month = 5 and day = 1
GROUP BY
http.request_headers['user-agent
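The query above is truncated mid-GROUP BY; a hypothetical completed form follows (the closing bracket, the second grouping column, and the ORDER BY are my additions):

```sql
SELECT
  http.request_headers['user-agent'],
  user_agent_map,
  COUNT(1) AS c
FROM event.wdqs_external_sparql_query
WHERE year = 2020 AND month = 5 AND day = 1
GROUP BY
  http.request_headers['user-agent'],
  user_agent_map
ORDER BY c DESC
```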
JAllemandou added a comment.
The planned deadline was the end of last month. I ran into various issues
preventing me from achieving it. I have started the actual work today (I had
given it thought but hadn't coded) and wish to present results before the end
of the month.
JAllemandou added a comment.
Ah! I realize I have not updated this task. The analysis can be found here:
https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis
@CBogen: I'll let you handle the definition of done, and whether this task
should be closed or not :)
JAllemandou created this task.
JAllemandou added projects: Analytics, Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
Restricted Application added a project: Wikidata.
TASK DESCRIPTION
This task is about running regular query-parsing jobs for WDQS and storing
JAllemandou claimed this task.
TASK DETAIL
https://phabricator.wikimedia.org/T273854
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
We wish, for the month of June 2021:
- Report the number of parsing errors when generating parsed queries
information
- Provide
JAllemandou added a subtask: T285465: Document and analyze the number of
parsing errors for parsed WDQS queries.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T285465
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T282129
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
Once ready locally with unit-tests, apply the triple-analysis method to
bigger data in Spark (a day's worth).
JAllemandou added a subtask: T282129: Test triple-analysis functions over a
large dataset with Spark.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
As a way to get familiar with the data, please provide quantitative
information over the dataset using spark in a notebook (probably using
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T282130
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
Being able to save the information in Parquet will be very useful, as it
allows us to automatically process the queries as they flow in (hourly
JAllemandou added a subtask: T282130: Provide a way to save extracted
query-information in parquet format.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
Extract a set of queries to be used as unit-tests (10 queries) from the
events.
This should facilitate making sure the code is doing what
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T282127
JAllemandou added a subtask: T282127: Add unit-tests to WDQS analysis toolkit.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou added a comment.
@AKhatun_WMF That's great! Could you please provide some info on the expected
data size in parquet (for daily data, for instance)? Many thanks.
TASK DETAIL
https://phabricator.wikimedia.org/T282130
JAllemandou closed subtask T282130: Provide a way to save extracted
query-information in parquet format as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou closed this task as "Resolved".
JAllemandou added a comment.
Great! Thanks for that :) Closing the ticket.
TASK DETAIL
https://phabricator.wikimedia.org/T282130
JAllemandou closed this task as "Resolved".
TASK DETAIL
https://phabricator.wikimedia.org/T282129
JAllemandou added a comment.
Closing this task :) Thanks for the great work, @AKhatun_WMF.
TASK DETAIL
https://phabricator.wikimedia.org/T282129
JAllemandou closed subtask T282129: Test triple-analysis functions over a large
dataset with Spark as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou created this task.
JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review,
Discovery-Search (Current work).
Restricted Application removed a project: Patch-For-Review.
TASK DESCRIPTION
The job should process data hourly.
Expected parameters to be passed