dcausse moved this task from In Progress to Needs review on the
Discovery-Search (Current work) board.
dcausse added a comment.
The new approach seems to work.
- Backfill period: `2020-11-06T23:00:01` -> `2020-11-20T13:40:00`
- Dump reconciliation: `2020-11-06T23:00:01` -> `2020-11-12T03:12:51`
The spurious events are almost all related to the dump reconciliation period
(oldest to newest revision in the dumps):
+----+---+---+-------------------+---------+---------------+------+
|y   |m  |d  |inconsistency      |status   |event_type     |count |
+----+---+---+-------------------+---------+---------------+------+
|2020|11 |6  |newer_revision_seen|CREATED  |revision-create|14998 |
|2020|11 |7  |newer_revision_seen|CREATED  |revision-create|380191|
|2020|11 |8  |newer_revision_seen|CREATED  |revision-create|435488|
|2020|11 |9  |newer_revision_seen|CREATED  |revision-create|310343|
|2020|11 |10 |newer_revision_seen|CREATED  |revision-create|180821|
|2020|11 |10 |newer_revision_seen|UNDEFINED|page-delete    |2     |
|2020|11 |11 |newer_revision_seen|CREATED  |revision-create|136742|
|2020|11 |11 |newer_revision_seen|UNDEFINED|page-delete    |18    |
|2020|11 |12 |newer_revision_seen|CREATED  |revision-create|7234  |
|2020|11 |17 |newer_revision_seen|CREATED  |revision-create|1     |
+----+---+---+-------------------+---------+---------------+------+
Note that at the time of exporting this data the pipeline had fully
backfilled and was reading current events (2020-11-20 events).
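The claim that almost all spurious events fall inside the dump reconciliation period can be checked directly against the per-day counts. The following is only a rough sketch: the rows are copied from the table, the function name `split_by_window` is made up for illustration, and the window is approximated to whole days (2020-11-06 through 2020-11-12) because the table has no finer granularity.

```python
from datetime import date

# (day, status, event_type, count) copied from the table above.
ROWS = [
    (date(2020, 11, 6), "CREATED", "revision-create", 14998),
    (date(2020, 11, 7), "CREATED", "revision-create", 380191),
    (date(2020, 11, 8), "CREATED", "revision-create", 435488),
    (date(2020, 11, 9), "CREATED", "revision-create", 310343),
    (date(2020, 11, 10), "CREATED", "revision-create", 180821),
    (date(2020, 11, 10), "UNDEFINED", "page-delete", 2),
    (date(2020, 11, 11), "CREATED", "revision-create", 136742),
    (date(2020, 11, 11), "UNDEFINED", "page-delete", 18),
    (date(2020, 11, 12), "CREATED", "revision-create", 7234),
    (date(2020, 11, 17), "CREATED", "revision-create", 1),
]

# Assumption: the reconciliation window is rounded to whole days.
RECON_START = date(2020, 11, 6)
RECON_END = date(2020, 11, 12)


def split_by_window(rows, start, end):
    """Return (events inside the window, events outside the window)."""
    inside = 0
    outside = 0
    for day, _status, _event_type, count in rows:
        if start <= day <= end:
            inside += count
        else:
            outside += count
    return inside, outside


inside, outside = split_by_window(ROWS, RECON_START, RECON_END)
print(f"inside reconciliation window: {inside}, outside: {outside}")
```

With the day-level approximation, the only event left outside the window is the single one on 2020-11-17.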
The kinds of inconsistencies we see during the reconciliation period:
- page-delete -> newer_revision_seen|UNDEFINED: the item was deleted
during that period, but before being exported to the dump (rare).
- revision-create -> newer_revision_seen|CREATED: the revision-create event
being read refers to a revision that was already exported in the dump
(frequent).
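To make the `newer_revision_seen` label concrete, here is a minimal sketch of the kind of check that could produce it. This is an assumption for illustration, not the pipeline's actual code: the `State` and `Event` classes and the `check_event` function are hypothetical, and the rule used (flag any event whose revision is not newer than the one held in the state, and report the state's status alongside) is inferred from the columns in this comment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class State:
    """Hypothetical per-item state kept by the pipeline."""
    last_revision: int
    status: str  # e.g. "CREATED" or "UNDEFINED"


@dataclass
class Event:
    """Hypothetical incoming event."""
    item: str
    event_type: str  # e.g. "revision-create" or "page-delete"
    revision: int


def check_event(event: Event, state: Optional[State]) -> Optional[Tuple[str, str]]:
    """Return (inconsistency, status) if the event looks stale, else None.

    Assumed rule: an event is stale when the state already holds a
    revision that is equal to or newer than the event's revision.
    """
    if state is not None and event.revision <= state.last_revision:
        return ("newer_revision_seen", state.status)
    return None


# State seeded from the dump already contains the revision of the event.
seeded = State(last_revision=1308001440, status="CREATED")
stale = Event("Q102046169", "revision-create", 1308001440)
print(check_event(stale, seeded))
```

Under this assumed rule, an event for a revision already covered by the dump is reported rather than applied, which is why such reports cluster in the reconciliation period.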
The inconsistency on `2020-11-17` is real, but in line with our expectation
of a couple of inconsistencies per day.
Details:
+----------+--------------------+----------+---------------+-------+----------+
|item      |event_time          |revision  |parent_revision|status |rev       |
+----------+--------------------+----------+---------------+-------+----------+
|Q102046169|2020-11-17T15:09:47Z|1308001440|1308001360     |CREATED|1308001440|
+----------+--------------------+----------+---------------+-------+----------+
This seems to indicate a duplicate event sent by changeprop (a revision-create
for 1308001440 while 1308001440 is already in the state). I'll let the pipeline
run over the weekend.
TASK DETAIL
https://phabricator.wikimedia.org/T267029
WORKBOARD
https://phabricator.wikimedia.org/project/board/1227/