dcausse moved this task from In Progress to Needs review on the 
Discovery-Search (Current work) board.
dcausse added a comment.


  The new approach seems to work.
  
  - Backfill period: `2020-11-06T23:00:01` -> `2020-11-20T13:40:00`
  - Dump reconciliation: `2020-11-06T23:00:01` -> `2020-11-12T03:12:51`
  
  The spurious events are almost all related to the dump reconciliation period 
(oldest to newest revision in the dumps):
  
    +----+---+---+-------------------+---------+---------------+------+
    |y   |m  |d  |inconsistency      |status   |event_type     |count |
    +----+---+---+-------------------+---------+---------------+------+
    |2020|11 |6  |newer_revision_seen|CREATED  |revision-create|14998 |
    |2020|11 |7  |newer_revision_seen|CREATED  |revision-create|380191|
    |2020|11 |8  |newer_revision_seen|CREATED  |revision-create|435488|
    |2020|11 |9  |newer_revision_seen|CREATED  |revision-create|310343|
    |2020|11 |10 |newer_revision_seen|CREATED  |revision-create|180821|
    |2020|11 |10 |newer_revision_seen|UNDEFINED|page-delete    |2     |
    |2020|11 |11 |newer_revision_seen|CREATED  |revision-create|136742|
    |2020|11 |11 |newer_revision_seen|UNDEFINED|page-delete    |18    |
    |2020|11 |12 |newer_revision_seen|CREATED  |revision-create|7234  |
    |2020|11 |17 |newer_revision_seen|CREATED  |revision-create|1     |
    +----+---+---+-------------------+---------+---------------+------+
  
  Note that at the time of exporting this data the pipeline had fully 
backfilled and was reading current events (2020-11-20 events).
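
  As a rough cross-check of the table above, the sketch below sums the counts by whether the day falls inside the dump reconciliation period. It assumes day-level granularity (the window is truncated to whole days, so all of `2020-11-12` counts as inside), and `split_by_window` is a hypothetical helper name, not part of the pipeline.

```python
from datetime import date

# Dump reconciliation window from the comment, truncated to whole days
# (assumption: the table is only available at day granularity).
RECONCILIATION_START = date(2020, 11, 6)
RECONCILIATION_END = date(2020, 11, 12)

# (year, month, day, status, event_type, count), copied from the table.
ROWS = [
    (2020, 11, 6, "CREATED", "revision-create", 14998),
    (2020, 11, 7, "CREATED", "revision-create", 380191),
    (2020, 11, 8, "CREATED", "revision-create", 435488),
    (2020, 11, 9, "CREATED", "revision-create", 310343),
    (2020, 11, 10, "CREATED", "revision-create", 180821),
    (2020, 11, 10, "UNDEFINED", "page-delete", 2),
    (2020, 11, 11, "CREATED", "revision-create", 136742),
    (2020, 11, 11, "UNDEFINED", "page-delete", 18),
    (2020, 11, 12, "CREATED", "revision-create", 7234),
    (2020, 11, 17, "CREATED", "revision-create", 1),
]


def split_by_window(rows):
    """Return (count inside the reconciliation window, count outside it)."""
    inside = outside = 0
    for y, m, d, _status, _event_type, count in rows:
        if RECONCILIATION_START <= date(y, m, d) <= RECONCILIATION_END:
            inside += count
        else:
            outside += count
    return inside, outside


inside, outside = split_by_window(ROWS)
print(inside, outside)  # 1465837 1
```

  Under that assumption, every flagged event except the one on `2020-11-17` falls on a day covered by the reconciliation period.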
  
  The kinds of inconsistencies we see during the reconciliation period:
  
  - page-delete -> newer_revision_seen|UNDEFINED: the item was deleted during 
that period, before it could be exported to the dump (rare).
  - revision-create -> newer_revision_seen|CREATED: the revision-create event 
being read refers to a revision that was already exported in the dump (frequent).
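
  The frequent revision-create case can be illustrated with a minimal sketch. It assumes the state is a plain item -> latest revision mapping seeded from the dump, and that an event is flagged when the state already holds the same or a later revision. `check_revision_create`, the item ID and the revision numbers are all hypothetical; the real pipeline logic may differ.

```python
from typing import Dict, Optional


def check_revision_create(state: Dict[str, int], item: str, revision: int) -> Optional[str]:
    """Flag a revision-create event that is not newer than the known state.

    Returns "newer_revision_seen" when the state already holds this revision
    or a later one; otherwise records the revision and returns None.
    """
    known = state.get(item)
    if known is not None and known >= revision:
        return "newer_revision_seen"
    state[item] = revision
    return None


# Hypothetical state as seeded from a dump: the item was exported at revision 120.
state = {"Q1": 120}

# Replaying an event older than the dump: the revision is already covered.
print(check_revision_create(state, "Q1", 110))  # newer_revision_seen

# An event newer than the dump is accepted and advances the state.
print(check_revision_create(state, "Q1", 130))  # None
```

  Because the backfill starts at the oldest revision in the dumps, most events replayed during that window describe revisions the dump already contains, which would explain why this case dominates.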
  
  The inconsistency on `2020-11-17` is real but in line with our expectations 
of a couple of inconsistencies per day.
  Details are:
  
    
    +----------+--------------------+----------+---------------+-------+----------+
    |item      |event_time          |revision  |parent_revision|status |rev       |
    +----------+--------------------+----------+---------------+-------+----------+
    |Q102046169|2020-11-17T15:09:47Z|1308001440|1308001360     |CREATED|1308001440|
    +----------+--------------------+----------+---------------+-------+----------+
  
  This seems to indicate a duplicate event sent by changeprop (a revision-create 
for 1308001440 while 1308001440 is already in the state). I'll let the pipeline 
run over the weekend.
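
  The reasoning for this row can be sketched as two checks: the event falls after the dump reconciliation period, and its revision equals the one already in the state. The sketch assumes the reconciliation end timestamp is in UTC (the comment gives no timezone) and that `rev` is the revision held in the state; `describe` is a hypothetical helper.

```python
from datetime import datetime, timezone

# End of the dump reconciliation period from the comment (assumed to be UTC).
RECONCILIATION_END = datetime(2020, 11, 12, 3, 12, 51, tzinfo=timezone.utc)


def describe(event_time: str, event_revision: int, state_revision: int):
    """Return (event is after the reconciliation period, kind of event)."""
    parsed = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    after_reconciliation = parsed > RECONCILIATION_END
    if event_revision == state_revision:
        kind = "duplicate"  # same revision delivered again
    elif event_revision < state_revision:
        kind = "stale"      # older than what the state already holds
    else:
        kind = "new"
    return after_reconciliation, kind


# Values from the details table.
print(describe("2020-11-17T15:09:47Z", 1308001440, 1308001440))  # (True, 'duplicate')
```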

TASK DETAIL
  https://phabricator.wikimedia.org/T267029

WORKBOARD
  https://phabricator.wikimedia.org/project/board/1227/

To: dcausse
Cc: dcausse, Aklapper, Alter-paule, Beast1978, CBogen, Un1tY, Akuckartz, 
Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, 
Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, 
_jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331