dcausse created this task.
dcausse added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
Restricted Application added a project: Wikidata.

TASK DESCRIPTION
  As a maintainer of the wdqs updater pipeline I want to tune the flink 
application to discard very few events because of lateness so that 
divergences remain rare and limited.
  
  While trying to tune the pipeline to properly handle a backfill, the idleness 
was reduced to 2 seconds instead of the 1 minute that was initially tested. This 
allowed the pipeline to keep running, but at the cost of late events:
  
    import org.apache.spark.sql.functions._
    // Needed for the $"col" syntax outside spark-shell, where it is pre-imported.
    import spark.implicits._
    
    // Late events recorded by the streaming updater.
    val df_late = spark.read.parquet("/wmf/discovery/streaming_updater/late_events/2020-1*-*")
    
    // Count late events per hour of ingestion after 2020-10-13T06:46:05Z.
    df_late.filter("ingestion_time > '2020-10-13T06:46:05Z'")
        .withColumn("ingestion_time_ts", to_timestamp(col("ingestion_time")))
        .groupBy(
            year(col("ingestion_time_ts")) as "y",
            month(col("ingestion_time_ts")) as "m",
            dayofmonth(col("ingestion_time_ts")) as "d",
            hour(col("ingestion_time_ts")) as "h")
        .count()
        .orderBy("y", "m", "d", "h")
        // Rebuild an ISO-8601 hour label from the grouping columns.
        .select(concat($"y", lit("-"), lpad($"m", 2, "0"), lit("-"), lpad($"d", 2, "0"),
                lit("T"), lpad($"h", 2, "0"), lit(":00:00Z")) as "time", $"count")
        .show(100, false)
  
    +--------------------+-------+
    |time                |count  |
    +--------------------+-------+
    |2020-10-13T06:00:00Z|4868946|
    |2020-10-13T10:00:00Z|123980 |
    |2020-10-13T11:00:00Z|1069057|
    |2020-10-18T11:00:00Z|540427 |
    |2020-10-24T12:00:00Z|2      |
    |2020-10-28T02:00:00Z|1      |
    |2020-10-28T09:00:00Z|2      |
    |2020-10-28T21:00:00Z|2      |
    |2020-10-29T12:00:00Z|2      |
    |2020-10-29T14:00:00Z|1      |
    |2020-10-29T22:00:00Z|3      |
    |2020-10-30T16:00:00Z|2501   |
    |2020-10-31T00:00:00Z|3      |
    |2020-11-01T13:00:00Z|5      |
    |2020-11-01T14:00:00Z|6      |
    +--------------------+-------+
  
  While in general (running for more than 17 days) there are almost no late 
events, there are a few cases where we see a huge spike:
  
  - oct 13 from 6am to 11am: this is the bulk of the late events and corresponds 
to the backfill period
  - oct 18 11am: this was during a weekend; the pipeline seems to have failed on oct 
17 1am and was restarted 34 hours later (probably the same kind of problem 
related to backfill)
  - oct 30 4pm: the reason is unclear but the output topic ceased to receive 
events from the pipeline for several minutes; latencies recorded during this 
period were as follows:
  
    +--------------------+-----+--------------+-----------+
    |time                |count|FLOOR(latency)|max_latency|
    +--------------------+-----+--------------+-----------+
    |2020-10-30T16:25:00Z|594  |160           |204        |
    |2020-10-30T16:29:00Z|380  |175           |213        |
    |2020-10-30T16:32:00Z|730  |120           |167        |
    |2020-10-30T16:39:00Z|124  |90            |245        |
    |2020-10-30T16:42:00Z|393  |173           |189        |
    |2020-10-30T16:43:00Z|280  |114           |159        |
    +--------------------+-----+--------------+-----------+
  
  AC:
  
  - determine proper settings that allow a backfill and normal operations 
without dropping events because of lateness (we should tolerate a max of 10 
late events per day)
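  
  As a rough way to check this criterion against the numbers above, the hourly 
counts can be summed per day and compared with the tolerated maximum. This is 
only a sketch in plain Scala (no Spark needed): the counts are copied from the 
query output above, and the names hourlyLateEvents, maxLatePerDay, 
dailyLateEvents and daysOverBudget are made up for this illustration.

```scala
// Hourly late-event counts copied from the query output in this task.
val hourlyLateEvents: Seq[(String, Long)] = Seq(
  "2020-10-13T06:00:00Z" -> 4868946L,
  "2020-10-13T10:00:00Z" -> 123980L,
  "2020-10-13T11:00:00Z" -> 1069057L,
  "2020-10-18T11:00:00Z" -> 540427L,
  "2020-10-24T12:00:00Z" -> 2L,
  "2020-10-28T02:00:00Z" -> 1L,
  "2020-10-28T09:00:00Z" -> 2L,
  "2020-10-28T21:00:00Z" -> 2L,
  "2020-10-29T12:00:00Z" -> 2L,
  "2020-10-29T14:00:00Z" -> 1L,
  "2020-10-29T22:00:00Z" -> 3L,
  "2020-10-30T16:00:00Z" -> 2501L,
  "2020-10-31T00:00:00Z" -> 3L,
  "2020-11-01T13:00:00Z" -> 5L,
  "2020-11-01T14:00:00Z" -> 6L
)

// Tolerated number of late events per day, taken from the AC.
val maxLatePerDay = 10L

// Sum the hourly counts per calendar day (first 10 characters of the label).
val dailyLateEvents: Seq[(String, Long)] =
  hourlyLateEvents
    .groupBy { case (time, _) => time.take(10) }
    .map { case (day, rows) => day -> rows.map(_._2).sum }
    .toSeq
    .sortBy(_._1)

// Days that exceed the tolerated number of late events.
val daysOverBudget: Seq[(String, Long)] =
  dailyLateEvents.filter { case (_, count) => count > maxLatePerDay }

daysOverBudget.foreach { case (day, count) =>
  println(s"$day: $count late events (max $maxLatePerDay)")
}
```

  With these numbers the days over the limit are oct 13, oct 18 and oct 30, 
plus nov 1, which ends up just above the limit with 11 late events.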

TASK DETAIL
  https://phabricator.wikimedia.org/T267029

To: dcausse
Cc: dcausse, Aklapper, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331