dcausse created this task.
dcausse added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
Restricted Application added a project: Wikidata.
TASK DESCRIPTION
As a maintainer of the WDQS updater pipeline I want to tune the Flink
application to discard very few events because of lateness so that
divergences remain rare and limited.
While trying to tune the pipeline to properly handle a backfill, the idleness
was reduced to 2 seconds instead of the 1 minute that was initially tested.
This allowed the pipeline to keep running, but at the cost of late events:
import org.apache.spark.sql.functions._
import spark.implicits._  // provides the $"col" syntax (already in scope in spark-shell)

// Late events recorded by the pipeline (glob matches 2020-10 to 2020-12)
val df_late =
  spark.read.parquet("/wmf/discovery/streaming_updater/late_events/2020-1*-*")

df_late.filter("ingestion_time > '2020-10-13T06:46:05Z'")
  .withColumn("ingestion_time_ts", to_timestamp(col("ingestion_time")))
  // Count late events per hour of ingestion
  .groupBy(
    year(col("ingestion_time_ts")) as "y",
    month(col("ingestion_time_ts")) as "m",
    dayofmonth(col("ingestion_time_ts")) as "d",
    hour(col("ingestion_time_ts")) as "h")
  .count()
  .orderBy("y", "m", "d", "h")
  // Rebuild an ISO-8601 hour label from the grouping columns
  .select(concat($"y", lit("-"), lpad($"m", 2, "0"), lit("-"),
      lpad($"d", 2, "0"), lit("T"), lpad($"h", 2, "0"), lit(":00:00Z")) as "time",
    $"count")
  .show(100, false)
+--------------------+-------+
|time                |count  |
+--------------------+-------+
|2020-10-13T06:00:00Z|4868946|
|2020-10-13T10:00:00Z|123980 |
|2020-10-13T11:00:00Z|1069057|
|2020-10-18T11:00:00Z|540427 |
|2020-10-24T12:00:00Z|2      |
|2020-10-28T02:00:00Z|1      |
|2020-10-28T09:00:00Z|2      |
|2020-10-28T21:00:00Z|2      |
|2020-10-29T12:00:00Z|2      |
|2020-10-29T14:00:00Z|1      |
|2020-10-29T22:00:00Z|3      |
|2020-10-30T16:00:00Z|2501   |
|2020-10-31T00:00:00Z|3      |
|2020-11-01T13:00:00Z|5      |
|2020-11-01T14:00:00Z|6      |
+--------------------+-------+
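The trade-off between a short idleness and late events can be illustrated with a toy model. The sketch below is not the updater's code and does not use Flink; it only assumes the usual watermark behaviour: each partition contributes "highest event time seen minus an out-of-orderness bound", the global watermark is the minimum over partitions that are not idle, and an event is counted as late when its event time is below the watermark at arrival. The names `IdlenessSketch`, `Event` and `countLate`, the 5 second out-of-orderness bound and the 30 second stall are all made up for the illustration.

```scala
// Toy model of watermark idleness; all names and numbers are hypothetical.
object IdlenessSketch {
  // partition: source partition; arrival: processing time; eventTime: time carried
  // by the event. All times are in seconds.
  final case class Event(partition: String, arrival: Long, eventTime: Long)

  def countLate(events: Seq[Event], idleness: Long, outOfOrderness: Long): Int = {
    val partitions = events.map(_.partition).distinct
    // Assumption: every partition was last seen at time 0 carrying event time 0.
    var lastSeen = partitions.map(_ -> 0L).toMap
    var maxEventTime = partitions.map(_ -> 0L).toMap
    var watermark = Long.MinValue
    var late = 0
    for (e <- events.sortBy(_.arrival)) {
      // Partitions silent for longer than `idleness` stop holding the watermark back.
      val active = partitions.filter(p => e.arrival - lastSeen(p) <= idleness)
      if (active.nonEmpty) {
        val candidate = active.map(p => maxEventTime(p) - outOfOrderness).min
        watermark = math.max(watermark, candidate) // watermarks never go backwards
      }
      if (e.eventTime < watermark) late += 1
      lastSeen = lastSeen.updated(e.partition, e.arrival)
      maxEventTime =
        maxEventTime.updated(e.partition, math.max(maxEventTime(e.partition), e.eventTime))
    }
    late
  }

  // Partition A delivers one event per second without delay.
  val steady: Seq[Event] = (1L to 40L).map(t => Event("A", t, t))
  // Partition B stalls after 5 seconds and delivers its 30 second backlog at t = 36.
  val stalled: Seq[Event] =
    (1L to 5L).map(t => Event("B", t, t)) ++ (6L to 35L).map(t => Event("B", 36L, t))
  val events: Seq[Event] = steady ++ stalled
}
```

In this model, `IdlenessSketch.countLate(IdlenessSketch.events, idleness = 2L, outOfOrderness = 5L)` reports late events because partition B is declared idle during its stall and the watermark moves on without it, whereas an idleness of 60 seconds keeps the watermark pinned to B and nothing is dropped. The price of the longer idleness is that downstream processing waits for the slowest partition.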
While in general (running for more than 17 days) there are almost no late
events, there are a few cases where we see a huge spike:
- Oct 13 from 6am to 11am: this is the bulk of the late events and corresponds
to the backfill period.
- Oct 18 at 11am: this was during a weekend; the pipeline seems to have failed
on Oct 17 at 1am and was restarted 34 hours later (probably the same kind of
problem related to backfill).
- Oct 30 at 4pm: the reason is unclear, but the output topic ceased to receive
events from the pipeline for several minutes. Latencies recorded during this
period are as follows:
+--------------------+-----+--------------+-----------+
|time                |count|FLOOR(latency)|max_latency|
+--------------------+-----+--------------+-----------+
|2020-10-30T16:25:00Z|594  |160           |204        |
|2020-10-30T16:29:00Z|380  |175           |213        |
|2020-10-30T16:32:00Z|730  |120           |167        |
|2020-10-30T16:39:00Z|124  |90            |245        |
|2020-10-30T16:42:00Z|393  |173           |189        |
|2020-10-30T16:43:00Z|280  |114           |159        |
+--------------------+-----+--------------+-----------+
AC:
- determine proper settings that allow a backfill and normal operations
without dropping events because of lateness (we should tolerate a max of 10
late events per day)
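
The acceptance criterion is stated per day while the counts reported earlier are per hour, so a small helper can make the check explicit. The following is only a sketch in plain Scala (no Spark); `LateEventBudget`, `dailyTotals` and `daysOverBudget` are hypothetical names, and the input is the hourly table from this task copied by hand.

```scala
// Hypothetical helper, not part of the updater code base.
object LateEventBudget {
  // Hourly late-event counts copied from the table in this task (hour -> count).
  val hourlyCounts: Seq[(String, Long)] = Seq(
    "2020-10-13T06:00:00Z" -> 4868946L,
    "2020-10-13T10:00:00Z" -> 123980L,
    "2020-10-13T11:00:00Z" -> 1069057L,
    "2020-10-18T11:00:00Z" -> 540427L,
    "2020-10-24T12:00:00Z" -> 2L,
    "2020-10-28T02:00:00Z" -> 1L,
    "2020-10-28T09:00:00Z" -> 2L,
    "2020-10-28T21:00:00Z" -> 2L,
    "2020-10-29T12:00:00Z" -> 2L,
    "2020-10-29T14:00:00Z" -> 1L,
    "2020-10-29T22:00:00Z" -> 3L,
    "2020-10-30T16:00:00Z" -> 2501L,
    "2020-10-31T00:00:00Z" -> 3L,
    "2020-11-01T13:00:00Z" -> 5L,
    "2020-11-01T14:00:00Z" -> 6L
  )

  // Sum the hourly counts per calendar day (first 10 characters of the timestamp).
  def dailyTotals(hourly: Seq[(String, Long)]): Map[String, Long] =
    hourly.groupBy(_._1.take(10)).map { case (day, rows) => day -> rows.map(_._2).sum }

  // Days whose total exceeds the tolerated budget, in chronological order.
  def daysOverBudget(hourly: Seq[(String, Long)], maxPerDay: Long = 10L): Seq[(String, Long)] =
    dailyTotals(hourly).filter(_._2 > maxPerDay).toSeq.sortBy(_._1)
}
```

Calling `LateEventBudget.daysOverBudget(LateEventBudget.hourlyCounts)` lists the days that would violate the 10 events per day budget. Besides the three spikes, Nov 1 (5 + 6 = 11 late events) also ends up just over the limit.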
TASK DETAIL
https://phabricator.wikimedia.org/T267029
To: dcausse
Cc: dcausse, Aklapper, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86,
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst,
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll,
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331