[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)

2022-04-11 Thread Gehel
Gehel closed this task as "Resolved".
Gehel claimed this task.

TASK DETAIL
  https://phabricator.wikimedia.org/T301147

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Gehel
Cc: elukey, akosiaris, Gehel, RKemper, bking, toan, Addshore, JMeybohm, 
Michael, Aklapper, dcausse, Astuthiodit_1, karapayneWMDE, Invadibot, MPhamWMF, 
maantietaja, CBogen, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org



2022-04-11 Thread Gehel
Gehel closed subtask T305068: Alert when flink does not have the number of 
expected task managers as Resolved.
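
The alerting subtask closed above corresponds to watching the task-slot metric quoted in this task's description. A minimal sketch of such a Prometheus alert rule; the group/alert names, duration, and severity are assumptions, not the rule actually deployed:

```yaml
# Hypothetical sketch of an alert in the spirit of T305068. The expected
# slot count (24) comes from the task description; everything else here
# (group/alert names, "for" duration, severity label) is assumed.
groups:
  - name: rdf-streaming-updater
    rules:
      - alert: FlinkTaskSlotsBelowExpected
        expr: flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"} < 24
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Flink session cluster reports fewer task slots than expected"
```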




2022-03-31 Thread JMeybohm
JMeybohm added a comment.


  In T301147#7821813, @dcausse wrote:
  
  > The additional PODs won't be used, as a flink job does not automatically
scale, so they would be a pure waste of resources (2.5G of reserved mem per
additional POD). It would help, I guess, to improve redundancy in this scenario
only if k8s assigns every POD to a distinct machine, in which case even with a
single machine misbehaving flink would have enough redundancy to allocate the
job to the spare POD. If k8s does allocation randomly, or there are not enough
k8s worker nodes (1 spare POD in our case would mean spreading the PODs over 8
different machines), then it's probably not worth the waste of resources.
  
  K8s will try to schedule replicas of one Deployment onto different Nodes by
default, and we can also force it to do so. But tbh I would not do that in this
case, as in most cases it should be just fine. I expect this situation to be a
rare exception (and I probably jinxed that now) as we have not seen it before
or since. So as long as it's not super critical, I would refrain from trying
to optimize the workload for this type of failure. Ultimately this should be
taken care of by k8s, so we should invest there - especially if it should
happen again.
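
The spreading (and forcing) mentioned above is expressed through pod anti-affinity. A minimal sketch of the required variant, assuming hypothetical label values (the real chart may use different names):

```yaml
# Hypothetical sketch: require replicas of the taskmanager workload to land
# on distinct nodes. Label keys/values are assumptions, not the production
# chart's; requiredDuringScheduling makes pods stay Pending rather than
# co-locate when schedulable nodes run out.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: flink-session-cluster-main-taskmanager
              topologyKey: kubernetes.io/hostname
```

With 6 replicas this needs at least 6 schedulable nodes (7 with a spare), which is the trade-off dcausse describes.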




2022-03-31 Thread dcausse
dcausse updated the task description.




2022-03-31 Thread dcausse
dcausse added a comment.


  Thanks for the quick answer! (response inline)
  
  In T301147#7821582, @JMeybohm wrote:
  
  >> - If the above is not possible, could we mitigate this problem by
over-allocating resources (increasing the number of replicas) to the
deployment, to increase the chances of proper recovery if this situation
happens again?
  >
  > If that makes sense from your POV you could do that ofc. I can't speak to
how problematic this situation was compared to the potential waste of
resources another pod means. But if the current workload is already maxing out
the capacity of the 6 replicas you have, maybe bumping that to 7 might be
smart anyway, to account for peaks?
  
  The additional PODs won't be used, as a flink job does not automatically
scale, so they would be a pure waste of resources (2.5G of reserved mem per
additional POD). It would help, I guess, to improve redundancy in this scenario
only if k8s assigns every POD to a distinct machine, in which case even with a
single machine misbehaving flink would have enough redundancy to allocate the
job to the spare POD. If k8s does allocation randomly, or there are not enough
k8s worker nodes (1 spare POD in our case would mean spreading the PODs over 8
different machines), then it's probably not worth the waste of resources.
  
  > In T301147#7821422, @dcausse wrote:
  >
  >> @JMeybohm do you see any additional action items that would improve the 
resilience of k8s in such scenario?
  >
  > Unfortunately we don't have any data on what went wrong on the node. I
think T277876 would be a step in the right direction, but I also doubt it
would have fully prevented this issue (ultimately I can't say).
  
  Thanks, I'm adding it to the ticket description as a possible improvement.




2022-03-31 Thread JMeybohm
JMeybohm added a comment.


  > To be discussed with service ops:
  >
  > - Investigate and address the reasons why, after a node failure, k8s did
not fulfill its promise of making sure that the rdf-streaming-updater
deployment has 6 working replicas
  
  The problem was more that the node did not really fail (to its full extent).
It was heavily overloaded (for an unknown reason), and that's potentially why
containers/processes running there seemed dead. But from the K8s perspective
the Pods were still running, and a new pod was scheduled as soon as I
power-cycled the node (i.e. K8s was able to detect a mismatch in desired and
existing replicas).
  
  > - If the above is not possible, could we mitigate this problem by
over-allocating resources (increasing the number of replicas) to the
deployment, to increase the chances of proper recovery if this situation
happens again?
  
  If that makes sense from your POV you could do that ofc. I can't speak to
how problematic this situation was compared to the potential waste of
resources another pod means. But if the current workload is already maxing out
the capacity of the 6 replicas you have, maybe bumping that to 7 might be
smart anyway, to account for peaks?
  
  In T301147#7821422, @dcausse wrote:
  
  > @JMeybohm do you see any additional action items that would improve the 
resilience of k8s in such scenario?
  
  Unfortunately we don't have any data on what went wrong on the node. I think
T277876 would be a step in the right direction, but I also doubt it would have
fully prevented this issue (ultimately I can't say).




2022-03-31 Thread dcausse
dcausse moved this task from Ready for Development to Needs review on the 
Discovery-Search (Current work) board.
dcausse added a comment.


  Tentatively moving this ticket to //needs review// as I'm not sure we can do
much more from the search team perspective.
  I think the last point to discuss was to investigate the reasons why a
single misbehaving k8s node could make a deployment unstable.
  @JMeybohm do you see any additional action items that would improve the
resilience of k8s in such a scenario?


WORKBOARD
  https://phabricator.wikimedia.org/project/board/1227/




2022-03-30 Thread dcausse
dcausse updated the task description.




2022-03-30 Thread dcausse
dcausse updated the task description.




2022-03-17 Thread Gehel
Gehel closed subtask T302330: Wikidata MaxLag above 10 for 1hr as 
Resolved.




2022-03-17 Thread Gehel
Gehel closed subtask T302340: codfw wdqs updater failures as 
Resolved.




2022-02-28 Thread Gehel
Gehel added a subtask: T302340: codfw wdqs updater failures.




2022-02-28 Thread Gehel
Gehel added a subtask: T302330: Wikidata MaxLag above 10 for 1hr.




2022-02-28 Thread Gehel
Gehel set the point value for this task to "3".
Gehel added a comment.


  Discussion with service ops will happen on this ticket. Other action items 
will be tracked separately.




2022-02-21 Thread MPhamWMF
MPhamWMF moved this task from Incoming to Current work on the 
Wikidata-Query-Service board.
MPhamWMF added a project: Discovery-Search (Current work).


WORKBOARD
  https://phabricator.wikimedia.org/project/board/891/




2022-02-21 Thread Gehel
Gehel updated the task description.




2022-02-14 Thread Gehel
Gehel added subscribers: bking, RKemper, Gehel.
Gehel added a comment.


  @RKemper or @bking will create an incident report from this ticket. If any
action items are identified, they will be tracked in their own tasks.




2022-02-10 Thread dcausse
dcausse added a comment.


  In T301147#7692414, @JMeybohm wrote:
  
  > In T301147#7689837, @dcausse wrote:
  >
  >> @JMeybohm we're still investigating why the application did not properly
recover when kubernetes1014 went down, but if you have ideas on the two
questions in the ticket description that would be very helpful, thanks!
  >
  > Unfortunately I'm not exactly sure what happened to the node. What I know
is that the system load surged (potentially due to high iowait), leaving
running processes practically starving, but the system was still responding to
ICMP and kubernetes status heartbeats still (mostly) worked, leaving the node
flipping between Ready and NotReady states.
  > That means the node was not actually down from the k8s POV, which is why
no new Pods were created until I drained the node right before I power-cycled
it (as evicting pods was hanging as well, since k8s tries to be nice and the
node was still in its overloaded state).
  
  Thanks! I've updated the task description with a few action items; please
let us know if you see anything else we should do to improve this.




2022-02-10 Thread dcausse
dcausse updated the task description.




2022-02-08 Thread JMeybohm
JMeybohm added a comment.


  In T301147#7689837, @dcausse wrote:
  
  > @JMeybohm we're still investigating why the application did not properly
recover when kubernetes1014 went down, but if you have ideas on the two
questions in the ticket description that would be very helpful, thanks!
  
  Unfortunately I'm not exactly sure what happened to the node. What I know is
that the system load surged (potentially due to high iowait), leaving running
processes practically starving, but the system was still responding to ICMP
and kubernetes status heartbeats still (mostly) worked, leaving the node
flipping between Ready and NotReady states.
  That means the node was not actually down from the k8s POV, which is why no
new Pods were created until I drained the node right before I power-cycled it
(as evicting pods was hanging as well, since k8s tries to be nice and the node
was still in its overloaded state).




2022-02-08 Thread dcausse
dcausse added a comment.


  k8s seems to have tried to kill the container for the whole period,
according to messages like "Container flink-session-cluster-main-taskmanager
failed liveness probe, will be restarted" (searching for
`k8s_event.involvedObject.uid:"1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a"`).
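
A liveness probe of this general shape would produce the events above; a sketch, with port and timing values assumed rather than taken from the actual deployment:

```yaml
# Hypothetical sketch of the taskmanager's liveness probe. On a starving
# node the probe times out, so kubelet keeps restarting a container whose
# process is stalled rather than dead. Port 6122 is Flink's default
# taskmanager RPC port; all values here are assumptions.
containers:
  - name: flink-session-cluster-main-taskmanager
    livenessProbe:
      tcpSocket:
        port: 6122
      initialDelaySeconds: 30
      periodSeconds: 60
      failureThreshold: 3
```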




2022-02-07 Thread dcausse
dcausse added a subscriber: JMeybohm.
dcausse added a comment.


  @JMeybohm we're still investigating why the application did not properly
recover when kubernetes1014 went down, but if you have ideas on the two
questions in the ticket description that would be very helpful, thanks!




2022-02-07 Thread Maintenance_bot
Maintenance_bot added a project: Wikidata.




2022-02-07 Thread RKemper
RKemper updated the task description.




2022-02-07 Thread dcausse
dcausse created this task.
dcausse added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  For 7 hours (`2022-02-06T23:00:00` to `2022-02-07T06:20:00`) the streaming
updater in `eqiad` stopped working properly, preventing edits from flowing to
all the wdqs machines in eqiad.
  The lag started to rise in eqiad and caused edits to be throttled during
this period:
  
  F34944091: Capture d’écran du 2022-02-07 11-40-08.png 

  
  Investigations:
  
  - the streaming updater for WCQS went down from `2022-02-06T16:32:00` to
`2022-02-06T23:00:00`
  - the streaming updater for WDQS went down from `2022-02-06T23:00:00` to
`2022-02-07T06:20:00`
  - the number of total task slots went down from 24 to 20 (4 tasks == 1 POD)
between `2022-02-06T16:32:00` and `2022-02-07T06:20:00`, causing resource
starvation and preventing both jobs from running at the same time
(`flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"}`)
  - kubernetes1014 (T301099) seemed to have shown problems during this same
period (`2022-02-06T16:32:00` to `2022-02-07T06:20:00`)
  - the deployment used by the updater had one POD
(`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a`) on kubernetes1014
  - the flink session cluster was able to regain its 24 slots after
`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` came back (at `2022-02-07T08:07:00`);
then this POD disappeared again in favor of another one and the service
successfully restarted
  - during the whole incident k8s metrics & flink metrics seem to disagree:
    - flink says that it lost 4 task managers (1 POD)
    - k8s always reports at least 6 PODs
(`count(container_memory_usage_bytes{namespace="rdf-streaming-updater",
container="flink-session-cluster-main-taskmanager"})`)
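
The 24-slot / 6-POD arithmetic above maps to Flink's slots-per-taskmanager setting; a hypothetical flink-conf.yaml fragment consistent with those numbers (not the actual deployment config):

```yaml
# Hypothetical flink-conf.yaml fragment matching the figures above:
# 6 taskmanager PODs x 4 slots each = 24 total slots, so losing one POD
# leaves 20. The memory line reflects the "2.5G of reserved mem per POD"
# mentioned elsewhere in the thread; exact values are assumptions.
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.process.size: 2560m
```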
  
  Questions:
  
  - why do flink and k8s metrics disagree (active PODs vs number of task
managers)?
  - why was a new POD not created after kubernetes1014 went down (making
`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` unavailable to the deployment)?
  
  What could we have done better:
  
  - we could have routed wdqs traffic to codfw during the outage and avoided
throttling edits
