Gehel added a comment.
Actually, it seems that only the WDQS_Lag check is not reported to the wdqs-admins group. Patch coming up.TASK DETAILhttps://phabricator.wikimedia.org/T144948EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Smalyshev, Gehel
Gehel added a comment.
So it seems that puppet failures on wdqs1001 are notified on IRC (#wikidata), but 'WDQS HTTP Port' failures are not. Looking at the check definition on neon:/etc/icinga/puppet_services.cfg I don't see a significant difference:
define service {
# --PUPPET_NAME-- wdqs1001
Gehel created subtask T146158: Configure varnish to include wdqs nodes in codfw.
TASK DETAILhttps://phabricator.wikimedia.org/T144380EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: thcipriani, Stashbot, gerritbot, Gehel, Aklapper, Smalyshev
Gehel created this task.Gehel added projects: Wikidata, Wikidata-Query-Service, Operations, Discovery.Herald added a subscriber: Aklapper.
TASK DESCRIPTIONWDQS does not seem to have an LVS service configured, so wdqs servers are configured directly in the cache role. I'm not entirely sure how we
Gehel edited the task description. (Show Details)
EDIT DETAILSWDQS does not seem to have an LVS service configured, so wdqs servers are configured directly in the [[ https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/cache/misc.pp | cache role ]]. I'm
Gehel added a comment.
The configuration of those servers is tracked on T144380. This task can probably be closed, unless it is still used to track some work on the procurement side. @RobH, I'll let you close unless you need it.TASK DETAILhttps://phabricator.wikimedia.org/T138637EMAIL
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Operations, Discovery.Herald added a subscriber: Aklapper.Herald added a project: Wikidata.
TASK DESCRIPTIONSome of the WDQS metrics are collected through an external PHP script, not integrated with diamond or our usual metric
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint.
TASK DETAILhttps://phabricator.wikimedia.org/T132457EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: ema, Gehel, BBlack, Aklapper, mschwarzer, Avner, debt, D3r1ck01, Jonas, FloNight
Gehel removed subscribers: Aklapper, gerritbot, Stashbot, thcipriani.Gehel removed a project: Discovery-Wikidata-Query-Service-Sprint.
TASK DETAILhttps://phabricator.wikimedia.org/T146207EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Smalyshev
Gehel created subtask T146207: publish lag and response time for wdqs codfw to graphite.
TASK DETAILhttps://phabricator.wikimedia.org/T144380EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: thcipriani, Stashbot, gerritbot, Gehel, Aklapper, Smalyshev
Gehel created this task.Gehel added projects: Wikidata, Wikidata-Query-Service, Operations, Discovery, Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review.Herald removed a project: Patch-For-Review.
TASK DESCRIPTIONSome metrics are collected on wdqs outside of diamond, and not deployed
Gehel added a comment.
@Smalyshev yes there is: adding some cache-control headers. Change submitted. Thanks for reminding me!TASK DETAILhttps://phabricator.wikimedia.org/T137238EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: gerritbot, Smalyshev
Gehel added a comment.
logs on wdqs1002 show error opening sockets:
Sep 25 14:35:08 wdqs1002 bash[10426]: 2016-09-25 14:30:56.279:WARN:oejs.ServerConnector:qtp927028538-34-acceptor-0@6b2731c9-ServerConnector@74cdad37{HTTP/1.1}{localhost:}:
Sep 25 14:35:08 wdqs1002 bash[10426
Gehel added a comment.
Restarting blazegraph on wdqs1002 seems to have solved the issue. I'll dig into the logs to see if I find something that would explain the issue.TASK DETAILhttps://phabricator.wikimedia.org/T146576EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel
Gehel added a comment.
It seems that wdqs1002 started showing replication lag issues around 16:00 UTC on Saturday Sept 24. Shortly after, there is a hole in that metric in Graphite. HTTP 502 went up much later, around 19:00 UTC.TASK DETAILhttps://phabricator.wikimedia.org/T146576EMAIL
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery, Operations.Herald added a subscriber: Aklapper.Herald added a project: Wikidata.
TASK DESCRIPTIONThe graphite check for wikidata query service is regularly alerting since Saturday October 1st. Some investigation have
Gehel closed this task as "Resolved".Gehel claimed this task.Gehel added a comment.
As @jcrespo pointed out, the current issue is different from the one raised here. I created T147130 to track this new issue, and I'm closing this one.TASK DETAILhttps://phabricator.wikimedia.org/T1
Gehel added a comment.
Work is happening in sub tasks, but it is happening. This should be done by the end of next week.TASK DETAILhttps://phabricator.wikimedia.org/T124627EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: greg, Southparkfan, hoo, Gehel
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Operations, Discovery.Herald added a subscriber: Aklapper.Herald added a project: Wikidata.
TASK DESCRIPTIONwdqs100[12] are the hosts generating the most traffic to logstash. A quick look seems to indicate that they send
Gehel reassigned this task from Gehel to Deskana.Gehel added a comment.
@Deskana I reassign this to you. Let me know if you need more details or if you want me to move forward and open a hardware request for this.TASK DETAILhttps://phabricator.wikimedia.org/T148747EMAIL PREFERENCEShttps
Gehel claimed this task.Gehel added a comment.
Side note: at this point, the need to increase hardware is more for availability than for scalability.
Constraints:
We want to be able to continue operations in the case where we lose a datacenter, including normal maintenance operations.
Each
Gehel closed subtask T146158: Configure varnish to include wdqs nodes in codfw as "Resolved".
TASK DETAILhttps://phabricator.wikimedia.org/T144380EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: thcipriani, Stashbot, gerritbot, Gehel
Gehel closed this task as "Resolved".Gehel added a comment.
This has been resolved by implementing LVS as described on T132457.TASK DETAILhttps://phabricator.wikimedia.org/T146158EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Aklapper,
Gehel added a comment.
I propose to start by enabling connection limiting with a fairly high limit and lowering it after a week, once we get some feedback.TASK DETAILhttps://phabricator.wikimedia.org/T108488EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences
Gehel closed this task as "Resolved".Gehel added a comment.
This is resolved by https://gerrit.wikimedia.org/r/315651TASK DETAILhttps://phabricator.wikimedia.org/T148015EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: gerritbot, Smalys
Gehel closed this task as "Declined".Gehel added a comment.
We only have logstash servers in eqiad (yes, I know, we should have some in codfw as well, but that's not the case).TASK DETAILhttps://phabricator.wikimedia.org/T147674EMAIL PREFERENCEShttps://phabricator.wikimedia.org/sett
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint, Operations.Herald added a subscriber: Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONThe merge of https://gerrit.wikimedia.org/r/#/c/312225/ changed the metrics
Gehel triaged this task as "High" priority.
TASK DETAILhttps://phabricator.wikimedia.org/T148015EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Smalyshev, Gehel, Aklapper, mschwarzer, Avner, debt, D3r1ck01, Jonas, FloNight, Xmlizer, Iz
Gehel added a subscriber: mpopov.Gehel added a comment.
A quick analysis of the situation after 2 days of limiting connections:
around 2k connections have been rate limited over a 24h period (grep limiting /var/log/nginx/error.log.1 | wc -l), which is ~ 2% of the non cached hits.
the 95-%ile
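The count quoted above comes from `grep limiting /var/log/nginx/error.log.1 | wc -l`. A minimal Python sketch of the same count and of the percentage calculation follows. The sample lines are made-up placeholders, not real nginx log output, and the figure of 100,000 non-cached hits is only an assumption back-derived from "2k is ~2%".

```python
def count_rate_limited(log_lines, marker="limiting"):
    """Count lines containing the marker, like `grep limiting FILE | wc -l`."""
    return sum(1 for line in log_lines if marker in line)


def share_of_hits(limited, non_cached_hits):
    """Return the rate-limited connections as a percentage of non-cached hits."""
    return 100.0 * limited / non_cached_hits


# Placeholder lines for illustration only (not actual nginx error log entries).
sample = [
    "placeholder entry 1: limiting connections",
    "placeholder entry 2: unrelated error",
    "placeholder entry 3: limiting connections",
]

print(count_rate_limited(sample))

# Assumed volume: ~100,000 non-cached hits per 24h (hypothetical figure).
print(share_of_hits(2000, 100000))
```

To run it against a real file, pass an open file object as `log_lines`, for example `count_rate_limited(open(path))`.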
Gehel added a comment.
Of course, one option could be to make the backend also listen to IPv6.TASK DETAILhttps://phabricator.wikimedia.org/T151889EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Aklapper, Smalyshev, Gehel, EBjune, mschwarzer, Avner
Gehel triaged this task as "High" priority.
TASK DETAILhttps://phabricator.wikimedia.org/T151889EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Aklapper, Smalyshev, Gehel, EBjune, mschwarzer, Avner, Zppix, debt, D3r1ck01, Jonas, FloNigh
Gehel created this task.Gehel added projects: Operations, Wikidata-Query-Service, Discovery.Herald added a subscriber: Aklapper.Herald added a project: Wikidata.
TASK DESCRIPTIONnginx logs on wdqs100[12] show errors (see below for example) that seem to indicate that nginx tries to connect
Gehel closed this task as "Resolved".Gehel added a comment.
nginx is now configured to talk to the backend 127.0.0.1 and not localhost, so resolution to IPv6 is bypassed. No more errors seen in the logs.TASK DETAILhttps://phabricator.wikimedia.org/T151889EMAIL PREFER
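As a rough illustration of why pointing at 127.0.0.1 avoids the problem: a literal IPv4 address can only resolve to an IPv4 socket address, whereas a name such as localhost may resolve to both IPv6 and IPv4 depending on the host's configuration. The sketch below uses Python's standard `socket.getaddrinfo`; the helper name `address_families` and the port number are illustrative choices, not taken from the actual setup.

```python
import socket


def address_families(host, port=80):
    """Return the sorted set of address family names that `host` resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[0].name for info in infos})


# A literal IPv4 address never yields an IPv6 result.
print(address_families("127.0.0.1"))

# On a given server, address_families("localhost") shows whether the name
# also resolves to AF_INET6; the result depends on that host's resolver setup.
```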
Gehel added a parent task: T124627: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline..
TASK DETAILhttps://phabricator.wikimedia.org/T152644EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Southparkfan, Smalyshev
Gehel added subtasks: T152643: rack/setup/install wdqs1003, T152644: rack/setup/install wdqs2003.
TASK DETAILhttps://phabricator.wikimedia.org/T124627EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: greg, Southparkfan, hoo, Gehel, RobH, Deskana, Tfinc
Gehel added a parent task: T124627: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline..
TASK DETAILhttps://phabricator.wikimedia.org/T152643EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: Cmjohnson, GehelCc: Southparkfan
Gehel closed subtask T144380: Install and configure new WDQS nodes on codfw as "Resolved".
TASK DETAILhttps://phabricator.wikimedia.org/T124862EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: hoo, Gehel, Tfinc, Deskana, Joe, EBernhardson
Gehel closed this task as "Resolved".Gehel added a comment.
Oops... yes, it has been done for some time... We now have 2 new servers (T152643 and T152644) but that's a different task...TASK DETAILhttps://phabricator.wikimedia.org/T144380EMAIL PREFERENCEShttps://phabricator.wikimedia.or
Gehel closed this task as "Resolved".Gehel claimed this task.
TASK DETAILhttps://phabricator.wikimedia.org/T124862EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: hoo, Gehel, Tfinc, Deskana, Joe, EBernhardson, Aklapper, Smalyshev, Th3d3v1
Gehel closed subtask T124862: Deploy WDQS nodes on codfw as "Resolved".
TASK DETAILhttps://phabricator.wikimedia.org/T124627EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: greg, Southparkfan, hoo, Gehel, RobH, Deskana, Tfinc, Smalyshev, Akl
Gehel added a comment.
At this point, the only workable option is the "single LDF server" (apart from abandoning LDF completely). So let's see how we could implement that and see what feedback we get.
limitations
As far as I can see, LVS does not support an active / passive fa
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint.Herald added a subscriber: Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONT159574 has been fixed by sending all LDF traffic to a single server. This has obvious
Gehel moved this task from Needs review to Done on the Discovery-Wikidata-Query-Service-Sprint board.Gehel added a comment.
This is done. Longer term solution is tracked on T161240.TASK DETAILhttps://phabricator.wikimedia.org/T159574WORKBOARDhttps://phabricator.wikimedia.org/project/board/1239
Gehel added a comment.
Varnish patch deployed. I'll keep an eye on logs to make sure all requests are routed as we expect. We still need to find a longer term solution, but that's another ticket.TASK DETAILhttps://phabricator.wikimedia.org/T159574EMAIL PREFERENCEShttps://phabricator.wikimedia.org
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Search (Current work), Operations.Herald added a subscriber: Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONTraffic is ready for active / active applications, WDQS is ready to be active / active
Gehel triaged this task as "High" priority.
TASK DETAILhttps://phabricator.wikimedia.org/T162111EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Smalyshev, Gehel, Aklapper, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, Salgo60, Avner, Z
Gehel edited projects, added Traffic; removed Patch-For-Review.Gehel added a subscriber: BBlack.Gehel added a comment.
I'm not sure the change is effective. While I do see a few requests (outside of pybal / icinga) in the nginx logs on the wdqs codfw servers, I don't see as many as I would expect
Gehel added a comment.
Using the following curl to test, I don't see an entry in the nginx access log:
curl 'https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=%23Streets%20without%20a%20city%0ASELECT%20%3Fstreet%20%3FstreetLabel%0AWHERE%0A%7B%0A%20%20%20%20%3Fstreet%20wdt%3AP31%2Fwdt
Gehel moved this task from Backlog to Done on the Discovery-Search (Current work) board.Gehel added a comment.
grafana dashboard was wrongly filtering on eqiad only (that's why I did not see any traffic there). More tests and checking x-cache and x-served-by headers show that indeed traffic
Gehel added a comment.
Initial import is completed, wdqs-updater is restarted and is catching up on the differences since last export.TASK DETAILhttps://phabricator.wikimedia.org/T162111EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Stashbot
Gehel added subscribers: ema, BBlack.Gehel added a comment.
For some more context:
LDF is a way to cheaply get large lists of triples from WDQS, and to move some logic to the clients. Retrieving this list is done page by page. We already have use cases for this. The iteration order is just
Gehel added a comment.
An early report based on GC logs is available. This report analyses only 14h of data, so no hard conclusions yet. Still, a few things to note:
We use G1, I have not much experience with it, I need to learn!
Over the period analysed, we rarely go over 8GB of heap after GC
Gehel added a comment.
Elasticsearch configuration is done through puppet (https://github.com/wikimedia/puppet/blob/production/modules/elasticsearch/manifests/init.pp#L132-L141)TASK DETAILhttps://phabricator.wikimedia.org/T159248EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel
Gehel added a comment.
The UNKNOWN disappeared now that we are active/active. Previously, when no traffic was sent to codfw, we had no meaningful data about response time. This can be closed again.TASK DETAILhttps://phabricator.wikimedia.org/T119915EMAIL PREFERENCEShttps
Gehel added a comment.
@Cmjohnson now that I am back, do you need anything from me to move forward on this?TASK DETAILhttps://phabricator.wikimedia.org/T171210EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: Cmjohnson, GehelCc: PokestarFan, Aklapper, Gehel
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Search (Current work).Herald added subscribers: PokestarFan, Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONwdq-beta is still running on ubuntu trusty. This would allow to remove a significant
Gehel added a comment.
After discussion with @Smalyshev, the data reload procedure should be:
1 server at a time on eqiad
we can reload 2 servers at the same time on codfw as the traffic there is minimal
care must be taken to send LDF traffic away from wdqs1001 before reloading
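The procedure above can be sketched as a simple batching plan. This is only an illustration: the host lists are assumptions (the thread names some of these servers but does not give the full fleet), and the `print` calls stand in for the real depool and reload steps.

```python
def plan_reload(hosts, batch_size):
    """Split hosts into consecutive batches of at most batch_size servers."""
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Hypothetical host lists, for illustration only.
eqiad = ["wdqs1001", "wdqs1002", "wdqs1003"]
codfw = ["wdqs2001", "wdqs2002", "wdqs2003"]
LDF_HOST = "wdqs1001"  # server receiving the LDF traffic, per the note above

# One server at a time in eqiad, two at a time in codfw.
for batch in plan_reload(eqiad, 1) + plan_reload(codfw, 2):
    if LDF_HOST in batch:
        print("move LDF traffic away from", LDF_HOST, "before reloading")
    print("reload", ", ".join(batch))
```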
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint.Herald added subscribers: PokestarFan, Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONNow that T170860 is deployed, having a look at what users are being throttled
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint.Herald added subscribers: PokestarFan, Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONOn Aug 08 11:12:50, wdqs-updater on wdqs1001 was throttled. It stopped doing any
Gehel added a comment.
16 hours after deployment, ~900 requests have been throttled (for a total of ~30k requests, so ~3% of requests are being throttled).
~650 of those requests are obviously by bots. We might want to contact the bots owner (where possible) to see if the restriction is affecting
Gehel created this task.Gehel added projects: Ops-Access-Requests, Operations, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint.Herald added subscribers: PokestarFan, Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONFor maintenance operations, Stas (or other wdqs
Gehel added a project: Discovery-Search (Current work).
TASK DETAILhttps://phabricator.wikimedia.org/T172713EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Stashbot, gerritbot, Aklapper, PokestarFan, Gehel, Smalyshev, Lordiis, GoranSMilovanovic
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Operations.Herald added subscribers: PokestarFan, Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONwdqs log messages would be much easier to read / analyze if they were sent to logstash.
wdqs logging
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery.Herald added a subscriber: Aklapper.Herald added a project: Wikidata.
TASK DESCRIPTIONAs seen multiple times on grafana, wdqs-updater gets stuck after the wdqs-blazegraph restart that happens during deployment of new
Gehel created this task.Gehel added a project: Wikidata-Query-Service.Herald added a subscriber: Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONDuring a recent high edit rate on wikidata, wikidata query service stopped processing updates. The updater logs show:
May 26 10:14
Gehel updated the task description. (Show Details)
CHANGES TO TASK DESCRIPTION...Looking at [[ https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/java/org/wikidata/query/rdf/tool/change/TailingChangesPoller.java#L64-L96 | TailingChangesPoller ]], it looks like the case
Gehel created this task.Gehel added a project: Wikidata-Query-Service.Herald added a subscriber: Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONToday (May 29th) since ~ noon UTC, wdqs1002 is lagging behind on replication. Updates are still happening, just not fast enough
Gehel added a project: Operations.Gehel added a comment.
Looking at dmesg, there are a lot of warnings about CPU temperature and throttling:
[9098037.343804] CPU23: Package temperature above threshold, cpu clock throttled (total events = 52647618)
This has been going on at least since May 22
Gehel added a subscriber: Cmjohnson.Gehel added a comment.
@Cmjohnson is this high temperature an indication that you should do some magic with thermal paste?TASK DETAILhttps://phabricator.wikimedia.org/T166524EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences
Gehel added a subscriber: RobH.Gehel added a comment.
@RobH: from racktables, it looks like wdqs1002 is 4.5 years old (purchase date = 2012-12-05, same as wdqs1001 - other servers are newer). I'm not sure about the warranty status, or when we should think about renewing those servers. Any idea
Gehel added a project: Discovery-Search (Current work).Gehel added a comment.
Taking wdqs1002 out of LVS seems to have given it sufficient breathing space to catch up on replication. I added it back and it seems stable so far. I'm still not trusting it entirely...TASK DETAILhttps
Gehel added a project: Discovery-Search (Current work).
TASK DETAILhttps://phabricator.wikimedia.org/T166378EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: gerritbot, Stashbot, Smalyshev, Gehel, Aklapper, GoranSMilovanovic, Adik2382, Th3d3v1ls
Gehel claimed this task.
TASK DETAILhttps://phabricator.wikimedia.org/T166378EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: gerritbot, Stashbot, Smalyshev, Gehel, Aklapper, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune
Gehel claimed this task.
TASK DETAILhttps://phabricator.wikimedia.org/T166524EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: hoo, Jonas, Lydia_Pintscher, RobH, Cmjohnson, Stashbot, Alphos, Smalyshev, Gehel, Aklapper, GoranSMilovanovic, Th3d3v1ls
Gehel claimed this task.Gehel added a comment.
thermal paste has been added by @Cmjohnson, this can be closed.TASK DETAILhttps://phabricator.wikimedia.org/T166524EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: hoo, Jonas, Lydia_Pintscher, RobH
Gehel created this task.Gehel added projects: hardware-requests, Wikidata-Query-Service.Herald added projects: Operations, Wikidata, Discovery.
TASK DESCRIPTIONWe had a recent issue on wdqs1002 (T166524) with CPU overheating. After a chat with @RobH and since wdqs1001 and wdqs1002 are almost 5
Gehel moved this task from In progress to Done on the Discovery-Search (Current work) board.Gehel reassigned this task from Gehel to Cmjohnson.Gehel added a comment.
wdqs1002 has not had any issue since then. Hardware request is done on a separate ticket. It probably still makes sense to have
Gehel added a comment.
Looking at wdqs-updater GC logs on wdqs1004, in the last 7 days:
heap before GC peaks at ~1.4GB (with a few higher peaks at ~2GB)
heap after full GC is ~512MB
max heap size is configured at 2GB
allocation rate over that period is ~70MB/s (but probably peaks much higher
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint.Herald added a subscriber: Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONA quick look at GC logs on wdqs1004 shows that almost 5% of time is spent in GC. That seems
Gehel added a comment.
I had a chat with @Yurik to dig a bit more into this issue. He is seeing a 75% GC overhead, which seems very different than the behaviour we observe on our own wdqs instances. We are doing our own investigation into GC tuning (T175919).
I did give @Yurik a few ideas to try
Gehel closed this task as "Declined".
TASK DETAILhttps://phabricator.wikimedia.org/T176192EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Yurik, Smalyshev, Gehel, Aklapper, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, mer
Gehel added a comment.
My bad... There is now a role::wdqs::labs class that can be applied on VMs on Horizon. This class includes everything in the role::wdqs except the few things that should not be enabled in WMCS. I already applied the class and ran puppet. There is still an error, since
Gehel added a comment.
New GC configuration deployed, all servers restarted. I'll wait a few hours and I'll have a look at GC logs to find out if we see improvements.
For reference, the JVM options before:
-XX:+UseG1GC
-Xms12g
-Xmx12g
-Xloggc:/var/log/wdqs/wdqs-blazegraph_jvm_gc.%p.log
-XX
Gehel added a comment.
See F9995047: wdqs gc logs for an example of problematic GC log.TASK DETAILhttps://phabricator.wikimedia.org/T175919EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Stashbot, gerritbot, Smalyshev, Gehel, Aklapper, Gq86, Lordiis
Gehel added a comment.
for reference:
JVM:
gehel@wdqs1004:~$ java -version
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-1~bpo8+1-b15)
OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)
JVM options (full command line):
java -server -XX:+UseG1
Gehel added a comment.
Using a demo version of Jclarity Censum, I see the following:
Problems
Premature promotion:
There are a number of possible causes for this problem:
Survivor spaces are too small.
Young gen is too small.
The -XX:MaxTenuringThreshold flag may have been set too low
Gehel added a comment.
It would be interesting to see if allocation rate goes up when we see the JVM locking up. gceasy does not graph allocation rate over time, but I think that JClarity Censum might do that (closed source, but demo license available).TASK DETAILhttps://phabricator.wikimedia.org
Gehel added a comment.
Transparent huge pages seem to be set to madvise rather than fully disabled:
gehel@wdqs1004:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
Patch coming up to test other options.TASK DETAILhttps://phabricator.wikimedia.org/T175919EMAIL PREFERENCEShttps
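For reading the output quoted above: the kernel lists all available modes in that file and marks the active one with square brackets. A minimal sketch of extracting the active mode from such a line (the helper name `active_thp_mode` is made up for illustration):

```python
def active_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs 'enabled' line, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


print(active_thp_mode("always [madvise] never"))  # madvise
```

On a live host the input would be the contents of `/sys/kernel/mm/transparent_hugepage/enabled`.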
Gehel added a comment.
A thread on Friends of JClarity suggests:
adding -XX:+ParallelRefProcEnabled (since we seem to have really high Reference processing times)
turn off PrintTenuringThreshold, since it is rarely useful with G1.
check transparent huge pages are disabled
activate
Gehel added a comment.
Comments inline. Keep in mind that my understanding of GC is limited, I am most probably wrong in a lot of what I write below. And my understanding of G1 is even more limited...
In T175919#3647588, @Smalyshev wrote:
So I took a look at our logs from Sep 29 with http
Gehel moved this task from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board.Gehel added a comment.
Looking at a few days of logs, after the recent configuration changes:
overall, things look better, GC overhead is under control, < 5%
we still have long GC pauses (betw
Gehel created this task.Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint.Herald added a subscriber: Aklapper.Herald added projects: Wikidata, Discovery.
TASK DESCRIPTIONAfter an interesting discussion on the Friends of JClarity Google group, Martijn Verburg
Gehel added a comment.
In T175948#3630631, @Smalyshev wrote:
The tricky part here is that once we deploy the change, we'd need to reload immediately (since the dictionary would be incompatible). But we don't want to take all servers down at the same time. So we need a way to run server
Gehel updated the task description. (Show Details)
CHANGES TO TASK DESCRIPTION...[ ] [[ https://www.mediawiki.org/wiki/Developer_access | Request developer access ]] (should be done directly by Martijn and Kirk, does not need the NDA to be signed, so we can move forward here already)
[ ] [[ https
Gehel added a comment.
Looking at Grafana, we can already see a decrease in overall GC time. Looking good!TASK DETAILhttps://phabricator.wikimedia.org/T175919EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: Stashbot, gerritbot, Smalyshev, Gehel
Gehel added a comment.
According to Kirk Pepperdine, we might have run into a G1 bug... I'll try to help the JVM guys debug it, and we might get a long term solution at some point...TASK DETAILhttps://phabricator.wikimedia.org/T175919EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings
Gehel added a comment.
Oh, I was expecting %{HOSTNAME} to be interpreted by logstash itself, not as a ref in the same document. There is something about HOSTNAME being lazy loaded in recent version of logback. I'll try to find the reference...TASK DETAILhttps://phabricator.wikimedia.org
Gehel added a comment.
initial data import is done, wdqs100[45] can now be pooled.TASK DETAILhttps://phabricator.wikimedia.org/T171210EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: ops-monitoring-bot, gerritbot, PokestarFan, Aklapper, Gehel
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint.
TASK DETAILhttps://phabricator.wikimedia.org/T175017EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: gerritbot, Aklapper, Smalyshev, Gehel, Lordiis, Lucas_Werkmeister_WMDE
Gehel added a subscriber: EBjune.Gehel added a comment.
@EBjune: you should have access to kibana already. You need to use your wikitech (LDAP) username. The authorization is done on the ops / nda / wmf groups, and I expect that you are already in the nda and wmf groups. Ping me when you