Repository: brooklyn-docs
Updated Branches:
  refs/heads/master 7e8166fa1 -> 30aff82ff
Troubleshooting tips for slow Brooklyn

Project: http://git-wip-us.apache.org/repos/asf/brooklyn-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/brooklyn-docs/commit/669b2e94
Tree: http://git-wip-us.apache.org/repos/asf/brooklyn-docs/tree/669b2e94
Diff: http://git-wip-us.apache.org/repos/asf/brooklyn-docs/diff/669b2e94

Branch: refs/heads/master
Commit: 669b2e94ee46446de7b1f0947e79e97c7f23d78a
Parents: 74a25d1
Author: Aled Sage <aled.s...@gmail.com>
Authored: Tue May 31 01:04:15 2016 +0100
Committer: Aled Sage <aled.s...@gmail.com>
Committed: Mon Jun 6 23:56:14 2016 +0100

----------------------------------------------------------------------
 .../troubleshooting/detailed-support-report.md  |  43 ++++
 guide/ops/troubleshooting/index.md              |   2 +
 guide/ops/troubleshooting/slow-unresponsive.md  | 237 +++++++++++++++++++
 website/documentation/faq.md                    |   2 +-
 4 files changed, 283 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/brooklyn-docs/blob/669b2e94/guide/ops/troubleshooting/detailed-support-report.md
----------------------------------------------------------------------
diff --git a/guide/ops/troubleshooting/detailed-support-report.md b/guide/ops/troubleshooting/detailed-support-report.md
new file mode 100644
index 0000000..6e3c741
--- /dev/null
+++ b/guide/ops/troubleshooting/detailed-support-report.md
@@ -0,0 +1,43 @@
+---
+layout: website-normal
+title: Detailed Support Report
+toc: /guide/toc.json
+---
+
+If you wish to send a detailed report, then depending on the nature of the problem, consider
+collecting the following information.
+
+See the [Brooklyn Slow or Unresponsive](slow-unresponsive.html) docs for details of these commands.
+
+{% highlight bash %}
+BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
+BROOKLYN_PID=$(cat $BROOKLYN_HOME/pid_java)
+REPORT_DIR=/tmp/brooklyn-report
+DEBUG_LOG=${BROOKLYN_HOME}/brooklyn.debug.log
+mkdir -p ${REPORT_DIR}
+uname -a > ${REPORT_DIR}/uname.txt
+df -h > ${REPORT_DIR}/df.txt
+cat /proc/cpuinfo > ${REPORT_DIR}/cpuinfo.txt
+cat /proc/meminfo > ${REPORT_DIR}/meminfo.txt
+ulimit -a > ${REPORT_DIR}/ulimit.txt
+cat /proc/${BROOKLYN_PID}/limits >> ${REPORT_DIR}/ulimit.txt
+top -n 1 -b > ${REPORT_DIR}/top.txt
+lsof -p ${BROOKLYN_PID} > ${REPORT_DIR}/lsof.txt
+netstat -an > ${REPORT_DIR}/netstat.txt
+
+jmap -histo:live ${BROOKLYN_PID} > ${REPORT_DIR}/jmap-histo.txt
+jmap -heap ${BROOKLYN_PID} > ${REPORT_DIR}/jmap-heap.txt
+for i in {1..10}; do
+  jstack ${BROOKLYN_PID} > ${REPORT_DIR}/jstack.${i}.txt
+  sleep 1
+done
+grep "brooklyn gc" ${DEBUG_LOG} > ${REPORT_DIR}/brooklyn-gc.txt
+grep "events for subscriber" ${DEBUG_LOG} > ${REPORT_DIR}/events-for-subscriber.txt
+tar czf brooklyn-report.tgz ${REPORT_DIR}
+{% endhighlight %}
+
+Also consider providing your log files and persisted state, though extreme care should be taken if
+these might contain cloud or machine credentials (especially if
+[Externalised Configuration]({{ site.path.guide }}/ops/externalized-configuration.html)
+is not being used for credential storage).
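As a rough aid for the credentials warning above, the sketch below lists lines that mention words often associated with secrets, so they can be reviewed before any file is shared. The function name `find_possible_credentials` and the keyword list are assumptions made for illustration; they are not part of Brooklyn, and a clean result does not prove that a file is safe to share.

```shell
# Hypothetical helper: print "file:line-number:line" for every line in the
# given files that mentions a word commonly associated with credentials.
# The keyword list is an assumption and is neither exhaustive nor
# specific to Brooklyn.
find_possible_credentials() {
  # -H always prints the file name, -n the line number,
  # -E enables alternation and -i ignores case.
  grep -HnEi -- 'password|passphrase|secret|credential|private[._-]?key' "$@"
}
```

For example, `find_possible_credentials ${DEBUG_LOG}` would scan the debug log used in the report script.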
+

http://git-wip-us.apache.org/repos/asf/brooklyn-docs/blob/669b2e94/guide/ops/troubleshooting/index.md
----------------------------------------------------------------------
diff --git a/guide/ops/troubleshooting/index.md b/guide/ops/troubleshooting/index.md
index ee8dfd7..ebbce45 100644
--- a/guide/ops/troubleshooting/index.md
+++ b/guide/ops/troubleshooting/index.md
@@ -5,6 +5,8 @@ children:
 - { path: overview.md, title: Overview }
 - { path: deployment.md, title: Deployment }
 - { path: connectivity.md, title: Server Connectivity }
+- { path: slow-unresponsive.md, title: Brooklyn Slow or Unresponsive }
+- { path: detailed-support-report.md, title: Detailed Support Report }
 - { path: softwareprocess.md, title: SoftwareProcess Entities }
 - { path: going-deep-in-java-and-logs.md, title: Going Deep in Java and Logs }
 ---

http://git-wip-us.apache.org/repos/asf/brooklyn-docs/blob/669b2e94/guide/ops/troubleshooting/slow-unresponsive.md
----------------------------------------------------------------------
diff --git a/guide/ops/troubleshooting/slow-unresponsive.md b/guide/ops/troubleshooting/slow-unresponsive.md
new file mode 100644
index 0000000..0b90e83
--- /dev/null
+++ b/guide/ops/troubleshooting/slow-unresponsive.md
@@ -0,0 +1,237 @@
+---
+layout: website-normal
+title: Brooklyn Slow or Unresponsive
+toc: /guide/toc.json
+---
+
+There are many possible causes for a Brooklyn server becoming slow or unresponsive. This guide
+describes some possible reasons, and some commands and tools that can help diagnose the problem.
+
+Possible reasons include:
+* CPU is max'ed out
+* Memory usage is extremely high
+* SSH'ing is very slow (e.g. due to lack of entropy)
+* Out of disk space
+
+See [Brooklyn Requirements]({{ site.path.guide }}/ops/requirements.html) for details of server
+requirements.
+
+
+## Machine Diagnostics
+
+The following commands will collect OS-level diagnostics about the machine, and about the Brooklyn
+process. The commands below assume use of CentOS 6.x. Minor adjustments may be required for
+other platforms.
+
+
+#### OS and Machine Details
+
+To display system information, run:
+
+{% highlight bash %}
+uname -a
+{% endhighlight %}
+
+To show details of the CPU and memory available to the machine, run:
+
+{% highlight bash %}
+cat /proc/cpuinfo
+cat /proc/meminfo
+{% endhighlight %}
+
+
+#### User Limits
+
+To display information about user limits, run the command below (while logged in as the same user
+who runs Brooklyn):
+
+{% highlight bash %}
+ulimit -a
+{% endhighlight %}
+
+If Brooklyn is run as a different user (e.g. with user name "adalovelace"), then instead run:
+
+{% highlight bash %}
+sudo -u adalovelace sh -c 'ulimit -a'
+{% endhighlight %}
+
+Of particular interest is the limit for "open files".
+
+
+#### Disk Space
+
+The command below will list the disk size for each partition, including the amount used and
+available. If the Brooklyn base directory, persistence directory or logging directory are close
+to 0% available, this can cause serious problems:
+
+{% highlight bash %}
+df -h
+{% endhighlight %}
+
+
+#### CPU and Memory Usage
+
+To view the CPU and memory usage of all processes, and of the machine as a whole, one can use the
+`top` command. This runs interactively, updating every few seconds. To collect the output once
+(e.g. to share diagnostic information in a bug report), run:
+
+{% highlight bash %}
+top -n 1 -b > top.txt
+{% endhighlight %}
+
+
+#### File and Network Usage
+
+To count the number of open files for the Brooklyn process (which includes open socket connections):
+
+{% highlight bash %}
+BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
+BROOKLYN_PID=$(cat $BROOKLYN_HOME/pid_java)
+lsof -p $BROOKLYN_PID | wc -l
+{% endhighlight %}
+
+To count the number of "established" internet connections, run:
+
+{% highlight bash %}
+netstat -an | grep ESTABLISHED | wc -l
+{% endhighlight %}
+
+
+#### Linux Kernel Entropy
+
+A lack of entropy can cause random number generation to be extremely slow. This can cause
+tasks like ssh to also be extremely slow. See
+[linux kernel entropy]({{ site.path.website }}/documentation/increase-entropy.html)
+for details of how to work around this.
+
+
+## Process Diagnostics
+
+#### Thread and Memory Usage
+
+To get memory and thread usage for the Brooklyn (Java) process, two useful tools are `jstack`
+and `jmap`. These require the "development kit" to also be installed
+(e.g. `yum install java-1.7.0-openjdk-devel`). Some useful commands are shown below:
+
+{% highlight bash %}
+BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
+BROOKLYN_PID=$(cat $BROOKLYN_HOME/pid_java)
+
+jstack $BROOKLYN_PID
+jmap -histo:live $BROOKLYN_PID
+jmap -heap $BROOKLYN_PID
+{% endhighlight %}
+
+
+#### Runnable Threads
+
+The [jstack-active](https://github.com/apache/brooklyn-dist/blob/master/scripts/jstack-active.sh)
+script is a convenient light-weight way to quickly see which threads of a running Brooklyn
+server are attempting to consume the CPU. It filters the output of `jstack`, to show only the
+"really-runnable" threads (as opposed to those that are blocked).
+
+{% highlight bash %}
+BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
+BROOKLYN_PID=$(cat $BROOKLYN_HOME/pid_java)
+
+curl -O https://raw.githubusercontent.com/apache/brooklyn-dist/master/scripts/jstack-active.sh
+
+bash jstack-active.sh $BROOKLYN_PID
+{% endhighlight %}
+
+
+#### Profiling
+
+If an in-depth investigation of the CPU usage (and/or object creation) of a Brooklyn Server is
+required, there are many profiling tools designed specifically for this purpose. These generally
+require that the process be launched in such a way that a profiler can attach, which may not be
+appropriate for a production server.
+
+
+#### Remote Debugging
+
+If the Brooklyn Server was originally run to allow a remote debugger to connect (strongly
+discouraged in production!), then this provides a convenient way to investigate why Brooklyn
+is being slow or unresponsive. See the Debugging Tips in
+[Debugging Remote Brooklyn]({{ site.path.guide }}/dev/tips/debugging-remote-brooklyn.html)
+and the [IDE docs]({{ site.path.guide }}/dev/env/ide/) for more
+information.
+
+
+## Log Files
+
+Apache Brooklyn will by default create brooklyn.info.log and brooklyn.debug.log files. See the
+[Logging]({{ site.path.guide }}/ops/logging.html) docs for more information.
+
+The following are useful log messages to search for (e.g. using `grep`). Note the wording of
+these messages (or their very presence) may change in future versions of Brooklyn.
+
+
+#### Normal Logging
+
+The lines below are commonly logged, and can be useful to search for when finding the start of a section of logging.
+
+{% highlight text %}
+2016-05-30 17:05:51,458 INFO o.a.b.l.BrooklynWebServer [main]: Started Brooklyn console at http://127.0.0.1:8081/, running classpath://brooklyn.war
+2016-05-30 17:06:04,098 INFO o.a.b.c.m.h.HighAvailabilityManagerImpl [main]: Management node tF3GPvQ5 running as HA MASTER autodetected
+2016-05-30 17:06:08,982 INFO o.a.b.c.m.r.InitialFullRebindIteration [brooklyn-execmanager-rvpnFTeL-0]: Rebinding from /home/compose/compose-amp-state/brooklyn-persisted-state/data for master rvpnFTeL...
+2016-05-30 17:06:11,105 INFO o.a.b.c.m.r.RebindIteration [brooklyn-execmanager-rvpnFTeL-0]: Rebind complete (MASTER) in 2s: 19 apps, 54 entities, 50 locations, 46 policies, 704 enrichers, 0 feeds, 160 catalog items
+{% endhighlight %}
+
+
+#### Memory Usage
+
+The debug log includes (every minute) a log statement about the memory usage and task activity. For example:
+
+{% highlight text %}
+2016-05-27 12:20:19,395 DEBUG o.a.b.c.m.i.BrooklynGarbageCollector [brooklyn-gc]: brooklyn gc (before) - using 328 MB / 496 MB memory (5.58 kB soft); 69 threads; storage: {datagrid={size=7, createCount=7}, refsMapSize=0, listsMapSize=0}; tasks: 10 active, 33 unfinished; 78 remembered, 1696906 total submitted)
+2016-05-27 12:20:19,395 DEBUG o.a.b.c.m.i.BrooklynGarbageCollector [brooklyn-gc]: brooklyn gc (after) - using 328 MB / 496 MB memory (5.58 kB soft); 69 threads; storage: {datagrid={size=7, createCount=7}, refsMapSize=0, listsMapSize=0}; tasks: 10 active, 33 unfinished; 78 remembered, 1696906 total submitted)
+{% endhighlight %}
+
+These can be extremely useful if investigating a memory or thread leak, or to determine whether a
+surprisingly high number of tasks are being executed.
+
+
+#### Subscriptions
+
+One source of high CPU in Brooklyn is when a subscription (e.g. for a policy or enricher) is being
+triggered many times (i.e. handling many events). A log message like the one below is written for
+every 1000 events handled by a single subscription.
+
+{% highlight text %}
+2016-05-30 17:29:09,125 DEBUG o.a.b.c.m.i.LocalSubscriptionManager [brooklyn-execmanager-rvpnFTeL-8]: 1000 events for subscriber Subscription[SCFnav9g;CanopyComposeApp{id=gIeTwhU2}@gIeTwhU2:webapp.url]
+{% endhighlight %}
+
+If a subscription is handling a huge number of events, there are a couple of common reasons:
+* first, it could be subscribing to too much activity - e.g. a wildcard subscription for all
+  events from all entities.
+* second, it could be an infinite loop (e.g. where an enricher responds to a sensor-changed event
+  by setting that same sensor, thus triggering another sensor-changed event).
+
+
+#### User Activity
+
+All activity triggered by the REST API or web-console will be logged. Some examples are shown below:
+
+{% highlight text %}
+2016-05-19 17:52:30,150 INFO o.a.b.r.r.ApplicationResource [brooklyn-jetty-server-8081-qtp1058726153-17473]: Launched from YAML: name: My Example App
+location: aws-ec2:us-east-1
+services:
+- type: org.apache.brooklyn.entity.webapp.tomcat.TomcatServer
+
+2016-05-30 14:46:19,516 DEBUG brooklyn.REST [brooklyn-jetty-server-8081-qtp1104967201-20881]: Request Tisj14 starting: POST /v1/applications/NiBy0v8Q/entities/NiBy0v8Q/expunge from 77.70.102.66
+{% endhighlight %}
+
+
+#### Entity Activity
+
+If investigating the behaviour of a particular entity (e.g. on failure), it can be very useful to
+`grep` the info and debug log for the entity's id. For a software process, the debug log will
+include the stdout and stderr of all the commands executed by that entity.
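The entity search described above could be sketched as a small shell function. The name `entity_log_lines` is hypothetical, and the approach assumes each log line starts with a timestamp, as in the samples in this guide, so that a plain sort puts the matches in chronological order. Continuation lines that do not contain the id (such as multi-line command output) are not matched.

```shell
# Hypothetical helper: print every log line that mentions an entity id.
# Usage: entity_log_lines <entity-id> <log-file>...
entity_log_lines() {
  local entity_id="$1"
  shift
  # -F treats the id as a fixed string; -h omits file names so that lines
  # from several logs can be merged. sort -u orders the lines by their
  # leading timestamp and drops lines that appear in more than one log.
  grep -hF -- "$entity_id" "$@" | sort -u
}
```

For example, `entity_log_lines mPujYmPd $BROOKLYN_HOME/brooklyn.debug.log` would list the activity recorded for the entity with id `mPujYmPd`, assuming the debug log lives in the Brooklyn home directory.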
+
+It can also be very useful to search for all effector invocations, to see where the behaviour
+has been triggered:
+
+{% highlight text %}
+2016-05-27 12:45:43,529 DEBUG o.a.b.c.m.i.EffectorUtils [brooklyn-execmanager-gvP7MuZF-14364]: Invoking effector stop on TomcatServerImpl{id=mPujYmPd}
+{% endhighlight %}

http://git-wip-us.apache.org/repos/asf/brooklyn-docs/blob/669b2e94/website/documentation/faq.md
----------------------------------------------------------------------
diff --git a/website/documentation/faq.md b/website/documentation/faq.md
index 7af5f80..483d686 100644
--- a/website/documentation/faq.md
+++ b/website/documentation/faq.md
@@ -31,7 +31,7 @@ You could encounter this error when running with many entities.
 Please **increase the ulimit** if you see such error:
 On the VM running Apache Brooklyn, we recommend ensuring nproc and nofile are reasonably high (e.g. higher than 1024, which is often the default).
-We recommend setting it limits to a value above 16000.
+We recommend setting these limits to a value of 16384 or higher.
 If you want to check the current limits run `ulimit -a`.
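
To illustrate the FAQ recommendation above, the sketch below compares the current shell's limits against a minimum, defaulting to the 16384 mentioned in the change. The function names `limit_ok` and `check_limits` are hypothetical, and `check_limits` assumes bash, where `ulimit -n` and `ulimit -u` report the open-file and process limits.

```shell
# Hypothetical helper: succeed if a limit value meets the given minimum.
# A value of "unlimited" always passes.
limit_ok() {
  local value="$1" minimum="$2"
  [ "$value" = "unlimited" ] || [ "$value" -ge "$minimum" ]
}

# Hypothetical helper: report any limit of the current shell that is below
# the minimum (default 16384). Returns non-zero if a limit is too low.
# Assumes bash for the "ulimit -u" flag.
check_limits() {
  local minimum="${1:-16384}" status=0
  limit_ok "$(ulimit -n)" "$minimum" || { echo "nofile is below $minimum"; status=1; }
  limit_ok "$(ulimit -u)" "$minimum" || { echo "nproc is below $minimum"; status=1; }
  return "$status"
}
```

Running `check_limits` as the user that runs Brooklyn prints nothing when both limits are at least 16384.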