Yeah, I didn't pay attention to the cached memory at all, my bad!

I remember running into a similar situation a couple of years ago; one of the 
things we did to investigate our memory profile was to produce a full heap dump 
and analyse it manually with a tool like MAT (the Eclipse Memory Analyzer).
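
In case it is useful, a minimal sketch of how to grab such a dump, assuming the
Solr JVM's pid is known and a matching JDK's jmap is on the PATH (the file paths
below are just placeholders):

    # dump only live objects to a binary hprof file (briefly pauses the JVM)
    jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr_pid>

    # or have the JVM write a dump automatically on OutOfMemoryError
    # by adding these startup flags
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/

The resulting .hprof file can then be opened in MAT to inspect dominator trees
and retained sizes.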

Cheers,
-patrick




On 17/03/2016, 21:58, "Otis Gospodnetić" <otis.gospodne...@gmail.com> wrote:

>Hi,
>
>On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje <pplaa...@gmail.com>
>wrote:
>
>> Hi,
>>
>> From the sar output you supplied, it looks like you might have a memory
>> issue on your hosts. The memory usage just before your crash seems to be
>> *very* close to 100%. Even the slightest increase (by Solr itself, or
>> possibly by a system service) could have caused the system to crash. What
>> are the specifications of your hosts, and how much memory are you allocating?
>
>
>That's normal actually - http://www.linuxatemyram.com/
>
>You *want* Linux to be using all your memory - you paid for it :)
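>
>For reference, one quick way to see how much of that "used" memory is really
>just reclaimable page cache (a rough sketch, assuming a reasonably recent
>procps on the hosts):
>
>    free -m
>    # look at the "-/+ buffers/cache" line, or the "available" column on
>    # newer versions -- that is what is actually left for applications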
>
>Otis
>--
>Monitoring - Log Management - Alerting - Anomaly Detection
>Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>
>>
>
>
>>
>>
>> On 16/03/2016, 14:52, "YouPeng Yang" <yypvsxf19870...@gmail.com> wrote:
>>
>> >Hi
>> > It happened again, and this time it was worse: the system crashed and we
>> >could not even connect to it with ssh.
>> > I used the sar command to capture statistics about it. Here are the
>> >details:
>> >
>> >
>> >[1] CPU (sar -u). We had to restart the system, as shown by the LINUX
>> >RESTART marker in the logs.
>> >--------------------------------------------------------------------------------------------------
>> >                CPU     %user     %nice   %system   %iowait    %steal     %idle
>> >03:00:01 PM     all      7.61      0.00      0.92      0.07      0.00     91.40
>> >03:10:01 PM     all      7.71      0.00      1.29      0.06      0.00     90.94
>> >03:20:01 PM     all      7.62      0.00      1.98      0.06      0.00     90.34
>> >03:30:35 PM     all      5.65      0.00     31.08      0.04      0.00     63.23
>> >03:42:40 PM     all     47.58      0.00     52.25      0.00      0.00      0.16
>> >Average:        all      8.21      0.00      1.57      0.05      0.00     90.17
>> >
>> >04:42:04 PM       LINUX RESTART
>> >
>> >04:50:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
>> >05:00:01 PM     all      3.49      0.00      0.62      0.15      0.00     95.75
>> >05:10:01 PM     all      9.03      0.00      0.92      0.28      0.00     89.77
>> >05:20:01 PM     all      7.06      0.00      0.78      0.05      0.00     92.11
>> >05:30:01 PM     all      6.67      0.00      0.79      0.06      0.00     92.48
>> >05:40:01 PM     all      6.26      0.00      0.76      0.05      0.00     92.93
>> >05:50:01 PM     all      5.49      0.00      0.71      0.05      0.00     93.75
>> >--------------------------------------------------------------------------------------------------
>> >
>> >[2] Memory (sar -r)
>> >--------------------------------------------------------------------------------------------------
>> >            kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
>> >03:00:01 PM   1519272 196633272     99.23    361112  76364340 143574212     47.77
>> >03:10:01 PM   1451764 196700780     99.27    361196  76336340 143581608     47.77
>> >03:20:01 PM   1453400 196699144     99.27    361448  76248584 143551128     47.76
>> >03:30:35 PM   1513844 196638700     99.24    361648  76022016 143828244     47.85
>> >03:42:40 PM   1481108 196671436     99.25    361676  75718320 144478784     48.07
>> >Average:      5051607 193100937     97.45    362421  81775777 142758861     47.50
>> >
>> >04:42:04 PM       LINUX RESTART
>> >
>> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
>> >05:00:01 PM 154357132  43795412     22.10     92012  18648644 134950460     44.90
>> >05:10:01 PM 136468244  61684300     31.13    219572  31709216 134966548     44.91
>> >05:20:01 PM 135092452  63060092     31.82    221488  32162324 134949788     44.90
>> >05:30:01 PM 133410464  64742080     32.67    233848  32793848 134976828     44.91
>> >05:40:01 PM 132022052  66130492     33.37    235812  33278908 135007268     44.92
>> >05:50:01 PM 130630408  67522136     34.08    237140  33900912 135099764     44.95
>> >Average:    136996792  61155752     30.86    206645  30415642 134991776     44.91
>> >--------------------------------------------------------------------------------------------------
>> >
>> >
>> >As the entries above show, the host hung from 03:30:35 PM and stayed
>> >unresponsive until I restarted it manually at 04:42:04 PM.
>> >All the above information only snapshots the performance around the crash;
>> >nothing in it points to the cause. I have also checked /var/log/messages
>> >and found nothing useful.
>> >
>> >Note that running sar -v shows something abnormal:
>> >------------------------------------------------------------------------------------------------
>> >            dentunusd   file-nr  inode-nr    pty-nr
>> >02:50:01 PM  11542262      9216     76446       258
>> >03:00:01 PM  11645526      9536     76421       258
>> >03:10:01 PM  11748690      9216     76451       258
>> >03:20:01 PM  11850191      9152     76331       258
>> >03:30:35 PM  11972313     10112    132625       258
>> >03:42:40 PM  12177319     13760    340227       258
>> >Average:      8293601      8950     68187       161
>> >
>> >04:42:04 PM       LINUX RESTART
>> >
>> >04:50:01 PM dentunusd   file-nr  inode-nr    pty-nr
>> >05:00:01 PM     35410      7616     35223         4
>> >05:10:01 PM    137320      7296     42632         6
>> >05:20:01 PM    247010      7296     42839         9
>> >05:30:01 PM    358434      7360     42697         9
>> >05:40:01 PM    471543      7040     42929        10
>> >05:50:01 PM    583787      7296     42837        13
>> >------------------------------------------------------------------------------------------------
>> >
>> >and I checked the man page for the -v option:
>> >------------------------------------------------------------------------------------------------
>> >*-v*  Report status of inode, file and other kernel tables.  The following
>> >values are displayed:
>> >       *dentunusd*
>> >Number of unused cache entries in the directory cache.
>> >*file-nr*
>> >Number of file handles used by the system.
>> >*inode-nr*
>> >Number of inode handlers used by the system.
>> >*pty-nr*
>> >Number of pseudo-terminals used by the system.
>> >------------------------------------------------------------------------------------------------
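>> >
>> >For what it's worth, these kernel tables can also be watched live; a rough
>> >sketch, assuming a standard Linux /proc layout and sysstat installed:
>> >
>> >    sar -v 60                        # sample the same counters every 60 seconds
>> >    cat /proc/sys/fs/dentry-state    # nr_dentry and nr_unused for the dentry cache
>> >    cat /proc/sys/fs/file-nr         # allocated, unused and maximum file handles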
>> >
>> >Is there any clue about the crash? Could you please give me some
>> >suggestions?
>> >
>> >
>> >Best Regards.
>> >
>> >
>> >2016-03-16 14:01 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
>> >
>> >> Hello
>> >>    The problem has appeared several times, but I could not capture the
>> >> top output. My script is below.
>> >> It checks whether the sys CPU usage exceeds 30%; the other metrics are
>> >> dumped successfully, but the top output is not.
>> >> Would you please check my script? I cannot figure out what is wrong.
>> >>
>> >>
>> >>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> #!/bin/bash
>> >>
>> >> # Dump diagnostics whenever system CPU usage goes above 30%.
>> >> while :
>> >>   do
>> >>     # print 1 if the %sys field (assumed to be column 6 of the mpstat
>> >>     # summary line) is below 30, otherwise print 0
>> >>     sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}')
>> >>
>> >>     if [ $sysusage -eq 0 ]; then
>> >>         #echo $sysusage
>> >>         #perf record -o perf$(date +%Y%m%d%H%M%S).data -a -g -F 1000 sleep 30
>> >>         file=$(date +%Y%m%d%H%M%S)
>> >>         # top needs -b (batch mode) when its output is redirected to a file;
>> >>         # without it, top fails when no terminal is attached
>> >>         top -b -n 2 >> top$file.data
>> >>         iotop -b -n 2 >> iotop$file.data
>> >>         iostat >> iostat$file.data
>> >>         # count TCP connections per state
>> >>         netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
>> >>     fi
>> >>     sleep 5
>> >>   done
>> >>
>> >>
>> >>
>> >>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >>
>> >> 2016-03-08 21:39 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
>> >>
>> >>> Hi all
>> >>>   Thanks for your reply. I have spent quite some time investigating, and
>> >>> I will post some logs of top and IO in a few days, when the crash comes
>> >>> again.
>> >>>
>> >>> 2016-03-08 10:45 GMT+08:00 Shawn Heisey <apa...@elyograg.org>:
>> >>>
>> >>>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
>> >>>> > How does this relate to YouPeng reporting that the CPU usage increases?
>> >>>> >
>> >>>> > This is not a snark. YouPeng mentions kernel issues. It might very well
>> >>>> > be that IO is the real problem, but that it manifests in a non-intuitive
>> >>>> > way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
>> >>>> > not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
>> >>>> > system is struggling, even if IO-Wait is low?
>> >>>>
>> >>>> It might turn out to be not directly related to memory, you're right
>> >>>> about that.  A very high query rate or particularly CPU-heavy queries or
>> >>>> analysis could cause high CPU usage even when memory is plentiful, but
>> >>>> in that situation I would expect high user percentage, not kernel.  I'm
>> >>>> not completely sure what might cause high kernel usage if iowait is low,
>> >>>> but no specific information was given about iowait.  I've seen iowait
>> >>>> percentages of 10% or less with problems clearly caused by iowait.
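>> >>>>
>> >>>> One rough way to check whether a single device is saturated even when the
>> >>>> aggregate iowait looks low (assuming sysstat is installed) is the extended
>> >>>> per-device report:
>> >>>>
>> >>>>     iostat -x 5 3    # watch %util and await per device over 5-second samples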
>> >>>>
>> >>>> With the available information (especially seeing 700GB of index data),
>> >>>> I believe that the "not enough memory" scenario is more likely than
>> >>>> anything else.  If the OP replies and says they have plenty of memory,
>> >>>> then we can move on to the less common (IMHO) reasons for high CPU with
>> >>>> a large index.
>> >>>>
>> >>>> If the OS is one that reports load average, I am curious what the 5
>> >>>> minute average is, and how many real (non-HT) CPU cores there are.
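>> >>>>
>> >>>> For example (assuming a Linux host), something like this would answer both:
>> >>>>
>> >>>>     uptime                                    # 1/5/15-minute load averages
>> >>>>     lscpu | grep -E '^(Socket|Core|Thread)'   # sockets, cores per socket, threads per core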
>> >>>>
>> >>>> Thanks,
>> >>>> Shawn
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>>
