It would be extremely helpful to get some info about your heap.  At a bare 
minimum, a histogram of the heap dump would be useful, but ideally a full heap 
dump would be best.

jmap  -dump:live,format=b,file=heap.bin PID

Taking a look at that in YourKit should give some pretty quick insight into 
what kinds of objects are allocated then we can get to the bottom of the issue. 
 This should be moved to a JIRA 
(https://issues.apache.org/jira/secure/Dashboard.jspa 
<https://issues.apache.org/jira/secure/Dashboard.jspa>) in order to track and 
fix it, if you could attach that heap dump it would be very helpful.

Jon

> On Feb 7, 2018, at 6:11 AM, Nicolas Guyomar <nicolas.guyo...@gmail.com> wrote:
> 
> Ok then, following up on the wild guess : because you have quite a lot of 
> concurrent compactors, maybe it is too much concurrent compactions for the 
> jvm to deal with (taking into account that your load average of 106 seems 
> really high IMHO)
> 
> 55Gb of data is not that much, you can try to reduce those concurrent 
> compactor to make sure your box is not under too much stress (how many 
> compaction do you have in parallel during boostrap ? )
> 
> In the end, it does seem that you're gonna have to share some heap dump for 
> further investigation (sorry I'm not gonna help lot on this matter)
> 
> On 7 February 2018 at 14:43, Jürgen Albersdorfer <jalbersdor...@gmail.com 
> <mailto:jalbersdor...@gmail.com>> wrote:
> Hi Nicolas,
> 
> Do you know how many sstables is this new node suppose to receive ?
> 
> If I can find out this via nodetool netstats, then this would be 619 as 
> following:
> 
> # nodetool netstats
> Bootstrap b95371e0-0c0a-11e8-932b-f775227bf21c
>     /192.168.1.215 <http://192.168.1.215/> - Receiving 71 files, 7744612158 
> bytes total. Already received 0 files, 893897583 bytes total
>     /192.168.1.214 <http://192.168.1.214/> - Receiving 58 files, 5693392001 
> bytes total. Already received 0 files, 1078372756 bytes total
>     /192.168.1.206 <http://192.168.1.206/> - Receiving 52 files, 3389096409 
> bytes total. Already received 3 files, 508592758 bytes total
>     /192.168.1.213 <http://192.168.1.213/> - Receiving 59 files, 6041633329 
> bytes total. Already received 0 files, 1038760653 bytes total
>     /192.168.1.231 <http://192.168.1.231/> - Receiving 79 files, 7579181689 
> bytes total. Already received 4 files, 38387859 bytes total
>     /192.168.1.208 <http://192.168.1.208/> - Receiving 51 files, 3272885123 
> bytes total. Already received 3 files, 362450903 bytes total
>     /192.168.1.207 <http://192.168.1.207/> - Receiving 56 files, 3028344200 
> bytes total. Already received 3 files, 57790197 bytes total
>     /192.168.1.232 <http://192.168.1.232/> - Receiving 79 files, 7268716317 
> bytes total. Already received 1 files, 1127174421 bytes total
>     /192.168.1.209 <http://192.168.1.209/> - Receiving 114 files, 21381846105 
> bytes total. Already received 1 files, 961497222 bytes total
> 
> 
> does disabling compaction_throughput_mb_per_sec or increasing 
> concurrent_compactors has any effect ?
> 
> I will give it a try:
> 
> # nodetool getcompactionthroughput
> Current compaction throughput: 128 MB/s
> 
> # nodetool setcompactionthroughput 0
> 
> # nodetool getcompactionthroughput
> Current compaction throughput: 0 MB/s
> 
> # nodetool getconcurrentcompactors
> Current concurrent compactors in the system is:
> 16
> 
> 
> Which memtable_allocation_type are you using ?
>  
> # grep memtable_allocation_type /etc/cassandra/conf/cassandra.yaml
> memtable_allocation_type: heap_buffers
> 
> 
> thanks so far, regards
> Juergen
> 
> 2018-02-07 14:29 GMT+01:00 Nicolas Guyomar <nicolas.guyo...@gmail.com 
> <mailto:nicolas.guyo...@gmail.com>>:
> Hi Jurgen,
> 
> It does feel like some OOM during boostrap from previous C* v2, but that 
> sould be fixed in your version.
> 
> Do you know how many sstables is this new node suppose to receive ? 
> 
> Juste a wild guess : it may have something to do with compaction not keeping 
> up because every other nodes are streaming data to this new one (resulting in 
> long lived object in the heap), does disabling 
> compaction_throughput_mb_per_sec or increasing concurrent_compactors has any 
> effect ? 
> 
> Which memtable_allocation_type are you using ? 
> 
> 
> On 7 February 2018 at 12:38, Jürgen Albersdorfer <jalbersdor...@gmail.com 
> <mailto:jalbersdor...@gmail.com>> wrote:
> Hi, I always face an issue when bootstrapping a Node having less than 184GB 
> RAM (156GB JVM HEAP) on our 10 Node C* 3.11.1 Cluster.
> During bootstrap, when I watch the cassandra.log I observe a growth in JVM 
> Heap Old Gen which gets not significantly freed any more.
> I know that JVM collects on Old Gen only when really needed. I can see 
> collections, but there is always a remainder which
> seems to grow forever without ever getting freed.
> After the Node successfully Joined the Cluster, I can remove the extra 128GB 
> of RAM I have given it for bootstrapping without any further effect.
> 
> It feels like Cassandra will not forget about every single byte streamed over 
> the Network over time during bootstrapping, - which would be a memory leak 
> and a major problem, too.
> 
> Or is there something I'm doing wrong? - Any Ideas?
> 
> Here my observations on a failing Bootstrap - The following Node has 72GB RAM 
> installed, 64GB of it are configured for JVM Heap Space.
> 
> cassandra.log (truncated):
> INFO  [Service Thread] 2018-02-07 11:12:49,604 GCInspector.java:284 - G1 
> Young Generation GC in 984ms.  G1 Eden Space: 14763950080 -> 0; G1 Old Gen: 
> 36960206856 -> 39661338640; G1 Survivor Space: 2785017856 -> 1476395008;
> INFO  [Service Thread] 2018-02-07 11:13:00,108 GCInspector.java:284 - G1 
> Young Generation GC in 784ms.  G1 Eden Space: 18387828736 
> <tel:(838)%20782-8736> -> 0; G1 Old Gen: 39661338640 -> 41053847560; G1 
> Survivor Space: 1476395008 -> 1845493760;
> INFO  [Service Thread] 2018-02-07 11:13:08,639 GCInspector.java:284 - G1 
> Young Generation GC in 718ms.  G1 Eden Space: 16743661568 -> 0; G1 Old Gen: 
> 41053847560 -> 42832232472; G1 Survivor Space: 1845493760 -> 1375731712;
> INFO  [Service Thread] 2018-02-07 11:13:18,271 GCInspector.java:284 - G1 
> Young Generation GC in 546ms.  G1 Eden Space: 15535702016 -> 0; G1 Old Gen: 
> 42831004832 -> 44206736544; G1 Survivor Space: 1375731712 -> 1006632960;
> INFO  [Service Thread] 2018-02-07 11:13:35,364 GCInspector.java:284 - G1 
> Young Generation GC in 638ms.  G1 Eden Space: 14025752576 
> <tel:(402)%20575-2576> -> 0; G1 Old Gen: 44206737048 -> 45213369488; G1 
> Survivor Space: 1778384896 -> 1610612736;
> INFO  [Service Thread] 2018-02-07 11:13:42,898 GCInspector.java:284 - G1 
> Young Generation GC in 614ms.  G1 Eden Space: 13388218368 -> 0; G1 Old Gen: 
> 45213369488 -> 46152893584; G1 Survivor Space: 1610612736 -> 1006632960;
> INFO  [Service Thread] 2018-02-07 11:13:58,291 GCInspector.java:284 - G1 
> Young Generation GC in 400ms.  G1 Eden Space: 13119782912 -> 0; G1 Old Gen: 
> 46136116376 -> 47171400848; G1 Survivor Space: 1275068416 -> 771751936;
> INFO  [Service Thread] 2018-02-07 11:14:23,071 GCInspector.java:284 - G1 
> Young Generation GC in 303ms.  G1 Eden Space: 11676942336 -> 0; G1 Old Gen: 
> 47710958232 -> 48239699096; G1 Survivor Space: 1207959552 -> 973078528;
> INFO  [Service Thread] 2018-02-07 11:14:46,157 GCInspector.java:284 - G1 
> Young Generation GC in 305ms.  G1 Eden Space: 11005853696 -> 0; G1 Old Gen: 
> 48903342232 -> 49289001104; G1 Survivor Space: 939524096 -> 973078528;
> INFO  [Service Thread] 2018-02-07 11:14:53,045 GCInspector.java:284 - G1 
> Young Generation GC in 380ms.  G1 Eden Space: 10569646080 -> 0; G1 Old Gen: 
> 49289001104 -> 49586732696; G1 Survivor Space: 973078528 -> 1308622848;
> INFO  [Service Thread] 2018-02-07 11:15:04,692 GCInspector.java:284 - G1 
> Young Generation GC in 360ms.  G1 Eden Space: 9294577664 
> <tel:(929)%20457-7664> -> 0; G1 Old Gen: 50671712912 -> 51269944472; G1 
> Survivor Space: 905969664 -> 805306368;
> WARN  [Service Thread] 2018-02-07 11:15:07,317 GCInspector.java:282 - G1 
> Young Generation GC in 1102ms.  G1 Eden Space: 2617245696 -> 0; G1 Old Gen: 
> 51269944472 -> 47310521496; G1 Survivor Space: 805306368 -> 301989888;
> ....
> INFO  [Service Thread] 2018-02-07 11:16:36,535 GCInspector.java:284 - G1 
> Young Generation GC in 377ms.  G1 Eden Space: 7683964928 -> 0; G1 Old Gen: 
> 51958433432 -> 52658554008; G1 Survivor Space: 1073741824 -> 1040187392;
> INFO  [Service Thread] 2018-02-07 11:16:41,756 GCInspector.java:284 - G1 
> Young Generation GC in 340ms.  G1 Eden Space: 7046430720 
> <tel:(704)%20643-0720> -> 0; G1 Old Gen: 52624999576 -> 53299987616; G1 
> Survivor Space: 1040187392 -> 805306368;
> WARN  [Service Thread] 2018-02-07 11:16:44,087 GCInspector.java:282 - G1 
> Young Generation GC in 1005ms.  G1 Eden Space: 2617245696 -> 0; G1 Old Gen: 
> 53299987616 -> 49659331752; G1 Survivor Space: 805306368 -> 436207616;
> ...
> INFO  [Service Thread] 2018-02-07 11:25:40,902 GCInspector.java:284 - G1 
> Young Generation GC in 254ms.  G1 Eden Space: 11475615744 -> 0; G1 Old Gen: 
> 48904357040 -> 48904357544; G1 Survivor Space: 704643072 -> 805306368;
> INFO  [Service Thread] 2018-02-07 11:26:11,424 GCInspector.java:284 - G1 
> Young Generation GC in 202ms.  G1 Eden Space: 11005853696 -> 0; G1 Old Gen: 
> 48904357544 -> 49321014960; G1 Survivor Space: 939524096 -> 536870912;
> WARN  [Service Thread] 2018-02-07 11:26:44,484 GCInspector.java:282 - G1 
> Young Generation GC in 1295ms.  G1 Eden Space: 2617245696 -> 0; G1 Old Gen: 
> 49321014960 -> 46255753384; G1 Survivor Space: 805306368 -> 402653184;
> ...
> INFO  [Service Thread] 2018-02-07 11:30:37,828 GCInspector.java:284 - G1 
> Young Generation GC in 958ms.  G1 Eden Space: 2785017856 -> 0; G1 Old Gen: 
> 51196393000 -> 50629766184; G1 Survivor Space: 637534208 -> 436207616;
> INFO  [Service Thread] 2018-02-07 11:30:45,036 GCInspector.java:284 - G1 
> Young Generation GC in 270ms.  G1 Eden Space: 10267656192 -> 0; G1 Old Gen: 
> 50629766184 -> 50626254144; G1 Survivor Space: 436207616 -> 738197504;
> INFO  [Service Thread] 2018-02-07 11:31:48,128 GCInspector.java:284 - G1 
> Young Generation GC in 984ms.  G1 Eden Space: 2617245696 -> 0; G1 Old Gen: 
> 51086410272 -> 50443965480; G1 Survivor Space: 805306368 -> 369098752;
> 
> 
> jvm.options as following (comments removed):
> ## Use the Hotspot garbage-first collector.
> -XX:+UseG1GC
> -XX:MaxGCPauseMillis=1000
> -XX:InitiatingHeapOccupancyPercent=70
> -XX:ParallelGCThreads=16
> -XX:ConcGCThreads=16
> 
> ### GC logging options -- uncomment to enable
> -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintPromotionFailure
> #-XX:PrintFLSStatistics=1
> #-Xloggc:/var/log/cassandra/gc.log
> -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=10
> -XX:GCLogFileSize=10M
> 
> I tried this with ParNewGC and ConcMarkSweepGC, too - and I have the same 
> behavior there, too.
> 
> From nodetool netstats I see that it wants to Stream about 55,9 GB. 
> After 1,5h of streaming with more than 10MB/s (about 54GB seen with dstat) 
> nodetool netstats displays that only 3,5GB of 55,9 GB already received.
> 
> uptime
>  11:30:52 up  1:36,  3 users,  load average: 106.01, 87.54, 66.01
> 
> nodetool netstats (truncated for better reading)
> Wed Feb  7 11:19:07 CET 2018
> Mode: JOINING
> Bootstrap 56d204d0-0be9-11e8-ae30-617216855b4a
>     /192.168.1.213 <http://192.168.1.213/> - Receiving 68 files, 
> 6.774.831.556 bytes total. Already received 3 files, 279.238.740 bytes total
>     /192.168.1.215 <http://192.168.1.215/> - Receiving 68 files, 
> 5.721.460.494 bytes total. Already received 4 files, 109.051.913 bytes total
>     /192.168.1.214 <http://192.168.1.214/> - Receiving 68 files, 
> 7.497.726.056 bytes total. Already received 4 files, 870.592.708 bytes total
>     /192.168.1.207 <http://192.168.1.207/> - Receiving 63 files, 
> 4.945.809.501 bytes total. Already received 4 files, 700.599.427 bytes total
>     /192.168.1.232 <http://192.168.1.232/> - Receiving 91 files, 
> 7.344.537.682 bytes total. Already received 3 files, 237.482.005 bytes total
>     /192.168.1.209 <http://192.168.1.209/> - Receiving 102 files, 
> 15.931.849.729 bytes total. Already received 3 files, 1.108.754.920 bytes 
> total
>     /192.168.1.231 <http://192.168.1.231/> - Receiving 92 files, 
> 7.927.882.516 bytes total. Already received 4 files, 269.514.936 bytes total
> 
> 
> nodetool status:
> Datacenter: main
> ================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens       Owns    Host ID                    
>            Rack
> UN  192.168.1.232  83,31 GiB  256          ?       
> 510a0068-ee2b-4d1f-9965-9e29602d2f8f  rack04
> UN  192.168.1.206  51,41 GiB  256          ?       
> a168b632-52e7-408a-ae7f-6ba6b9c55cea  rack01
> UN  192.168.1.207  57,66 GiB  256          ?       
> 7401ab8f-114d-41b4-801d-53a4b042de52  rack01
> UN  192.168.1.208  56,47 GiB  256          ?       
> 767980ef-52f2-4c21-8567-324fc1db274c  rack01...
> UJ  192.168.1.160  68,95 GiB  256          ?       
> a3a5a169-512f-4e1f-8c0b-419c828f23e1  rack02
> UN  192.168.1.209  94,27 GiB  256          ?       
> 8757cb4a-183e-4828-8212-7715b5563935  rack02
> UN  192.168.1.213  78,26 GiB  256          ?       
> b1e9481c-4ba2-4396-837a-84be35737fe7  rack05
> UN  192.168.1.214  80,66 GiB  256          ?       
> 457fc606-7002-49ad-8da5-309b92093acf  rack06
> UN  192.168.1.231  87,5 GiB   256          ?       
> 2017a9e8-3638-465e-bc4a-5e59e693fb49  rack03
> UN  192.168.1.215  86,97 GiB  256          ?       
> 5dfe4c35-8f8a-4305-824a-4610cec9411b  rack07
> 
> thanks, and kind regards
> Juergen
> 
> 
> 

Reply via email to