[ 
https://issues.apache.org/jira/browse/KAFKA-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803506#comment-17803506
 ] 

Nicholas Telford commented on KAFKA-16086:
------------------------------------------

As discussed on Slack:

{{rocksdb::port::cacheline_aligned_alloc}} is called by {{StatisticsImpl}} 
_once per-core_ to allocate a block of memory for storing stats tickers. The 
size of this block of memory looks to be _at least_ 2112 bytes (enough to store 
199 tickers and 62 histograms, aligned to the cache line size).

For example, if the running machine has 16 cores, this would be 16 * 2112 = 
33792 bytes (33 KiB) per invocation.
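
As a rough check of that arithmetic, here is a minimal Java sketch. The class 
and method names are hypothetical and exist only for illustration; the 2112-byte 
block size and the 16-core machine are the figures quoted above, and the real 
block may be larger since 2112 bytes is a lower bound.

```java
// Back-of-the-envelope estimate only; constants are the figures quoted above.
final class StatsBlockEstimate {
    // Minimum size in bytes of one per-core stats block (a lower bound).
    static final long BLOCK_BYTES = 2112L;

    // Bytes allocated for stats blocks on a machine with `cores` cores,
    // assuming exactly one block per core.
    static long bytesPerInstance(int cores) {
        return cores * BLOCK_BYTES;
    }

    public static void main(String[] args) {
        int cores = 16; // example core count used in the comment
        long bytes = bytesPerInstance(cores);
        // Prints: 33792 bytes = 33 KiB
        System.out.println(bytes + " bytes = " + (bytes / 1024) + " KiB");
    }
}
```

To estimate the figure for the machine the JVM is running on, 
{{Runtime.getRuntime().availableProcessors()}} could be passed in place of the 
fixed core count.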

Our temporary {{Options}} object is constructed from the global {{DBOptions}} 
object. This invokes the copy constructor of {{DBOptions}}, which copies the 
{{Statistics}} that was configured on {{DBOptions}}. Since we never 
{{close()}} the {{Options}}, this copied {{Statistics}} leaks.
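
The pattern can be illustrated with a small, self-contained Java model. This is 
only a sketch: {{FakeNativeOptions}} and {{LeakDemo}} are hypothetical stand-ins 
and not the RocksDB API, and the model assumes, following the description above, 
that each copy owns its own stats allocation until it is closed. A counter 
stands in for native memory.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a native-backed object; NOT the RocksDB API.
final class FakeNativeOptions implements AutoCloseable {
    // Simulated "native" bytes that have been allocated and not yet freed.
    static final AtomicLong OUTSTANDING = new AtomicLong();
    // Assumed size of the stats allocation: 16 cores * 2112 bytes.
    static final long STATS_BYTES = 16 * 2112L;

    private boolean closed;

    FakeNativeOptions() {
        OUTSTANDING.addAndGet(STATS_BYTES);
    }

    // Copy construction: in this model the copy gets its own allocation.
    FakeNativeOptions(FakeNativeOptions source) {
        this();
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            OUTSTANDING.addAndGet(-STATS_BYTES);
        }
    }
}

final class LeakDemo {
    // Leaky: the temporary copy is never closed, so its bytes stay outstanding.
    static void leakyUse(FakeNativeOptions global) {
        FakeNativeOptions temp = new FakeNativeOptions(global);
        // ... use temp, then drop the reference without calling close() ...
    }

    // Safe: try-with-resources closes the temporary copy on exit.
    static void safeUse(FakeNativeOptions global) {
        try (FakeNativeOptions temp = new FakeNativeOptions(global)) {
            // ... use temp ...
        }
    }

    public static void main(String[] args) {
        FakeNativeOptions global = new FakeNativeOptions();
        long baseline = FakeNativeOptions.OUTSTANDING.get();
        safeUse(global);
        System.out.println("after safeUse:  +" + (FakeNativeOptions.OUTSTANDING.get() - baseline));
        leakyUse(global);
        System.out.println("after leakyUse: +" + (FakeNativeOptions.OUTSTANDING.get() - baseline));
        global.close();
    }
}
```

In this model the outstanding count is unchanged after {{safeUse}} and grows by 
one stats allocation on every call to {{leakyUse}}, which mirrors the steady 
growth seen in a long-running process.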

> Kafka Streams has RocksDB native memory leak
> --------------------------------------------
>
>                 Key: KAFKA-16086
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16086
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.7.0
>            Reporter: Lucas Brutschy
>            Assignee: Nicholas Telford
>            Priority: Blocker
>              Labels: streams
>         Attachments: image.png
>
>
> The current 3.7 and trunk versions are leaking native memory while running 
> Kafka Streams over several hours. This will likely kill any real workload 
> over time, so this should be treated as a blocker bug for 3.7.
> This was discovered in a long-running soak test. Attached is a graph of the 
> memory consumption, which steadily approaches 100% until the JVM is killed.
> Rerunning the same test with jemalloc native memory profiling, we see these 
> allocated objects after a few hours:
>  
> {noformat}
> (jeprof) top
> Total: 13283138973 B
> 10296829713 77.5% 77.5% 10296829713 77.5% rocksdb::port::cacheline_aligned_alloc
> 2487325671 18.7% 96.2% 2487325671 18.7% rocksdb::BlockFetcher::ReadBlockContents
> 150937547 1.1% 97.4% 150937547 1.1% rocksdb::lru_cache::LRUHandleTable::LRUHandleTable
> 119591613 0.9% 98.3% 119591613 0.9% prof_backtrace_impl
> 47331433 0.4% 98.6% 105040933 0.8% rocksdb::BlockBasedTable::PutDataBlockToCache
> 32516797 0.2% 98.9% 32516797 0.2% rocksdb::Arena::AllocateNewBlock
> 29796095 0.2% 99.1% 30451535 0.2% Java_org_rocksdb_Options_newOptions
> 18172716 0.1% 99.2% 20008397 0.2% rocksdb::InternalStats::InternalStats
> 16032145 0.1% 99.4% 16032145 0.1% rocksdb::ColumnFamilyDescriptorJni::construct
> 12454120 0.1% 99.5% 12454120 0.1% std::_Rb_tree::_M_insert_unique
> {noformat}
>  
>  
> The first hypothesis is that this is caused by the leaking `Options` object 
> introduced in this line:
>  
> [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBStore.java#L312|https://github.com/apache/kafka/pull/14852]
>  
> Introduced in this PR: 
> [https://github.com/apache/kafka/pull/14852|https://github.com/apache/kafka/pull/14852]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
