Re: "Out of memory during read" errors instead of key eviction

2022-08-25 Thread dormando
Took another quick look...

Think there's an easy patch that might work:
https://github.com/memcached/memcached/pull/924

Would you mind helping validate it? External validation would help me land
it in time for the next release :)

Thanks,
-Dormando

On Wed, 24 Aug 2022, dormando wrote:

> [quoted message snipped; the full text appears as its own entry below]

Re: "Out of memory during read" errors instead of key eviction

2022-08-25 Thread dormando
Hey,

Thanks for the info. Yes; this generally confirms the issue. I see some of
your higher slab classes with "free_chunks 0", so if you're setting data
that requires chunks from those classes, it can error out. The "stats items"
output confirms this, since there are no actual items in those lower slab
classes.
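
If you want a quick way to spot that condition, something like the
following would flag the pinned classes. Rough, untested sketch: I'm
assuming a pymemcache-style client since your capture looks like Python,
and the host/port are placeholders.

    # Sketch: flag slab classes that have run out of free chunks. A chunked
    # set that needs memory from one of these classes is what errors out.
    from pymemcache.client.base import Client

    client = Client(("localhost", 11211))

    slabs = client.stats("slabs")
    for key, value in slabs.items():
        # keys may come back as bytes depending on the client version
        name = key.decode() if isinstance(key, bytes) else key
        if name.endswith(":free_chunks") and int(value) == 0:
            class_id = name.split(":")[0]
            print(f"slab class {class_id} has free_chunks 0")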

You're certainly right that a workaround of keeping your items < 512k would
also work; but in general, if I ship features it'd be nice if they worked
well :) Please open an issue so we can improve things!

I've been intending to lower the slab_chunk_max default from 512k to
something much lower, as that actually improves memory efficiency a bit
(less of a gap at the higher classes). That may help here. The system
should also try evicting items from the highest LRU... I need to
double-check whether it was already supposed to do that and is failing.

I might also be able to adjust the page mover, but I'm not sure. It could
probably be made to keep one page in reserve, but I think the algorithm
isn't expecting slab classes with no items in them, so I'd have to audit
that too.

If you're up for experiments, it'd be interesting to know whether setting
"-o slab_chunk_max=32768" or 16k (probably not more than 64k) makes things
better or worse.
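
If it helps, here's a throwaway harness for that experiment (sketch only;
the port and memory numbers are arbitrary, and it assumes memcached and
pymemcache are installed locally):

    # Start a second memcached with a smaller chunk max and confirm the
    # setting took via "stats settings".
    import subprocess
    import time

    from pymemcache.client.base import Client

    proc = subprocess.Popen(
        ["memcached", "-p", "11212", "-m", "64", "-o", "slab_chunk_max=32768"]
    )
    time.sleep(0.5)  # give the daemon a moment to start listening
    try:
        client = Client(("localhost", 11212))
        settings = client.stats("settings")
        # keys may come back as bytes depending on the client version
        value = settings.get(b"slab_chunk_max", settings.get("slab_chunk_max"))
        print("effective slab_chunk_max:", value)
    finally:
        proc.terminate()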

Also, crud... it's documented as taking kilobytes, but that doesn't seem to
be working somehow? aaahahah. I guess the big EXPERIMENTAL tag scared people
off, since that never got reported.

I'm guessing most people have a mix of small and large items, but you have
only large items and a relatively low memory limit, which is why you're
hitting this so easily. I think most people storing large items have 30G+
of memory, so things end up more spread around.

Thanks,
-Dormando

On Wed, 24 Aug 2022, Hayden wrote:

> [quoted message snipped; the full text appears as its own entry below]

Re: "Out of memory during read" errors instead of key eviction

2022-08-25 Thread Hayden
What you're saying makes sense, and I'm pretty sure it won't be too hard to 
add some functionality to my writing code to break my large items up into 
smaller parts that can each fit into a single chunk. That has the added 
benefit that I won't have to bother increasing the max item size.
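
For reference, something like this is what I have in mind (rough sketch
with pymemcache; the key scheme and part size are placeholders I made up):

    # Split a large value into parts that each fit in a single chunk, plus
    # a small manifest key holding the part count.
    from typing import Optional

    from pymemcache.client.base import Client

    PART_SIZE = 500 * 1024  # comfortably under the 512k chunk max, leaving
                            # room for the item header and key overhead

    client = Client(("localhost", 11211))

    def set_large(key: str, value: bytes) -> None:
        parts = [value[i:i + PART_SIZE]
                 for i in range(0, len(value), PART_SIZE)]
        for n, part in enumerate(parts):
            client.set(f"{key}:part:{n}", part)
        client.set(f"{key}:parts", str(len(parts)))  # manifest written last

    def get_large(key: str) -> Optional[bytes]:
        count = client.get(f"{key}:parts")
        if count is None:
            return None
        parts = [client.get(f"{key}:part:{n}") for n in range(int(count))]
        if any(p is None for p in parts):
            return None  # a part was evicted; treat the whole item as a miss
        return b"".join(parts)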

In the meantime, though, I reran my pipeline and captured the output of 
stats, stats slabs, and stats items both when evicting normally and when 
getting spammed with the error.

First, the output when I'm in the error state:
 Output of stats 
STAT pid 1
STAT uptime 11727
STAT time 1661406229
STAT version 1.6.14
STAT libevent 2.1.8-stable
STAT pointer_size 64
STAT rusage_user 2.93837
STAT rusage_system 6.339015
STAT max_connections 1024
STAT curr_connections 2
STAT total_connections 8230
STAT rejected_connections 0
STAT connection_structures 6
STAT response_obj_oom 0
STAT response_obj_count 1
STAT response_obj_bytes 65536
STAT read_buf_count 8
STAT read_buf_bytes 131072
STAT read_buf_bytes_free 49152
STAT read_buf_oom 0
STAT reserved_fds 20
STAT cmd_get 0
STAT cmd_set 12640
STAT cmd_flush 0
STAT cmd_touch 0
STAT cmd_meta 0
STAT get_hits 0
STAT get_misses 0
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT store_too_large 0
STAT store_no_memory 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 21755739959
STAT bytes_written 330909
STAT limit_maxbytes 5368709120
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 16
STAT hash_bytes 524288
STAT hash_is_expanding False
STAT slab_reassign_rescues 0
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 0
STAT slab_reassign_busy_items 0
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running False
STAT slabs_moved 0
STAT lru_crawler_running 0
STAT lru_crawler_starts 20
STAT lru_maintainer_juggles 71777
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT log_watchers 0
STAT unexpected_napi_ids 0
STAT round_robin_fallback 0
STAT bytes 5241499325
STAT curr_items 4211
STAT total_items 12640
STAT slab_global_page_pool 0
STAT expired_unfetched 0
STAT evicted_unfetched 8429
STAT evicted_active 0
STAT evictions 8429
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 4212
STAT lrutail_reflocked 0
STAT moves_to_cold 11872
STAT moves_to_warm 0
STAT moves_within_lru 0
STAT direct_reclaims 9
STAT lru_bumps_dropped 0
END
 Output of stats slabs
STAT 2:chunk_size 120
STAT 2:chunks_per_page 8738
STAT 2:total_pages 1
STAT 2:total_chunks 8738
STAT 2:used_chunks 4211
STAT 2:free_chunks 4527
STAT 2:free_chunks_end 0
STAT 2:get_hits 0
STAT 2:cmd_set 0
STAT 2:delete_hits 0
STAT 2:incr_hits 0
STAT 2:decr_hits 0
STAT 2:cas_hits 0
STAT 2:cas_badval 0
STAT 2:touch_hits 0
STAT 30:chunk_size 66232
STAT 30:chunks_per_page 15
STAT 30:total_pages 1
STAT 30:total_chunks 15
STAT 30:used_chunks 3
STAT 30:free_chunks 12
STAT 30:free_chunks_end 0
STAT 30:get_hits 0
STAT 30:cmd_set 0
STAT 30:delete_hits 0
STAT 30:incr_hits 0
STAT 30:decr_hits 0
STAT 30:cas_hits 0
STAT 30:cas_badval 0
STAT 30:touch_hits 0
STAT 31:chunk_size 82792
STAT 31:chunks_per_page 12
STAT 31:total_pages 1
STAT 31:total_chunks 12
STAT 31:used_chunks 6
STAT 31:free_chunks 6
STAT 31:free_chunks_end 0
STAT 31:get_hits 0
STAT 31:cmd_set 0
STAT 31:delete_hits 0
STAT 31:incr_hits 0
STAT 31:decr_hits 0
STAT 31:cas_hits 0
STAT 31:cas_badval 0
STAT 31:touch_hits 0
STAT 32:chunk_size 103496
STAT 32:chunks_per_page 10
STAT 32:total_pages 19
STAT 32:total_chunks 190
STAT 32:used_chunks 183
STAT 32:free_chunks 7
STAT 32:free_chunks_end 0
STAT 32:get_hits 0
STAT 32:cmd_set 0
STAT 32:delete_hits 0
STAT 32:incr_hits 0
STAT 32:decr_hits 0
STAT 32:cas_hits 0
STAT 32:cas_badval 0
STAT 32:touch_hits 0
STAT 33:chunk_size 129376
STAT 33:chunks_per_page 8
STAT 33:total_pages 50
STAT 33:total_chunks 400
STAT 33:used_chunks 393
STAT 33:free_chunks 7
STAT 33:free_chunks_end 0
STAT 33:get_hits 0
STAT 33:cmd_set 0
STAT 33:delete_hits 0
STAT 33:incr_hits 0
STAT 33:decr_hits 0
STAT 33:cas_hits 0
STAT 33:cas_badval 0
STAT 33:touch_hits 0
STAT 34:chunk_size 161720
STAT 34:chunks_per_page 6
STAT 34:total_pages 41
STAT 34:total_chunks 246
STAT 34:used_chunks 245
STAT 34:free_chunks 1
STAT 34:free_chunks_end 0
STAT 34:get_hits 0
STAT 34:cmd_set 0
STAT 34:delete_hits 0
STAT 34:incr_hits 0
STAT 34:decr_hits 0
STAT 34:cas_hits 0
STAT 34:cas_badval 0
STAT 34:touch_hits 0
STAT 35:chunk_size 202152
STAT 35:chunks_per_page 5
STAT 35:total_pages 231
STAT 35:total_chunks 1155
STAT 35:used_chunks 1155
STAT 35:free_chunks 0
STAT 35:free_chunks_end 0
STAT 35:get_hits 0
STAT 35:cmd_set 0
STAT 35:delete_hits 0
STAT 35:incr_hits 0
[remainder of output truncated]