> I can reproduce this with a huge load using dsbulk, but still can't
> determine the cause of the problem.

Can you get a thread dump (jstack <pid>) when the system freezes? This
might be helpful to determine the cause of the freeze.
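
Once a dump is saved to a file, a quick count of threads per state can show whether most threads are blocked or parked. The following is only a minimal sketch: the function name, the sample text and the thread names in it are made up for illustration, and it assumes the usual `java.lang.Thread.State:` lines that HotSpot prints in a thread dump.

```python
import re
import sys
from collections import Counter

# HotSpot prints one "java.lang.Thread.State: <STATE>" line under each Java thread.
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)


def summarize_thread_states(dump_text):
    """Count Java threads by state in the text of a thread dump."""
    return Counter(STATE_RE.findall(dump_text))


# Hand-written sample in the thread dump layout; NOT taken from the affected cluster.
SAMPLE_DUMP = '''
"worker-1" #101 daemon prio=5 os_prio=0 waiting on condition
   java.lang.Thread.State: WAITING (parking)

"worker-2" #102 daemon prio=5 os_prio=0 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)

"worker-3" #103 daemon prio=5 os_prio=0 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)

"worker-4" #104 daemon prio=5 os_prio=0 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    # Read the dump file given on the command line, else fall back to the sample.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            text = f.read()
    else:
        text = SAMPLE_DUMP
    for state, count in summarize_thread_states(text).most_common():
        print(f"{state:15} {count}")
```

For example, save the dump with `jstack -l <pid> > dump.txt` and pass `dump.txt` as the first argument. Taking a few dumps several seconds apart and comparing the counts usually tells more than a single one.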

Also, can you reproduce this in a simpler environment (ccm + dsbulk)?

On Fri, 25 Feb 2022 at 07:03, Bowen Song <bo...@bso.ng> wrote:

> Okay, that ruled it out. Anything interesting in the GC logs? Was
> Cassandra stuck at a GC safepoint? You may need to enable the detailed
> GC logs to see these.
>
> On 25/02/2022 10:02, Azamat Hackimov wrote:
> > Hello!
> >
> > No, I have a directly attached NVMe disk, and there are no IO or network
> > issues.
> >
> > On Fri, 25 Feb 2022 at 12:50, Bowen Song <bo...@bso.ng> wrote:
> >> Do you have any network-based mountpoints, such as NFS or Samba? I have
> >> seen similar behaviour in other Java-based applications at a GC safepoint
> >> when a network-based filesystem loses its connection and reconnects.
> >>
> >> On 25/02/2022 06:09, Azamat Hackimov wrote:
> >>> Hello!
> >>>
> >>> I recently migrated Cassandra from 3.11.x to 4.0 and now get strange
> >>> freezes under heavy load. It looks like some nodes in a DC stop
> >>> responding and get DN status.
> >>> On an affected node I cannot check the status with `nodetool status` or
> >>> even restart Cassandra with `systemctl restart cassandra`. The only
> >>> viable method is to `kill -9` the hung process and start Cassandra
> >>> again. On 3.11.x there were no such problems.
> >>>
> >>> I have 2 DCs with 8 nodes each, deployed on good hardware servers
> >>> running CentOS 7 and Java 11, with slightly changed default settings
> >>> inherited from the 3.11.x installation.
> >>>
> >>> The problem appears randomly and I can't determine its source. The
> >>> last events I could trace in system.log and debug.log have nothing to
> >>> do with the hang. The service just stops responding and freezes. I can
> >>> reproduce this with a huge load using dsbulk, but still can't
> >>> determine the cause of the problem.
> >>>
> >>> Has anyone encountered a similar problem, and is there any way other
> >>> than rolling back to the previous version?
> >>>
> >>> Here is my config:
> >>>
> >>> cluster_name: 'mycluster'
> >>> num_tokens: '256'
> >>> allocate_tokens_for_local_replication_factor: 3
> >>> hinted_handoff_enabled: true
> >>> max_hint_window_in_ms: 10800000
> >>> hinted_handoff_throttle_in_kb: 1024
> >>> max_hints_delivery_threads: 2
> >>> hints_directory: /data/cassandra/hints
> >>> hints_flush_period_in_ms: 10000
> >>> max_hints_file_size_in_mb: 128
> >>> batchlog_replay_throttle_in_kb: 1024
> >>> authenticator: PasswordAuthenticator
> >>> authorizer: AllowAllAuthorizer
> >>> role_manager: CassandraRoleManager
> >>> network_authorizer: AllowAllNetworkAuthorizer
> >>> roles_validity_in_ms: 2000
> >>> permissions_validity_in_ms: 2000
> >>> credentials_validity_in_ms: 2000
> >>> partitioner: org.apache.cassandra.dht.Murmur3Partitioner
> >>> data_file_directories:
> >>>       - /data/cassandra/data
> >>> commitlog_directory: /data/cassandra/commitlog
> >>> cdc_enabled: false
> >>> disk_failure_policy: stop
> >>> commit_failure_policy: stop
> >>> prepared_statements_cache_size_mb:
> >>> key_cache_size_in_mb:
> >>> key_cache_save_period: 14400
> >>> row_cache_size_in_mb: 0
> >>> row_cache_save_period: 0
> >>> counter_cache_size_in_mb:
> >>> counter_cache_save_period: 7200
> >>> saved_caches_directory: /data/cassandra/saved_caches
> >>> commitlog_sync: periodic
> >>> commitlog_sync_period_in_ms: 10000
> >>> commitlog_segment_size_in_mb: 32
> >>> seed_provider:
> >>>       - class_name: org.apache.cassandra.locator.SimpleSeedProvider
> >>>         parameters:
> >>>             - seeds: 'node1-1,node1-4,node2-1,node2-4'
> >>> concurrent_reads: 32
> >>> concurrent_writes: 32
> >>> concurrent_counter_writes: 32
> >>> concurrent_materialized_view_writes: 32
> >>> file_cache_size_in_mb: '1024'
> >>> memtable_allocation_type: heap_buffers
> >>> index_summary_capacity_in_mb:
> >>> index_summary_resize_interval_in_minutes: 60
> >>> trickle_fsync: false
> >>> trickle_fsync_interval_in_kb: 10240
> >>> storage_port: 7000
> >>> ssl_storage_port: 7001
> >>> listen_address:
> >>> start_native_transport: true
> >>> native_transport_port: 9042
> >>> native_transport_allow_older_protocols: true
> >>> rpc_address:
> >>> rpc_keepalive: true
> >>> incremental_backups: false
> >>> snapshot_before_compaction: false
> >>> auto_snapshot: true
> >>> snapshot_links_per_second: 0
> >>> column_index_size_in_kb: 64
> >>> column_index_cache_size_in_kb: 2
> >>> concurrent_compactors: 5
> >>> concurrent_materialized_view_builders: 1
> >>> compaction_throughput_mb_per_sec: 200
> >>> sstable_preemptive_open_interval_in_mb: 50
> >>> read_request_timeout_in_ms: 5000
> >>> range_request_timeout_in_ms: 10000
> >>> write_request_timeout_in_ms: 2000
> >>> counter_write_request_timeout_in_ms: 5000
> >>> cas_contention_timeout_in_ms: 1000
> >>> truncate_request_timeout_in_ms: 60000
> >>> request_timeout_in_ms: 10000
> >>> slow_query_log_timeout_in_ms: 500
> >>> endpoint_snitch: GossipingPropertyFileSnitch
> >>> dynamic_snitch_update_interval_in_ms: 100
> >>> dynamic_snitch_reset_interval_in_ms: 600000
> >>> dynamic_snitch_badness_threshold: 1.0
> >>> server_encryption_options:
> >>>       internode_encryption: none
> >>>       enable_legacy_ssl_storage_port: false
> >>>       keystore: conf/.keystore
> >>>       keystore_password: cassandra
> >>>       require_client_auth: false
> >>>       truststore: conf/.truststore
> >>>       truststore_password: cassandra
> >>>       require_endpoint_verification: false
> >>> client_encryption_options:
> >>>       enabled: false
> >>>       keystore: conf/.keystore
> >>>       keystore_password: cassandra
> >>>       require_client_auth: false
> >>> internode_compression: dc
> >>> inter_dc_tcp_nodelay: false
> >>> tracetype_query_ttl: 86400
> >>> tracetype_repair_ttl: 604800
> >>> enable_user_defined_functions: false
> >>> enable_scripted_user_defined_functions: false
> >>> windows_timer_interval: 1
> >>> transparent_data_encryption_options:
> >>>       enabled: false
> >>>       chunk_length_kb: 64
> >>>       cipher: AES/CBC/PKCS5Padding
> >>>       key_alias: testing:1
> >>>       key_provider:
> >>>         - class_name: org.apache.cassandra.security.JKSKeyProvider
> >>>           parameters:
> >>>             - keystore: conf/.keystore
> >>>               keystore_password: cassandra
> >>>               store_type: JCEKS
> >>>               key_password: cassandra
> >>> tombstone_warn_threshold: 1000
> >>> tombstone_failure_threshold: 100000
> >>> replica_filtering_protection:
> >>>       cached_rows_warn_threshold: 2000
> >>>       cached_rows_fail_threshold: 32000
> >>> batch_size_warn_threshold_in_kb: 5
> >>> batch_size_fail_threshold_in_kb: 50
> >>> unlogged_batch_across_partitions_warn_threshold: 10
> >>> compaction_large_partition_warning_threshold_mb: 100
> >>>
> >>> audit_logging_options:
> >>>       enabled: true
> >>>       logger:
> >>>         - class_name: BinAuditLogger
> >>>       excluded_categories: DML,QUERY,PREPARE
> >>>       max_log_size: 1073741824
> >>>
> >>> diagnostic_events_enabled: false
> >>> repaired_data_tracking_for_range_reads_enabled: false
> >>> repaired_data_tracking_for_partition_reads_enabled: false
> >>> report_unconfirmed_repaired_data_mismatches: false
> >>>
> >>> enable_materialized_views: true
> >>> enable_sasi_indexes: false
> >>> enable_transient_replication: false
> >>> enable_drop_compact_storage: false
> >>>
> >
> >
>
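
On the GC safepoint question earlier in the thread: on Java 11 the safepoint messages can be enabled with the `-Xlog:safepoint` JVM option. If the log contains the "Total time for which application threads were stopped" lines, a small script can list the long pauses. This is only a sketch under that assumption. The line format differs between JDK versions, and the function name, the threshold and the sample lines are made up for illustration.

```python
import re

# Assumed line format (JDK 11 era unified logging); newer JDKs print it differently.
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(?P<stopped>\d+\.\d+) seconds, "
    r"Stopping threads took: (?P<reach>\d+\.\d+) seconds"
)


def long_pauses(log_text, threshold_s=1.0):
    """Return (stopped_seconds, stopping_threads_seconds) for pauses >= threshold_s."""
    pauses = []
    for m in PAUSE_RE.finditer(log_text):
        stopped = float(m.group("stopped"))
        if stopped >= threshold_s:
            pauses.append((stopped, float(m.group("reach"))))
    return pauses


# Hand-written sample lines; NOT taken from the affected cluster.
SAMPLE_LOG = """\
[10.001s][info][safepoint] Total time for which application threads were stopped: 0.0004210 seconds, Stopping threads took: 0.0000310 seconds
[95.500s][info][safepoint] Total time for which application threads were stopped: 12.3000000 seconds, Stopping threads took: 12.2500000 seconds
"""

if __name__ == "__main__":
    for stopped, reach in long_pauses(SAMPLE_LOG):
        print(f"stopped {stopped:.3f}s, of which reaching the safepoint took {reach:.3f}s")
```

If nearly all of a long pause is spent in "Stopping threads took", the time went into bringing the threads to the safepoint rather than into the collection itself.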
