Hello,

We are experiencing stability issues in our Kafka architecture during
chunked file transfers, especially during load spikes or after broker
restarts. Here are the details:
------------------------------
📦 *Architecture*:

   - KafkaJS v2.2.4 is used as both *producer and consumer*.
   - Kafka cluster with *3 brokers*, connected to a *shared NAS*.
   - Files are *split into 500 KB chunks* and sent to a Kafka topic.
   - A dedicated consumer reassembles the files on the NAS.
   - Each file is assigned to the *topic/broker of the first chunk* to
     preserve order (see the sketch just below this list).
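
For illustration, a simplified sketch of the producer-side chunking logic
(function, topic, and header names are placeholders, not our exact code; it
assumes an already-connected KafkaJS producer):

const fs = require('fs')

const CHUNK_SIZE = 500 * 1024 // 500 KB per chunk

// Illustrative sketch: split a file into 500 KB chunks and send every chunk
// with the same message key, so all chunks of one file go to the same
// partition (and therefore the same broker) and arrive in order.
async function sendFileInChunks(producer, topic, filePath) {
  const data = fs.readFileSync(filePath)
  const totalChunks = Math.ceil(data.length / CHUNK_SIZE)

  for (let i = 0; i < totalChunks; i++) {
    const chunk = data.subarray(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE)
    await producer.send({
      topic,
      messages: [{
        key: filePath, // same key => same partition for the whole file
        value: chunk,
        headers: { chunkIndex: String(i), totalChunks: String(totalChunks) }
      }]
    })
  }
}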

------------------------------
❌ *Observed issues*:

   1. *"Request Produce(version: 7) timed out" errors during load spikes*:
      - For about *15 minutes*, KafkaJS producers fail with the error:
        *Request Produce(key: 0, version: 7) timed out*
      - This generally occurs *during traffic spikes*, typically when
        *multiple files are being sent simultaneously*.

   2. *Network behavior – TCP Window Full*:
      - During this period, *TCP Window Full* messages appear on the network.
      - No *CPU or RAM spikes* are observed during the blockage.
      - However, a significant *increase in resource usage (CPU/RAM)* happens
        *when the system recovers*, suggesting a *sudden backlog clearance*.

   3. *Recovery after the blockage*:
      - Connections reset.
      - A rollback seems to occur, then messages are processed quickly
        without errors.
      - The issue may reoccur at the next volume spike.

   4. *Amplified behavior and prolonged instability after a rolling restart*:
      - The problem is more frequent after a *rolling restart of the brokers*
        (spaced 20 to 30 minutes apart).
      - Instability can persist for *several days or even weeks* before
        decreasing.
      - This suggests a *desynchronization or prolonged delay in partition
        reassignment, metadata updates, or coordination*.

------------------------------
⚙️ *KafkaJS configuration*:

{
  requestTimeout: 30000, // 30 s
  retry: {
    initialRetryTime: 1000, // 1 s
    retries: 1500 // max value, but in practice we rarely exceed ~30 retries (~15 minutes)
  }
}
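
For reference, a minimal sketch of how this configuration is wired into the
KafkaJS client (broker addresses and client id below are placeholders):

const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'file-chunk-producer', // placeholder
  brokers: ['broker1:9092', 'broker2:9092', 'broker3:9092'], // placeholders
  requestTimeout: 30000, // 30 s
  retry: {
    initialRetryTime: 1000, // 1 s
    retries: 1500
  }
})

const producer = kafka.producer()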

------------------------------
❓ *Questions*:

   1. Is the "Request Produce(version: 7) timed out" error generally related
      to *broker congestion*, *network issues*, or *partition imbalance*?

   2. Do the TCP Window Full messages indicate *network or broker buffer
      saturation*? Are there Kafka logs or metrics you would recommend
      monitoring?

   3. Could assigning each file strictly to a single broker/topic lead to
      *local saturation*?

   4. Could the KafkaJS retry configuration (1500 max retries, but rarely
      more than ~30 retries in practice) *exacerbate the congestion*? Would a
      *progressive backoff strategy* be preferable? (A sketch of what we mean
      follows this list.)

   5. What are the *best KafkaJS practices* for chunked file flows:
      compression, batching, flush intervals, etc.?

   6. Can a *rolling restart* of the brokers cause *temporary metadata
      desynchronization* or *excessive client wait times*? And why might the
      instability last so long?
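
To make question 4 concrete, by "progressive backoff" we mean something along
these lines, using the standard KafkaJS retry options (values are only an
example, not a recommendation):

// Illustrative only: let the delay grow exponentially between attempts and
// stop retrying much sooner instead of blocking for ~15 minutes.
const retry = {
  initialRetryTime: 300, // first delay
  multiplier: 2,         // exponential growth between attempts
  factor: 0.2,           // randomization factor
  maxRetryTime: 30000,   // cap each delay at 30 s
  retries: 10            // give up earlier and surface the error
}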

Thank you in advance for your help. Any diagnostic insights or optimization
recommendations would be greatly appreciated.

Best regards,
Mathias VANHEMS
