Hi, we are running a streaming job that processes about 500 events per 20s batches and uses updateStateByKey to accumulate Web sessions (with a 30 Minute live time).
The checkpoint intervall is set to 20xBatchInterval, that is 400s. Cluster size is 8 nodes. We are having trouble with the amount of files and directories created on the shared file system (GlusterFS) - there are about 100 new directories per second. Is that the expected magnitude of number of created directories? Or should we expect something different? What might we be doing wrong? Can anyone share a pointer to material that explains the details of checkpointing? The checkpoint directories have UUIDs as names - ist that correct? Jan --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org