I've started reproducing this issue with more statistics. I have not reached the worst-performance point yet, but some things are starting to become clearer:
The DataStreamer hashes the affinity key to a partition, maps the partition to a node, and fills a single buffer at a time for that node. A DataStreamer thread on the receiving node therefore gets a buffer's worth of requests grouped by the time of the addData() call, with no per-thread grouping by affinity key (as I had originally assumed).

The test I was running uses a large amount of data where the average number of keys per unique affinity key is 3, with some outliers up to 50K. One of the caches updated in the optimistic transaction in the StreamReceiver contains an object whose key is the affinity key and whose contents are the set of keys that share that affinity key. We expect some temporal locality for objects with the same affinity key.

We had a number of worker threads on a client node but only one data streamer, for which we had increased the buffer count. Once we understood how the data streamer actually works, we gave each worker its own DataStreamer, so that each worker could issue a flush without affecting the other workers. That, in turn, allowed us to use smaller batches per worker, which decreases the odds of temporal locality within any one batch.

So it seems we would get updates for the same affinity key on different data streamer threads, and those updates could conflict when modifying the common record. The more keys per affinity key, the more likely a conflict and the more data that needs to be saved. A flush operation could stall multiple workers, and the flush might be waiting on requests that are conflicting.

We chose OPTIMISTIC transactions for their deadlock-free behavior, not because we expected high contention. I do think this behavior points to something sub-optimal in the OPTIMISTIC lock implementation, because I see a dramatic decrease in throughput but not a dramatic increase in transaction restarts.
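
For reference, here is roughly what the per-worker streamer setup looks like after the change. This is only a minimal sketch under assumptions, not our actual code: the cache name "dataCache", the buffer size, and the AffinityIndexReceiver class (sketched at the end of this message) are placeholders.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.cache.affinity.AffinityKey;

public class StreamerWorker implements Runnable {
    private final Ignite ignite;
    private final Iterable<AffinityKey<String>> myKeys; // this worker's share of the input

    StreamerWorker(Ignite ignite, Iterable<AffinityKey<String>> myKeys) {
        this.ignite = ignite;
        this.myKeys = myKeys;
    }

    @Override public void run() {
        // Each worker owns its own streamer, so flush() only stalls this worker.
        try (IgniteDataStreamer<AffinityKey<String>, Object> streamer =
                 ignite.dataStreamer("dataCache")) {
            streamer.allowOverwrite(true);                  // our custom receiver does the cache writes
            streamer.perNodeBufferSize(256);                // smaller per-worker batches
            streamer.receiver(new AffinityIndexReceiver()); // receiver sketched at the end of this message

            for (AffinityKey<String> key : myKeys)
                streamer.addData(key, loadValue(key));

            streamer.flush(); // only this worker waits for its own buffers
        }
    }

    private Object loadValue(AffinityKey<String> key) {
        return key.key(); // placeholder payload
    }
}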

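And here is a minimal sketch of the receiver-side pattern I described: an OPTIMISTIC transaction that writes the streamed entry and adds its key to the shared set record keyed by the affinity key. Again this is illustrative only; the cache names, the SERIALIZABLE isolation level, and the retry-on-conflict loop are assumptions for the sketch, not a copy of our receiver.

import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.affinity.AffinityKey;
import org.apache.ignite.stream.StreamReceiver;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;
import org.apache.ignite.transactions.TransactionOptimisticException;

public class AffinityIndexReceiver implements StreamReceiver<AffinityKey<String>, Object> {
    @Override public void receive(IgniteCache<AffinityKey<String>, Object> cache,
        Collection<Map.Entry<AffinityKey<String>, Object>> entries) {

        Ignite ignite = cache.unwrap(Ignite.class);
        IgniteCache<String, Set<String>> idx = ignite.cache("indexCache");

        for (Map.Entry<AffinityKey<String>, Object> e : entries) {
            String affKey = e.getKey().affinityKey();

            // Retry on optimistic conflicts caused by other streamer threads
            // updating the same per-affinity-key record.
            while (true) {
                try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.OPTIMISTIC,
                    TransactionIsolation.SERIALIZABLE)) {

                    cache.put(e.getKey(), e.getValue());

                    // Read-modify-write of the shared record: the set of all keys
                    // that share this affinity key. This is the contention point.
                    Set<String> keys = idx.get(affKey);
                    if (keys == null)
                        keys = new HashSet<>();
                    keys.add(e.getKey().key());
                    idx.put(affKey, keys);

                    tx.commit();
                    break;
                }
                catch (TransactionOptimisticException conflict) {
                    // Another thread committed first; loop and retry.
                }
            }
        }
    }
}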