Hi Ryan,

Thanks for your quick reply. I've been trying to change a few settings today, including moving the executors from 1 to a different number. Also worth mentioning: the system I'm testing this with does not have a very high message input rate right now, so I wouldn't expect to need any special tuning. I'm at roughly 100 messages per minute, which is really not much.
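In case it helps anyone following along: my understanding is that this executor knob ultimately corresponds to what plain Storm exposes as "topology.acker.executors" (Config.setNumAckers). A minimal, purely illustrative Java sketch of that Storm setting, not the actual Metron flux wiring:

    import org.apache.storm.Config;

    public class AckerSettingSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            // Convenience setter for Storm's acker-executor count.
            conf.setNumAckers(1);
            // Equivalent, with the underlying key ("topology.acker.executors") spelled out.
            conf.put(Config.TOPOLOGY_ACKER_EXECUTORS, 1);
            System.out.println(conf);
        }
    }

As far as I know, plain Storm defaults this to the number of workers when it isn't set, and 0 turns acking off completely, which would match your warning below about a 0 value.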
After trying with the executors at a different value I can confirm the issue still exists. I also see quite a number of messages like this one:

Discarding stale fetch response for partition indexing-0 since its offset 2565827 does not match the expected offset 2565828

Regarding ackers, I was under the impression that acking is something slightly different from committing: you ack a message and you also commit it, but they are not exactly the same thing. Am I right? (I put a quick sketch of how I read the relevant spout settings at the bottom of this mail, below the quoted thread.)

Thanks

2017-07-31 19:40 GMT+02:00 Ryan Merriman <[email protected]>:

> Guillem,
>
> I think this ended up being caused by not having enough acker threads to keep up. This is controlled by the "topology.ackers.executors" Storm property that you will find in the indexing topology flux remote.yaml file. It is exposed in Ambari in the "elasticsearch-properties" property, which is itself a list of properties. Within that there is an "indexing.executors" property. If that is set to 0 it would definitely be a problem, and I think that may even be the default in 0.4.0. Try changing that to match the number of partitions dedicated to the indexing topic.
>
> You could also change the property directly in the flux file ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from the command line to verify this fixes it. If you do use this strategy to test, make sure you eventually make the change in Ambari so your changes don't get overridden on a restart. Changing this setting is confusing and there have been some recent commits that have addressed that, exposing "topology.ackers.executors" directly in Ambari in a dedicated indexing topology section.
>
> You might also want to check out the performance tuning guide we did recently: https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md. If my guess is wrong and it's not the acker thread setting, the answer is likely in there.
>
> Hope this helps. If you're still stuck, send us some more info and we'll try to help you figure it out.
>
> Ryan
>
> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <[email protected]> wrote:
>
>> Hi,
>>
>> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman discussed in May.
>>
>> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing topology stops indexing messages when hitting the 10,000 (10k) message mark. This is related, as previously found by Christian, to the Kafka strategy, and after further debugging I could track it down to the number of uncommitted offsets (maxUncommittedOffsets). This is specified in the Kafka spout, and I could confirm that by providing a higher or lower value (5k or 15k) the point at which the indexing stops is exactly that of maxUncommittedOffsets.
>>
>> I understand the workaround suggested (changing the strategy from UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as I would guess the topology shouldn't need a change on that parameter to properly ingest data without failing. What seems to happen is that by changing to LATEST the messages do successfully get committed to Kafka, while with the other, UNCOMMITTED_EARLIEST, at some point that might not happen.
>>
>> When I run the topology with 'LATEST' I usually see messages like this one on the Kafka spout (indexing topology):
>>
>> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka [{indexing-0=OffsetAndMetadata{offset=2307113, metadata='{topic-partition=indexing-0
>>
>> I do not see such messages on the Kafka spout when I have the issue and I'm running UNCOMMITTED_EARLIEST.
>>
>> Any suggestion on what may be the real source of the issue here? I did some tests before and it did not seem to be an issue on 0.3.0. Could this be something related to the new Metron Kafka code? Or maybe related to one of the PRs in Metron or Kafka? I saw one in Metron about duplicate enrichment messages (METRON-569) and a few in Kafka regarding issues with the committed offset, but most were for newer versions of Kafka than the one Metron is using.
>>
>> Thanks
>>
>
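P.S. For reference, this is roughly how I read the two spout settings being discussed (maxUncommittedOffsets and the first-poll strategy) in the 1.x storm-kafka-client API. It is only a minimal, standalone sketch of that client API as I understand it, not the actual Metron spout construction; the broker address is made up and the numbers are just the ones from this thread:

    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;

    public class IndexingSpoutSketch {
        public static KafkaSpout<String, String> buildSpout() {
            KafkaSpoutConfig<String, String> conf = KafkaSpoutConfig
                // Hypothetical broker address; "indexing" is the topic in question.
                .builder("kafkabroker:6667", "indexing")
                // Where the spout starts reading when it subscribes.
                .setFirstPollOffsetStrategy(
                    KafkaSpoutConfig.FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
                // Once this many emitted-but-not-yet-committed offsets pile up,
                // the spout stops polling Kafka (the roughly 10k stall I'm seeing).
                .setMaxUncommittedOffsets(10000)
                // How often offsets of acked tuples are committed back to Kafka
                // (illustrative value).
                .setOffsetCommitPeriodMs(30000)
                .build();
            return new KafkaSpout<>(conf);
        }
    }

If acks never make it back to the spout, nothing is ever committed, the uncommitted count only grows, and the spout stops exactly at maxUncommittedOffsets, which (if my reading is right) would also explain why I never see the "Offsets successfully committed" line in that case.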
