Hi Guillem,

Did you eventually fix the problem?
On 2017-08-01 11:00, Guillem Mateos wrote:
> In the elasticsearch.properties file, right now I have the following regarding workers and executors:
>
> ##### Storm #####
> indexing.workers=1
> indexing.executors=1
> topology.worker.childopts=
> topology.auto-credentials=[]
>
> Regarding the flux file for indexing, it is set exactly to what's on GitHub for Metron 0.4.0.
>
> Should I also double-check the enrichment topology? Could this be caused somehow by it?
>
> Thanks
>
> 2017-08-01 19:33 GMT+02:00 Ryan Merriman <[email protected]>:
>
> Yes, you are correct, they are separate concepts. Once a tuple's tree has been acked in Storm, meaning all the spouts/bolts that are required to ack a tuple have done so, it is then committed to Kafka in the form of an offset. If a tuple is not completely acked, it will never be committed to Kafka and the tuple will be replayed after a timeout. Eventually you'll have too many tuples in flight and the spout will stop emitting.
>
> I think the next step would be to review your configuration. The way the executor properties are named in Storm can be confusing, so it's probably best if you share your flux/property files.
>
> On Tue, Aug 1, 2017 at 12:01 PM, Guillem Mateos <[email protected]> wrote:
>
> Hi Ryan,
>
> Thanks for your quick reply. I've been trying to change a few settings today, including changing the executors from 1 to a different number. Also worth mentioning is that the system I'm testing this with does not have a very high message input rate right now, so I wouldn't expect to need any special tuning. I'm at roughly 100 messages per minute, which is really not much.
>
> After trying a different value for the executors, I can confirm the issue still exists. I also see quite a number of messages like this one:
>
> Discarding stale fetch response for partition indexing-0 since its offset 2565827 does not match the expected offset 2565828
>
> Regarding ackers, I was under the impression that acking is something slightly different from committing: you ack a message and you also commit it, but they are not exactly the same thing. Am I right?
>
> Thanks
>
> 2017-07-31 19:40 GMT+02:00 Ryan Merriman <[email protected]>:
>
> Guillem,
>
> I think this ended up being caused by not having enough acker threads to keep up. This is controlled by the "topology.ackers.executors" Storm property that you will find in the indexing topology flux remote.yaml file. It is exposed in Ambari in the "elasticsearch-properties" property, which is itself a list of properties. Within that there is an "indexing.executors" property. If that is set to 0 it would definitely be a problem, and I think that may even be the default in 0.4.0. Try changing it to match the number of partitions dedicated to the indexing topic.
>
> You could also change the property directly in the flux file ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from the command line to verify this fixes it. If you do use this strategy to test, make sure you eventually make the change in Ambari so your changes don't get overridden on a restart. Changing this setting is confusing and there have been some recent commits that address that, exposing "topology.ackers.executors" directly in Ambari in a dedicated indexing topology section.
>
> You might also want to check out the performance tuning guide we did recently: https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md [1]. If my guess is wrong and it's not the acker thread setting, the answer is likely in there.
>
> Hope this helps. If you're still stuck, send us some more info and we'll try to help you figure it out.
>
> Ryan
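For anyone else following this thread: the acker count Ryan refers to is the same knob that Storm's Config#setNumAckers sets, i.e. Config.TOPOLOGY_ACKER_EXECUTORS. Below is a minimal sketch using the plain Storm Java API, purely to illustrate the setting; Metron itself wires these values through the flux properties quoted above (indexing.workers / indexing.executors), so this is not Metron's code.

    import org.apache.storm.Config;

    public class AckerSettingSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            // Analogous to indexing.workers in the properties quoted above.
            conf.setNumWorkers(1);
            // Acker executor count (Config.TOPOLOGY_ACKER_EXECUTORS). Ryan's point:
            // if the ackers can't keep up, tuple trees complete slowly, acks lag,
            // and the spout's uncommitted offsets pile up.
            conf.setNumAckers(1);
            System.out.println(conf.get(Config.TOPOLOGY_ACKER_EXECUTORS));
        }
    }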
> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <[email protected]> wrote:
>
> Hi,
>
> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman discussed in May.
>
> I have a Metron 0.4.0 deployment on 10 nodes. The indexing topology stops indexing messages when it hits the 10,000 (10k) message mark. This is related, as Christian previously found, to the Kafka offset strategy, and after further debugging I could track it down to the number of uncommitted offsets (maxUncommittedOffsets). This is specified in the Kafka spout, and I could confirm it by providing a higher or lower value (5k or 15k): the point at which the indexing stops is exactly that of maxUncommittedOffsets.
>
> I understand the suggested workaround (changing the strategy from UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as I would guess the topology shouldn't need a change to that parameter to properly ingest data without failing. What seems to happen is that with LATEST the messages do successfully get committed to Kafka, while with UNCOMMITTED_EARLIEST, at some point, that might not happen.
>
> When I run the topology with LATEST I usually see messages like this one from the Kafka spout (indexing topology):
>
> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka [{indexing-0=OffsetAndMetadata{offset=2307113, metadata='{topic-partition=indexing-0
>
> I do not see such messages from the Kafka spout when I have the issue and I'm running UNCOMMITTED_EARLIEST.
>
> Any suggestion on what may be the real source of the issue here? I did some tests before and it did not seem to be an issue on 0.3.0. Could this be something related to the new Kafka code in Metron? Or maybe related to one of the PRs in Metron or Kafka? I saw one in Metron about duplicate enrichment messages (METRON-569) and a few in Kafka regarding issues with the committed offset, but most were for newer versions of Kafka than Metron is using.
>
> Thanks

Links:
------
[1] https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md
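For readers hitting the same stall: the two Kafka spout settings this thread revolves around are maxUncommittedOffsets and the first-poll offset strategy. Below is a minimal sketch of where they live, assuming the storm-kafka-client 1.x builder API; the broker address is a placeholder, and Metron builds its real spout through its own wrapper, so treat this only as an illustration of the two settings.

    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig.FirstPollOffsetStrategy;

    public class IndexingSpoutSettingsSketch {
        public static void main(String[] args) {
            KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
                    // Placeholder broker address; "indexing" is the topic from the thread.
                    .builder("kafka-broker:6667", "indexing")
                    // UNCOMMITTED_EARLIEST is the strategy that showed the problem;
                    // LATEST is the workaround mentioned above.
                    .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
                    // Once this many polled offsets are pending (emitted but never
                    // acked, hence never committed), the spout stops emitting --
                    // the ~10k mark Guillem observed.
                    .setMaxUncommittedOffsets(10000)
                    .build();
            System.out.println("maxUncommittedOffsets=" + spoutConfig.getMaxUncommittedOffsets());
        }
    }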
