Hi Ryan,

Thanks for your quick reply. I've been trying to change a few settings
today, going from having the executors set to 1 to setting them to a
different number. Also worth mentioning: the system I'm testing this with
does not have a very high message input rate right now, so I wouldn't
expect to need any special tuning. I'm at roughly 100 messages per minute,
which is really not much.

After trying the executors at a different value, I can confirm the issue
still exists. I also see quite a number of messages like this one:

Discarding stale fetch response for partition indexing-0 since its offset
2565827 does not match the expected offset 2565828

Regarding ackers, I was under the impression that acking is something
slightly different from committing: a message gets acked and it also gets
committed, but the two are not exactly the same. Am I right?
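
To be concrete, this is roughly how I picture the two operations (only a
sketch, the class and method names here are mine, not Metron's):

import java.util.Collections;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.tuple.Tuple;

// Illustrative only: the two calls live at different layers.
public class AckVsCommit {

    // Storm level: a bolt acks a tuple to tell Storm the tuple was fully
    // processed, so the topology can stop tracking it.
    static void ackTuple(OutputCollector collector, Tuple tuple) {
        collector.ack(tuple);
    }

    // Kafka level: committing writes the consumer group's offset back to
    // Kafka, marking everything before that offset as processed.
    static void commitOffset(KafkaConsumer<String, String> consumer,
                             TopicPartition partition, long nextOffset) {
        consumer.commitSync(Collections.singletonMap(
                partition, new OffsetAndMetadata(nextOffset)));
    }
}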

Thanks

2017-07-31 19:40 GMT+02:00 Ryan Merriman <[email protected]>:

> Guillem,
>
> I think this ended up being caused by not having enough acker threads to
> keep up.  This is controlled by the "topology.acker.executors" Storm
> property that you will find in the indexing topology flux remote.yaml
> file.  It is exposed in Ambari in the "elasticsearch-properties" property
> which is itself a list of properties.  Within that there is an
> "indexing.executors" property.  If that is set to 0 it would definitely be
> a problem and I think that may even be the default in 0.4.0.  Try changing
> that to match the number of partitions dedicated to the indexing topic.
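>
> To be clear about what that knob is: it is the number of Storm acker
> executors for the topology.  If you were wiring the topology config by
> hand, it would be the equivalent of something like this (illustrative
> only, not the actual Metron flux wiring):
>
> import org.apache.storm.Config;
>
> // Sketch: the acker count is plain Storm configuration; the suggestion
> // above is to match it to the partition count of the indexing topic.
> public class AckerConfigSketch {
>     public static Config withAckers(int numIndexingPartitions) {
>         Config conf = new Config();
>         // e.g. one acker executor per indexing-topic partition
>         conf.setNumAckers(numIndexingPartitions);
>         return conf;
>     }
> }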
>
> You could also change the property directly in the flux file
> ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from
> the command line to verify this fixes it.  If you do use this strategy to
> test, make sure you eventually make the change in Ambari so your changes
> don't get overridden on a restart.  Changing this setting is confusing and
> there have been some recent commits that have addressed that, exposing
> "topology.ackers.executors" directly in Ambari in a dedicated indexing
> topology section.
>
> You might want to also check out the performance tuning guide we did
> recently:
> https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md
> If my guess is wrong and it's not the acker thread setting, the answer is
> likely in there.
>
> Hope this helps.  If you're still stuck send us some more info and we'll
> try to help you figure it out.
>
> Ryan
>
> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <[email protected]>
> wrote:
>
>> Hi,
>>
>> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman
>> discussed in May.
>>
>> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing topology
>> stops indexing messages when hitting the 10,000 (10k) message mark. This
>> is related, as previously found by Christian, to the Kafka spout offset
>> strategy, and after further debugging I could track it down to the number
>> of uncommitted offsets (maxUncommittedOffsets). This is specified in the
>> Kafka spout, and I could confirm it: with a higher or lower value (5k or
>> 15k), the point at which the indexing stops is exactly
>> maxUncommittedOffsets.
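>>
>> For reference, this is the storm-kafka-client setting I mean; the broker,
>> topic and value below are just placeholders, not how Metron actually
>> builds its spout:
>>
>> import org.apache.storm.kafka.spout.KafkaSpoutConfig;
>>
>> // Sketch only: shows where maxUncommittedOffsets sits on the spout config.
>> public class SpoutConfigSketch {
>>     public static KafkaSpoutConfig<String, String> build() {
>>         return KafkaSpoutConfig.builder("broker1:6667", "indexing")
>>                 // once this many polled offsets are pending commit, the
>>                 // spout stops emitting new tuples, which matches the point
>>                 // where my indexing stalls
>>                 .setMaxUncommittedOffsets(10000)
>>                 .build();
>>     }
>> }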
>>
>> I understand the suggested change (switching the strategy from
>> UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as
>> I would guess the topology shouldn't need that parameter changed in order
>> to ingest data without failing. What seems to happen is that with LATEST
>> the offsets do get successfully committed to Kafka, while with
>> UNCOMMITTED_EARLIEST, at some point that stops happening.
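>>
>> In spout-config terms, the two strategies I'm comparing would be selected
>> roughly like this (same caveat as above, just a sketch):
>>
>> import org.apache.storm.kafka.spout.KafkaSpoutConfig;
>> import org.apache.storm.kafka.spout.KafkaSpoutConfig.FirstPollOffsetStrategy;
>>
>> public class OffsetStrategySketch {
>>     // UNCOMMITTED_EARLIEST: resume from the last committed offset, or from
>>     // the earliest offset if nothing has been committed yet.
>>     // LATEST: start from the end of the partition, regardless of any
>>     // previously committed offset.
>>     public static KafkaSpoutConfig<String, String> build(
>>             FirstPollOffsetStrategy strategy) {
>>         return KafkaSpoutConfig.builder("broker1:6667", "indexing")
>>                 .setFirstPollOffsetStrategy(strategy)
>>                 .build();
>>     }
>> }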
>>
>> When I run the topology with 'LATEST' I usually see messages like this
>> one on the Kafka Spout (indexing topology):
>>
>> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka
>> [{indexing-0=OffsetAndMetadata{offset=2307113,
>> metadata='{topic-partition=indexing-0
>>
>> I do not see such messages from the Kafka spout when I have the issue and
>> I'm running UNCOMMITTED_EARLIEST.
>>
>> Any suggestion on what the real source of the issue may be here? I did
>> some tests before and it did not seem to be an issue on 0.3.0. Could this
>> be something related to the new Metron Kafka code? Or maybe to one of the
>> PRs in Metron or Kafka? I saw one in Metron about duplicate enrichment
>> messages (METRON-569) and a few in Kafka regarding issues with the
>> committed offset, but most of those were for newer versions of Kafka than
>> Metron is using.
>>
>> Thanks
>>
>
>
