I guess both application performance and ZooKeeper can have performance issues, since stats are recorded on the critical path, and they make the heartbeat messages bigger, which means more ZK write load.
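To make the trade-off concrete, here is a minimal sketch of moving stats recording off the critical path with a queue and a background aggregation thread. This is a hypothetical illustration, not Storm code: the class name, methods, and the drop-on-overflow policy are all assumptions for the sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch (not Storm code): the task does only an O(1) enqueue on
// the hot path; a single background thread drains the queue and aggregates.
public class AsyncMetricsSketch {
    private final BlockingQueue<Long> samples = new ArrayBlockingQueue<>(1024);
    private long count, sum;           // touched only by the drainer thread
    private volatile boolean running = true;
    private Thread drainer;

    // Called from the critical path: no locks, no math; drop on overflow
    // rather than block the task.
    public void record(long latencyMs) {
        samples.offer(latencyMs);
    }

    public void start() {
        drainer = new Thread(() -> {
            // Keep draining until asked to stop AND the queue is empty.
            while (running || !samples.isEmpty()) {
                Long v = samples.poll();
                if (v == null) { Thread.yield(); continue; }
                count++;
                sum += v;
            }
        });
        drainer.start();
    }

    public void stopAndWait() throws InterruptedException {
        running = false;
        drainer.join();   // drainer empties the queue before exiting
    }

    public double average() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncMetricsSketch m = new AsyncMetricsSketch();
        m.start();
        for (long i = 1; i <= 100; i++) m.record(i); // latencies 1..100 ms
        m.stopAndWait();
        System.out.println("avg=" + m.average());    // prints avg=50.5
    }
}
```

Whether this beats per-task synchronous recording in practice would still need the experiment discussed below: the enqueue itself has a cost, and a bounded queue forces a choice between dropping samples and blocking.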
I thought about asynchronous metrics recording, but it would have to enqueue a record task to a queue and have a background thread dequeue and aggregate, so I'm wondering whether it would still be faster than letting each task record metrics by itself. If anyone tries an experiment on this, that would be really nice.

On Tue, May 10, 2016 at 12:05 AM, Abhishek Agarwal <[email protected]> wrote:

> Adam,
> The performance issue raised by Lim in #2 is not about the application
> performance, but the ZooKeeper these metrics are being written to.
> ZooKeeper doesn't handle heavy, frequent writes well. This problem will
> become more apparent in larger clusters.
>
> On Mon, May 9, 2016 at 7:22 PM, Adam Meyerowitz (BLOOMBERG/ 731 LEX) <
> [email protected]> wrote:
>
>> Jungtaek, thanks for the follow-up response.
>>
>> For #1, having this in the Storm UI would be very nice and, I think, of
>> general interest to anyone who is tasked with maintaining Storm
>> deployments, and certainly during development for capacity and stress
>> testing. I'm not sure what it takes to get it into the UI, but it sounds
>> like a good change.
>>
>> For #2, having metrics reporting impact the realtime system is not
>> ideal. Again, I'm not sure how this is all implemented or what challenges
>> are involved, so it's easy for me to say this, but it seems periodic
>> reporting of aggregated stats, done by each task itself in a separate
>> thread, would be sufficient and hopefully would not impact performance.
>> That aggregation could include the things we are interested in, such as
>> min/max/average, percentiles, all that good stuff.
>>
>> From: [email protected] At: May 8 2016 23:01:12
>> To: Adam Meyerowitz (BLOOMBERG/ 731 LEX) <[email protected]>,
>> [email protected]
>>
>> Subject: Re: [DISCUSS] Would like to make collective intelligence about
>> Metrics on Storm
>>
>> Hi Adam,
>>
>> Thanks for the great input! Let me share my thoughts on two things.
>>
>> 1.
>> There are metrics for the disruptor queues, so if you attach a metrics
>> consumer they will be provided to it. Sojourn time for the queue is also
>> provided (kudos to Li Wang), but it's based on queueing theory and has
>> one precondition, so sometimes its value seems unstable (especially for
>> problematic tasks).
>>
>> 2. Agreed. There could be latency SLAs for a specific topology, and then
>> we would really want to see outliers and percentiles, too. Since
>> providing them may affect performance, we should address them with care.
>> I believe we will eventually provide various kinds of latency
>> information. Stay tuned.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Fri, May 6, 2016 at 10:29 PM, Adam Meyerowitz (BLOOMBERG/ 731 LEX) <
>> [email protected]> wrote:
>>
>>> I recall seeing in another thread a discussion about monitoring metrics
>>> for various queues within a worker. For us this would be pretty key for
>>> each executor's input and output LMAX queue, as well as the worker-level
>>> input and output queues. In our topologies we run one task per executor,
>>> so it would help us get a much better understanding of the performance
>>> of our components. If acking is turned off, which it is for our
>>> topologies, it's hard to get the full picture of the performance of the
>>> various components we have. The execute and process latency only tell
>>> part of a larger story. For the queues, generally we would like to see
>>> queue utilization and how long tuples stayed on the queue.
>>>
>>> Also, generally we would like more than the average. For example,
>>> min/max/average/standard deviation, percentiles, whatever. The average
>>> definitely smooths the bumps, which is good, but we'd gain more insight
>>> into outliers and the larger performance picture.
>>>
>>> From: [email protected] At: Apr 20 2016 00:30:05
>>> To: [email protected]
>>> Subject: Re: [DISCUSS] Would like to make collective intelligence about
>>> Metrics on Storm
>>>
>>> Let me start by sharing my thoughts.
>>> :)
>>>
>>> 1. We need to enrich the docs about metrics / stats.
>>>
>>> In fact, I couldn't learn from the docs that topology stats are sampled
>>> by default, with a sample rate of 0.05, when I was a newbie to Apache
>>> Storm. It misled me and had me asking, "Why is there a difference
>>> between the counts?" I also saw some mails on user@ asking the same
>>> question. It would be better if we included this in the guide docs.
>>>
>>> Also, the Metrics documentation page
>>> <http://storm.apache.org/releases/1.0.0/Metrics.html> doesn't seem well
>>> written. I think it has appropriate headings but lacks content under
>>> each heading. That should be addressed, and introducing some external
>>> metrics consumer plugins (like storm-graphite
>>> <https://github.com/verisign/storm-graphite> from Verisign) would be
>>> great, too.
>>>
>>> 2. We need to increase the sample rate or (ideally) do no sampling at
>>> all.
>>>
>>> Let's postpone considering the performance hit for now.
>>> Ideally, we expect the precision of metrics to improve as we increase
>>> the sample rate. This affects the non-gauge kinds of metrics, such as
>>> counters, latencies, and so on.
>>>
>>> By the way, I would like to hear opinions on latency, since I'm not an
>>> expert. Storm provides only average latency, and it's indeed based on
>>> the sample rate. Are we OK with this? If not, how much would also
>>> having percentiles help us?
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Wed, Apr 20, 2016 at 10:55 AM, Jungtaek Lim <[email protected]> wrote:
>>>
>>>> Hi Storm users,
>>>>
>>>> I'm Jungtaek Lim, committer and PMC member of Apache Storm.
>>>>
>>>> If you subscribe to the dev@ mailing list, you may have seen that we
>>>> have recently been addressing the metrics feature in Apache Storm.
>>>>
>>>> For now, improvements are going forward based on the current metrics
>>>> feature.
>>>>
>>>> - Improve (Topology) MetricsConsumer
>>>> <https://issues.apache.org/jira/browse/STORM-1699>
>>>> - Provide topology metrics in detail (metrics per each stream)
>>>> <https://issues.apache.org/jira/browse/STORM-1719>
>>>> - (WIP) Introduce Cluster Metrics Consumer
>>>>
>>>> As I don't maintain a large cluster myself, I really want to collect
>>>> ideas for improvement, any inconveniences, and use cases of metrics
>>>> from community members, so that we're on the right track going forward.
>>>>
>>>> Let's talk!
>>>>
>>>> Thanks in advance,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>
>>
>
>
> --
> Regards,
> Abhishek Agarwal
>
