James,

Thank you for the clarifications. Please find my responses inline.

On 7/7/16, 11:56 AM, "James Peach" <[email protected]> wrote:
>> On Jul 7, 2016, at 9:55 AM, Shrihari Kalkar <[email protected]> wrote:
>>
>> Guys,
>> I understand that in code, if a stat is registered under
>> 'proxy.process' it is owned by traffic_server.
>
> That is by convention. In the code RECT_PROCESS metrics are owned by
> traffic_server. Ownership of metrics is not a particularly strong
> concept though :-/

Cool!

>> Also, if a stat is registered under stats.config.xml or
>> metrics.config, the 'destination' is owned by traffic_manager.
>>
>> Since proxy.process.ssl.total_success_handshake_count is listed in
>> stats.config.xml and metrics.config, do you think that this is a
>> conflict of ownership?
>
> Grovelling through the git history will probably clear this up for you.
> IIRC total_success_handshake_count was moved to the derived metrics for
> compatibility when total_success_handshake_count_in and
> total_success_handshake_count_out were introduced. Maybe it makes sense
> now to make total_success_handshake_count the sum of in and out?

Yes, I went through https://issues.apache.org/jira/browse/TS-3409 to
understand why this was added. I agree that it should be a derived
metric, but I found it odd that it was the only derived metric with a
destination set under 'proxy.process'.

>> I see that on my instance, traffic_manager keeps sending an 'RECG_SET'
>> event for this stat to traffic_server even when nothing is happening.
>> As a result, I am seeing an issue where traffic_server blocks for a
>> very long time while initializing.
>
> Hmm. I'm not sure that slow initialization would be related to this.
> One common cause of initialization lag is setting
> proxy.config.http.wait_for_cache. I think you need to debug this one
> further :)

I did some analysis and here is what I found. Hopefully I am not too far
off on the wrong path.

As part of startup, traffic_server creates two threads (to begin with):

1. The main thread (Thread 1) blocks until it is signaled by another
   thread.
2. Thread 2 polls for messages from traffic_manager.

traffic_server is waiting for a message from traffic_manager that
contains all the configuration it needs to go ahead with initialization.
Hence, I understand that it is critical that the main thread (Thread 1)
wait until it gets the configuration.

Thread 2, which polls for messages from traffic_manager, works roughly
like this:

  for (;;) {
    if (pmgmt->require_lm) {       // always true (for us)
      pmgmt->pollLMConnection();
      // pollLMConnection() does, in essence:
      //   for (count = 0; count < 10000; count++) {
      //     num = mgmt_read_timeout(...); // blocking; returns 0 if nothing
      //                                   // was received for 1 second
      //     if (!num)
      //       break;      // leave the loop and return from the function
      //     read(fd);     // otherwise read the message, add it to the
      //                   // event queue, and loop to fetch the next one
      //   }
    }
    pmgmt->processEventQueue(); // process messages from pollLMConnection()
    pmgmt->processSignalQueue();
    mgmt_sleep_sec(pmgmt->timeout);
  }

I found two possible problems here:

1. pollLMConnection() might not return for a very long time if it keeps
   receiving messages: we may not call processEventQueue(), which
   actually processes the received messages, until up to 10,000 messages
   have been read. And unless we process the messages, we cannot signal
   the main thread (Thread 1), which is still blocked, to continue.
   Hence I saw the issue where traffic_server won't complete
   initialization for a very long time. (A rough sketch of the kind of
   fix I have in mind follows this list.)

2. Why is traffic_server receiving so many messages at boot-up in the
   first place?
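For what it's worth, here is a rough sketch of the kind of fix I have in
mind for problem 1 (untested, with hypothetical stub names standing in
for the real mgmt_read_timeout()/read()/enqueue calls): give the polling
loop a wall-clock budget, so that processEventQueue() is reached within
a known interval no matter how fast messages arrive.

  #include <chrono>

  // Stubs standing in for the real calls in pollLMConnection().
  int  mgmt_read_timeout_stub() { return 1; } // pretend messages keep coming
  void read_and_queue_event()   { /* read(fd) + enqueue, elided */ }

  void pollLMConnectionBounded()
  {
    using clock = std::chrono::steady_clock;
    const auto budget = std::chrono::seconds(5); // assumed value, tunable
    const auto start  = clock::now();

    for (int count = 0; count < 10000; count++) {
      // Even if a message lands just before every 1-second timeout, we
      // now bail out once `budget` elapses instead of after ~10,000 s.
      if (clock::now() - start >= budget)
        break;
      if (mgmt_read_timeout_stub() == 0)
        break;                // idle for a full second; nothing more to read
      read_and_queue_event(); // keep draining while messages arrive
    }
    // The caller then runs processEventQueue()/processSignalQueue() as
    // before, so queued messages get handled at least every `budget`.
  }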
I added some debug logs and found that traffic_server kept getting
'RECG' events for the metric
'proxy.process.ssl.total_success_handshake_count'. Narrowing it down
further (in RecSetRecord(...)), from traffic_manager's perspective the
logic is roughly:

  if (i_am_not_owner_of(stat))
    send traffic_server a notify message;

Ideally this code should not be triggered, because traffic_manager
controls updates to 'proxy.process.ssl.total_success_handshake_count'.
However, ownership in the code is determined solely from the stat's
string name. Since the stat name contains 'process', the code assumes
the stat is owned by traffic_server. This leads to an interesting
scenario where traffic_manager keeps updating its own stat and sends
unnecessary events to traffic_server; a small sketch of how I read this
check follows.
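A minimal illustration (hypothetical helper names, not the actual
RecSetRecord() code) of the name-prefix ownership check as I understand
it:

  #include <cstdio>
  #include <cstring>

  enum RecOwner { OWNER_MANAGER, OWNER_SERVER };

  // Simplified stand-in: ownership is derived purely from the name prefix.
  RecOwner owner_from_name(const char *name)
  {
    return strncmp(name, "proxy.process.", 14) == 0 ? OWNER_SERVER
                                                    : OWNER_MANAGER;
  }

  int main()
  {
    const char *stat = "proxy.process.ssl.total_success_handshake_count";
    // traffic_manager updates this derived metric itself, but the name
    // says "process", so the check concludes traffic_server owns it and
    // fires a notification for every update.
    if (owner_from_name(stat) != OWNER_MANAGER)
      printf("notify traffic_server: %s changed\n", stat);
    return 0;
  }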
I am still not sure how and when these updates happen. But if the
updates happen frequently, then because of problem 1 we can prevent
traffic_server from processing any messages for up to 10,000 seconds!
(I am imagining the case where each message is received just before the
1-second timeout expires.) I see this situation very often on a virtual
machine for some reason. Most of the time, traffic_cop comes along and
kills traffic_server because it took too long to initialize. I verified
that removing this derived metric makes the problem go away. However, I
have a feeling that we need to fix problem 1 as well.

>> I wanted to know if my understanding of ownership is correct. And if
>> it is, can I open a new bug to track the definition of this stat?
>
> Yes, it sounds like it is worth revisiting SSL stats to make sure they
> still make sense. Last time I looked at these, there were a few
> problems.

Sure, let me open a bug for this. Even if it ends up being invalid, the
discussion can be helpful.

> J

Thanks,
Shrihari