James,

Thank you for the clarifications. Please find my responses inline.

On 7/7/16, 11:56 AM, "James Peach" <[email protected]> wrote:
>> On Jul 7, 2016, at 9:55 AM, Shrihari Kalkar <[email protected]> wrote:
>>
>> Guys,
>> I understand that in code, if a stat is registered under
>> 'proxy.process' it is owned by traffic_server.
>
> That is by convention. In the code RECT_PROCESS metrics are owned by
> traffic_server. Ownership of metrics is not a particularly strong
> concept though :-/

Cool!

>> Also, if a stat is registered under stats.config.xml or
>> metrics.config, the 'destination' is owned by traffic_manager.
>>
>> Since proxy.process.ssl.total_success_handshake_count is listed in
>> stats.config.xml and metrics.config, do you think that this is a
>> conflict of ownership?
>
> Grovelling through the git history will probably clear this up for you.
> IIRC total_success_handshake_count was moved to the derived metrics for
> compatibility when total_success_handshake_count_in and
> total_success_handshake_count_out were introduced. Maybe it makes sense
> now to make total_success_handshake_count the sum of in and out?

Yes, I went through https://issues.apache.org/jira/browse/TS-3409 to
understand why this was added. I agree that it should be a derived
metric, but I found it odd that it was the only derived metric with a
destination set under 'proxy.process'.

>> I see that on my instance, traffic_manager keeps sending an 'RECG_SET'
>> event for this stat to traffic_server even when nothing is happening.
>> As a result, I am seeing an issue where traffic_server blocks for a
>> very long time while initializing.
>
> Hmm. I'm not sure that slow initialization would be related to this.
> One common cause of initialization lag is setting
> proxy.config.http.wait_for_cache. I think you need to debug this one
> further :)

I did some analysis and here is what I found. Hopefully I am not too far
off on the wrong path.

As part of startup, traffic_server creates two threads (to begin with):

1. The main thread (Thread 1) blocks until it is signaled by another
   thread.
2. Thread 2 polls for messages from traffic_manager.

traffic_server is waiting for a message from traffic_manager that
contains all the configuration it needs to go ahead with initialization.
Hence, I understand that it is critical that the main thread (Thread 1)
wait until it gets the configuration.

Thread 2, which polls for messages from traffic_manager, works roughly
like this:

  for (;;) {
    if (pmgmt->require_lm) {       // always true (for us)
      pmgmt->pollLMConnection();
      // pollLMConnection() does, in essence:
      //   for (count = 0; count < 10000; count++) {
      //     num = mgmt_read_timeout(...); // blocking; returns 0 if nothing
      //                                   // was received for 1 second
      //     if (!num)
      //       break;      // leave the loop and return from the function
      //     read(fd);     // otherwise read the message, add it to the
      //                   // event queue, and loop to fetch the next one
      //   }
    }
    pmgmt->processEventQueue(); // process messages from pollLMConnection()
    pmgmt->processSignalQueue();
    mgmt_sleep_sec(pmgmt->timeout);
  }

I found two possible problems here:

1. pollLMConnection() might not return for a very long time if it keeps
   receiving messages: we may not call processEventQueue(), which
   actually processes the received messages, until up to 10,000 messages
   have been read. And unless we process the messages, we cannot signal
   the main thread (Thread 1), which is still blocked, to continue.
   Hence I saw the issue where traffic_server won't complete
   initialization for a very long time. (A rough sketch of the kind of
   fix I have in mind follows this list.)

2. Why is traffic_server receiving so many messages at boot-up in the
   first place?
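For what it's worth, here is a rough sketch of the kind of fix I have in
mind for problem 1 (untested, with hypothetical stub names standing in
for the real mgmt_read_timeout()/read()/enqueue calls): give the polling
loop a wall-clock budget, so that processEventQueue() is reached within
a known interval no matter how fast messages arrive.

  #include <chrono>

  // Stubs standing in for the real calls in pollLMConnection().
  int  mgmt_read_timeout_stub() { return 1; } // pretend messages keep coming
  void read_and_queue_event()   { /* read(fd) + enqueue, elided */ }

  void pollLMConnectionBounded()
  {
    using clock = std::chrono::steady_clock;
    const auto budget = std::chrono::seconds(5); // assumed value, tunable
    const auto start  = clock::now();

    for (int count = 0; count < 10000; count++) {
      // Even if a message lands just before every 1-second timeout, we
      // now bail out once `budget` elapses instead of after ~10,000 s.
      if (clock::now() - start >= budget)
        break;
      if (mgmt_read_timeout_stub() == 0)
        break;                // idle for a full second; nothing more to read
      read_and_queue_event(); // keep draining while messages arrive
    }
    // The caller then runs processEventQueue()/processSignalQueue() as
    // before, so queued messages get handled at least every `budget`.
  }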
I added some debug logs and found that traffic_server kept getting
'RECG' events for the metric
'proxy.process.ssl.total_success_handshake_count'. Narrowing it down
further (in RecSetRecord(...)), from traffic_manager's perspective the
logic is roughly:

  if (i_am_not_owner_of(stat))
    send traffic_server a notify message;

Ideally this code should not be triggered, because traffic_manager
controls updates to 'proxy.process.ssl.total_success_handshake_count'.
However, ownership in the code is determined solely from the stat's
string name. Since the stat name contains 'process', the code assumes
the stat is owned by traffic_server. This leads to an interesting
scenario where traffic_manager keeps updating its own stat and sends
unnecessary events to traffic_server; a small sketch of how I read this
check follows.
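A minimal illustration (hypothetical helper names, not the actual
RecSetRecord() code) of the name-prefix ownership check as I understand
it:

  #include <cstdio>
  #include <cstring>

  enum RecOwner { OWNER_MANAGER, OWNER_SERVER };

  // Simplified stand-in: ownership is derived purely from the name prefix.
  RecOwner owner_from_name(const char *name)
  {
    return strncmp(name, "proxy.process.", 14) == 0 ? OWNER_SERVER
                                                    : OWNER_MANAGER;
  }

  int main()
  {
    const char *stat = "proxy.process.ssl.total_success_handshake_count";
    // traffic_manager updates this derived metric itself, but the name
    // says "process", so the check concludes traffic_server owns it and
    // fires a notification for every update.
    if (owner_from_name(stat) != OWNER_MANAGER)
      printf("notify traffic_server: %s changed\n", stat);
    return 0;
  }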
I am still not sure how and when these updates happen. But if the
updates happen frequently, then because of problem 1 we can prevent
traffic_server from processing any messages for up to 10,000 seconds!
(I am imagining the case where each message is received just before the
1-second timeout expires.) I see this situation very often on a virtual
machine for some reason. Most of the time, traffic_cop comes along and
kills traffic_server because it took too long to initialize. I verified
that removing this derived metric makes the problem go away. However, I
have a feeling that we need to fix problem 1 as well.

>> I wanted to know if my understanding of ownership is correct. And if
>> it is, can I open a new bug to track the definition of this stat?
>
> Yes, it sounds like it is worth revisiting SSL stats to make sure they
> still make sense. Last time I looked at these, there were a few
> problems.

Sure, let me open a bug for this. Even if it ends up being invalid, the
discussion can be helpful.

> J

Thanks,
Shrihari