[jira] Commented: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper

Ivan Kelly (JIRA) Mon, 19 Jul 2010 02:43:21 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889800#action_12889800
 ]


Ivan Kelly commented on ZOOKEEPER-816:
--------------------------------------

(Assumed JIRA picked up email replies. Seems not >:/)

As far as I've seen, this overhead comes in two forms, CPU and disk.  
CPU overhead is mostly due to formatting. Disk obviously because  
tracing will fill your disk fairly quickly. Perhaps something could be  
done to combat both of these. To fix the formatting problem we could  
use a binary log format. I've seen this done in C++ but not in Java.  
The basic idea is that if you have TRACE("operation %x happened to %s  
%p", obj1, obj2, obj3); a preprocessor replaces this with  
TRACE(0x1234, obj1, obj2, obj3) where 0x1234 is an identifier for the  
trace. Then when the trace occurs a binary blob [0x1234, value of  
obj1, value of obj2, value of obj3] is logged. Then when the logs are  
pulled off the machine you run a post-processor to do all the  
formatting and you get your full trace.
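
A minimal sketch of how the binary format described above might look in
Java, to make the idea concrete. Everything here is an assumption made for
illustration: the BinaryTrace class, its method names, and the fixed
(int, String, long) argument layout are hypothetical and not part of
ZooKeeper. The format table is filled in by hand where a preprocessor would
generate it, and %d stands in for the C-style %p, which Java's formatter
does not have.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from ZooKeeper.
final class BinaryTrace {
    // Trace id -> format string. In the scheme described above a build step
    // would generate this table; here it is filled in by hand.
    static final Map<Integer, String> FORMATS = new HashMap<>();
    static {
        FORMATS.put(0x1234, "operation %x happened to %s %d");
    }

    // Hot path: write only the id and the raw argument values, no formatting.
    static void trace(ByteBuffer out, int id, int op, String name, long value) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        out.putInt(id).putInt(op).putInt(nameBytes.length).put(nameBytes).putLong(value);
    }

    // Post-processor: run offline against the collected log to rebuild the text.
    static String decode(ByteBuffer in) {
        int id = in.getInt();
        int op = in.getInt();
        byte[] nameBytes = new byte[in.getInt()];
        in.get(nameBytes);
        long value = in.getLong();
        String name = new String(nameBytes, StandardCharsets.UTF_8);
        return String.format(FORMATS.get(id), op, name, value);
    }
}
```

The server would only ever call trace(); decode() would live in a separate
tool that has access to the same id-to-format table.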

Regarding the disk overhead, traces are usually only interesting in  
the run up to a failure. We could have a ring buffer in memory that is  
constantly traced to, old traces being overwritten when the ring  
buffer reaches its limit. These traces should only be dumped to the  
filesystem when an error or fatal level event occurs, thereby giving  
you a trace of what was happening before you fell over.
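
The ring buffer could be sketched roughly as follows. This is an
illustration under assumptions, not an existing ZooKeeper class: TraceRing,
add() and dump() are hypothetical names, entries are plain strings for
simplicity, and dump() just returns the retained traces so that the caller
can write them to the filesystem when an error or fatal event is logged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the class and method names are illustrative only.
final class TraceRing {
    private final String[] slots;
    private long written; // total number of traces ever added

    TraceRing(int capacity) {
        slots = new String[capacity];
    }

    // Called on every trace; overwrites the oldest entry once the ring is full.
    synchronized void add(String trace) {
        slots[(int) (written % slots.length)] = trace;
        written++;
    }

    // Called on an error or fatal event; returns retained traces, oldest first.
    synchronized List<String> dump() {
        int count = (int) Math.min(written, slots.length);
        List<String> out = new ArrayList<>(count);
        for (long i = written - count; i < written; i++) {
            out.add(slots[(int) (i % slots.length)]);
        }
        return out;
    }
}
```

With a capacity of N, memory use stays bounded and a dump contains at most
the last N traces before the failure.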



-Ivan


> Detecting and diagnosing elusive bugs and faults in Zookeeper
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-816
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-816
>             Project: Zookeeper
>          Issue Type: New Feature
>            Reporter: Miguel Correia
>            Priority: Minor
>
> Complex distributed systems like Zookeeper tend to fail in strange ways that 
> are hard to diagnose. The objective is to build a tool that helps understand 
> when and where these problems occurred based on Zookeeper's traces (i.e., 
> logs in TRACE level). Minor changes to the server code will be needed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
