NOTE TO DECLUDE LIST: I was originally going to answer this off-line as it was directed to me, but once I got done writing the response it occurred to me that the same questions and issues might be important to many Declude users. So, finally, I decided to copy the list on this. If I guessed wrong, sorry about the extra bandwidth. _M
On Tuesday, August 31, 2004, 2:41:01 AM, Robert wrote:

RES> Hi,

RES> A couple of thoughts (not to the list in case I am out of line)

Not at all - quite in line actually...

RES> Wouldn't a CSV output be much more compact? Log data once normalized into a
RES> standard format would be identical row after row so why burden it with the
RES> embedded xml schema? With a good CSV output, it can be sucked into any
RES> spreadsheet or database system easily.

The MDLP project already has a schema for a set of CSV files that relate directly to tables. The tables produced from these are designed to provide a normalized structure for test performance analysis and per-message drill-down. There are also a number of other outputs and capabilities that will be built into MDLP...

The XML output format has the quirk of being the first output that represents the completed internal structure of a message object/graph. So, it's out there first for two reasons:

1. Testing with this helps us understand that we've captured and normalized the message data properly.

2. The XML format will likely help us to understand if we have missed any important data points, so that we can go back and capture them internally before moving on to other outputs.

---

One of the key advantages of XML over CSV is that it provides for structured representations of data. The XML records we're exporting now have an almost identical structure to the internal representation of the message data we are capturing from the logs. A CSV representation of the same data requires that the object first be decomposed to match some operational need and/or relational model. Following that, the output is a loose affiliation of tables that do not in themselves represent the organized structure of the data.
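To make that concrete, here is a minimal sketch of how a per-message record with a variable set of test events nests naturally in XML. The element and attribute names (`Message`, `Tests`, `weight`, and so on) are hypothetical illustrations, not MDLP's actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical records: each message carries a *variable* number of test
# events, which nests naturally in XML but not in one flat CSV table.
messages = [
    {"id": "q1a2b3", "from": "a@example.com",
     "tests": [("SNIFFER", 8), ("SPAMCOP", 5)]},
    {"id": "q4c5d6", "from": "b@example.com", "tests": []},  # no tests fired
]

root = ET.Element("Messages")
for m in messages:
    msg = ET.SubElement(root, "Message", id=m["id"])
    ET.SubElement(msg, "From").text = m["from"]
    tests = ET.SubElement(msg, "Tests")
    for name, weight in m["tests"]:
        ET.SubElement(tests, "Test", name=name, weight=str(weight))

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Note that the second message simply carries an empty `<Tests/>` element; no blank columns are needed to account for tests that did not fire.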
While XML appears on the surface to be less efficient, in the end it is actually more efficient, because it embodies not only the raw data but also the schema within which that data is organized - thus allowing the recipient of the XML to make all of the decisions about how that data should be broken down for their needs. I hope that makes some sense - I've rambled over a few points there.

RES> Having looked briefly, but not extensively, at declude log output, is it
RES> true that the logging from declude is not normalized? Some events may
RES> create more than one line in the log files and related log entries might be
RES> interspersed with other events? (I know Imail suffers from this).

Yes. Absolutely. One of the more difficult aspects of this program's design has been handling log corruption, aggregation, and normalization in an efficient and appropriate way. (As stated above, the XML output represents the results of that work in a single object per message - testing at this level lets us know that we've got that right before we begin breaking that data down into other outputs.)

-- A side note: Due to the corruption commonly found in these log files, there will be messages that don't make it through MDLP to the output. This is the only way to normalize the output data - otherwise we would produce partially populated, disjointed outputs. We may provide an option to output partially captured messages later, but for now we will simply omit any message that is missing an important piece of data.

RES> If true, the biggest thing a utility should do is "normalize" the logs and
RES> create an output file with well defined fields that is consistently one
RES> event per log entry.

This is what the current XML output does. One <Message/> object for each complete message that can be extracted from the log file.

RES> With that, the physical format doesn't really matter that much. Again, I
RES> think CSV is really universal.

Well, yes and no.
CSV is a universal import/export format with a loose but widely portable definition. The problem with CSV is that it only represents a single table of data per file. The data we are capturing for each message is far more complex than that. Specifically, each message is attached to a collection of test events. The number and kind of test events is not fixed, so it's not possible to normalize this structure into a single-table representation. If it were, then that table would be highly inefficient, because the majority of fields in each record would be blank - yet they would all take up space in the database...

Put another way, if a single table were defined, then we would first need to know all of the possible tests so that we could define columns for them within the table. Assuming that this were possible (it's not, really), each record would have 30-50 extra fields representing each possible test result. For each message record only 0-10 of these would nominally be filled.

---

Side note: If you look at Message Sniffer logs you can see another way that flat files can be inefficient. Message Sniffer logs are normalized in such a way that each matching rule is represented on its own line - with a "Final" line injected to represent the final disposition of the message...

There are a number of problems with this format that are common when trying to represent structured data in a single flat table. First, the data isn't really "normalized", because some of the records have special meanings. A "Clear" record indicates a message that didn't match any rules. A "Final" record indicates a message that did match rules - but they may have been white rules, so "Final" is not a direct indicator of spam. Also, much of the data on each record is duplicated so that all of the records can be associated with a given message - this duplication takes up a LOT of space, much more than would be taken up by a good XML schema, for example.
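The sparse single-table problem above is easy to demonstrate. This sketch (with made-up test names and counts) flattens per-message test results into one wide CSV row, one column per possible test, and counts how many cells end up blank:

```python
import csv
import io

# Hypothetical test names/results. A single flat table needs one column per
# *possible* test, so almost every cell in every row ends up blank.
all_tests = [f"TEST{i:02d}" for i in range(40)]            # 40 possible tests
per_message = [{"TEST03": 5, "TEST17": 8}, {"TEST21": 2}]  # only 1-2 fire here

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["msgid"] + all_tests, restval="")
writer.writeheader()
for i, results in enumerate(per_message):
    writer.writerow({"msgid": f"m{i}", **results})

first_row = buf.getvalue().splitlines()[1].split(",")
empty = first_row.count("")
print(f"{empty} of {len(all_tests)} test columns are empty for message m0")
```

With 40 possible tests and 2 firing, 38 of the 40 test columns in the first row are empty, which matches the "30-50 extra fields, 0-10 filled" estimate above.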
The solution to this problem is to represent the data using a family of related tables, where each is represented in a separate CSV file... this is part of our specification (the next part we're working on, in fact). One problem with this concept, however, is that once we produce this set of CSV files, we will have made all of the decisions about how the data should be structured. The end user who does not use the XML format will be stuck with our decisions... hopefully they will be good ones ;-)

RES> XML is nice, but what does it really buy you in this case? If you have a
RES> clean CSV you can wrap a simple schema around it anytime. Conversely, with
RES> XML, getting back a pure CSV is not easy unless one is already using lots of
RES> XML tools.

XML tools are becoming more available, and so XML is becoming easier to use. XML's raw capabilities are profoundly better than CSV's - but I recognize that in practice, if you aren't up the XML curve yet, it seems impenetrable and therefore useless.

As for wrapping a simple schema around a CSV - the challenge is that your schema can only be either an exact match for the schema defined by the developer, or some subset thereof. That said, of course there are ways to massage a CSV file, or a set of them, into any schema you want - but that requires more work. What XML buys you, if you are up the curve, is a significant reduction in that work, and the flexibility to more easily change your mind later about how to interpret the data. That is, since the source data also contains an organized structure, you are free to reinterpret that structure and reorganize it to fit your needs.

RES> I.E., I can envision lots of people wanting declude logs in CSV, but less
RES> people wanting XML. CSV only would be fine for everyone, XML only would
RES> not.

Agreed. A set of CSV files is already in the specification for MDLP for this reason. Those who have an XML capability will appreciate that XML is available.
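A quick sketch of the family-of-related-tables idea: a messages table plus a message-to-test link table (each of which could be loaded from its own CSV file), joined for drill-down. The table and column names here are hypothetical, not MDLP's actual schema:

```python
import sqlite3

# Two related tables replace one sparse flat table: one row per message,
# and one row per (message, test) pair. Names are illustrative only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE messages (msgid TEXT PRIMARY KEY, sender TEXT, held INTEGER);
    CREATE TABLE message_tests (msgid TEXT, test TEXT, weight INTEGER);
""")
db.executemany("INSERT INTO messages VALUES (?,?,?)",
               [("m0", "a@example.com", 1), ("m1", "b@example.com", 0)])
db.executemany("INSERT INTO message_tests VALUES (?,?,?)",
               [("m0", "SNIFFER", 8), ("m0", "SPAMCOP", 5),
                ("m1", "WHITELIST", -20)])

# Drill down: which tests fired on held messages?
rows = db.execute("""
    SELECT t.test, t.weight FROM messages m
    JOIN message_tests t ON t.msgid = m.msgid
    WHERE m.held = 1 ORDER BY t.weight DESC
""").fetchall()
print(rows)  # only m0's tests appear
```

No column is ever blank, and adding a new test to the system adds rows, not schema changes.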
The majority of folks, at least for now, will only have interest in the CSV files.

RES> So why do both - spend more effort on the parsing and intelligence part of a
RES> good log tool for declude rather than in the pretty-printing/formatting of
RES> the output beyond what is needed to be useful.

As it turns out, making the log data useful requires that it first be aggregated and rendered free of corruption. There are a lot of events that are recorded asynchronously in the log file, and many lines that are partially overwritten - or missing, etc. Before any reliable data can be extracted, all of these problems must be overcome. So, the extra intelligence is already required.

It turns out that once that work is done, it's a very simple exercise to export that data in an XML format - easier, in fact, than exporting in CSV, since the data is already structured and there is no need to do the decomposition work that is required for a good set of CSV files.

The relational schema we are going to support with the first version of MDLP requires 3 files/tables. These are required in order to produce a summary of test performance without requiring heavy database processing. One table will represent each message in its basic form. One table will relate each message to the tests that affected it. One table will summarize the performance of each test within a given hour. With these three tables it is possible to reproduce a drillable version of the Spam Test Quality Analysis using a few basic queries. This will certainly not be the best schema for everyone, but it's a pretty good one.

There are a few other mechanisms that will be supported by MDLP eventually... but that's another email ;-)

Hope this helps,

_M

---
[This E-mail was scanned for viruses by Declude Virus (http://www.declude.com)]

---
This E-mail came from the Declude.JunkMail mailing list. To unsubscribe, just
send an E-mail to [EMAIL PROTECTED], and type "unsubscribe Declude.JunkMail".
The archives can be found at http://www.mail-archive.com.