NOTE TO DECLUDE LIST: I was originally going to answer this off-line
as it was directed to me, but once I got done writing the response it
occurred to me that the same questions and issues might be important
to many Declude users. So, finally, I decided to copy the list on
this. If I guessed wrong, sorry about the extra bandwidth. _M

On Tuesday, August 31, 2004, 2:41:01 AM, Robert wrote:

RES> Hi,

RES> A couple of thoughts (not to the list in case I am out of line)

Not at all - quite in line actually...

RES> Wouldn't a CSV output be much more compact?  Log data once normalized into a
RES> standard format would be identical row after row so why burden it with the
RES> embedded xml schema?  With a good CSV output, it can be sucked into any
RES> spreadsheet or database system easily.

The MDLP project already has a schema for a set of CSV files that
relate directly to tables. The tables produced from those files are
designed to provide a normalized structure for test performance
analysis and per-message drill-down.

There are also a number of other outputs and capabilities that will be
built into MDLP...

The XML output format has the quirk of being the first output that
represents the complete internal structure of a message object/graph.
So, it's out there first for two reasons:

1. Testing with this helps us understand that we've captured and
normalized the message data properly.

2. The XML format will likely help us to understand if we have missed
any important data points so that we can go back and capture them
internally before moving on to other outputs.

--- One of the key advantages of XML over CSV is that it provides for
structured representations of data. The XML records we're exporting
now have an almost identical structure to the internal representation
of the message data we are capturing from the logs.

A CSV representation of the same data requires that the object first
be decomposed to match some operational need and/or relational model.
Following that, the output is a loose collection of tables that do
not in themselves represent the organized structure of the data.

While XML appears on the surface to be less efficient, in the end it
is actually more efficient because it embodies not only the raw data
but also the schema within which that data is organized - thus
allowing the recipient of the XML to make all of the decisions about
how that data should be broken down for their needs.
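
To make that concrete, here's a rough sketch in Python - the element
and field names are made up for illustration and are NOT MDLP's actual
schema - showing how a nested per-message record maps directly onto
XML, while a CSV export first has to be flattened into rows:

# Illustration only - invented element/field names, not MDLP's schema.
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical internal message object: one message with a
# variable-length collection of test events attached to it.
message = {
    "id": "q1234567890.smd",
    "from": "someone@example.com",
    "tests": [
        {"name": "SPAMCOP", "weight": 5},
        {"name": "SNIFFER", "weight": 8},
    ],
}

# XML mirrors the internal structure directly - the nesting IS the schema.
msg_el = ET.Element("Message", id=message["id"])
ET.SubElement(msg_el, "From").text = message["from"]
tests_el = ET.SubElement(msg_el, "Tests")
for t in message["tests"]:
    ET.SubElement(tests_el, "Test", name=t["name"], weight=str(t["weight"]))
print(ET.tostring(msg_el, encoding="unicode"))

# CSV forces a structural decision first: here we flatten to one row per
# test event, repeating the message id so the rows can be re-associated.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["msgid", "from", "test", "weight"])
for t in message["tests"]:
    writer.writerow([message["id"], message["from"], t["name"], t["weight"]])
print(out.getvalue())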

I hope that makes some sense - I've rambled over a few points there.

RES> Having looked briefly, but not extensively, at declude log output, is it
RES> true that the logging from declude is not normalized?  Some events may
RES> create more than one line in the log files and related log entries might be
RES> interspersed with other events? (I know Imail suffers from this).

Yes. Absolutely. One of the more difficult aspects of this program's
design has been handling log corruption, aggregation, and
normalization in an efficient and appropriate way. (As stated above,
the XML output represents the results of that work in a single object
per message - testing at this level lets us know that we've got that
right before we begin breaking that data down into other outputs.)

-- A side note: Due to the corruption commonly found in these log
files there will be messages that don't make it through MDLP to the
output. This is the only way to normalize the output data - otherwise
we would produce partially populated, disjointed outputs. We may
provide an option to include partially captured messages later, but
for now we will simply omit any message that is missing an important
piece of data.

RES> If true, the biggest thing a utility should do is "normalize" the logs and
RES> create an output file with well defined fields that is consistently one
RES> event per log entry.

This is what the current XML output does. One <Message/> object for
each complete message that can be extracted from the log file.

RES> With that, the physical format doesn't really matter that much.  Again, I
RES> think CSV is really universal.

Well, yes and no. CSV is a universal import/export format with a loose
but widely portable definition. The problem with CSV is that it only
represents a single table of data per file. The data we are capturing
for each message is far more complex than that. Specifically, each
message is attached to a collection of test events. The number and
kind of test events are not fixed, so it's not possible to normalize
this structure into a single table representation.

If it were, then that table would be highly inefficient because the
majority of fields in each record would be blank - yet they would all
take up space in the database... Put another way, if a single table
were defined then we would first need to know all of the possible
tests so that we could define columns for them within the table.
Assuming that this were possible (it's not, really), each record
would have 30-50 extra fields, one for each possible test result. For
each message record only 0-10 of these would typically be filled.

--- Side note: If you look at Message Sniffer logs you can see another
way that flat files can be inefficient. Message Sniffer logs are
normalized in such a way that each matching rule is recorded on its
own line --- with a "Final" line injected to represent the final
disposition of the message... There are a number of problems with this
format that are common when trying to represent structured data in a
single flat table. First, the data isn't really "normalized" because
some of the records have special meanings. A "Clear" record indicates
a message that didn't match any rules. A "Final" record indicates a
message that did match rules - but those may have been white rules, so
"Final" is not a direct indicator of spam. Also, much of the data on
each record is duplicated so that all of the records can be associated
with a given message --- this duplication takes up a LOT of space,
much more than would be taken up by a good XML schema, for example.

The solution to this problem is to represent the data using a family
of related tables, each stored in a separate CSV file... this is part
of our specification (the next part we're working on, in fact). One
problem with this concept, however, is that once we produce this set
of CSV files we will have made all of the decisions about how the
data should be structured. The end user who does not use the XML
format will be stuck with our decisions... hopefully they will be good
ones ;-)
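
As a rough illustration of what a family of related tables might look
like - the file and column names here are hypothetical, not the actual
MDLP specification - the same message objects could be split across
two CSV files joined on a message id:

# Illustration only - hypothetical file/column names, not the MDLP spec.
import csv

messages = [
    {"id": "q111.smd", "from": "a@example.com", "hold": True,
     "tests": [{"name": "SNIFFER", "weight": 8},
               {"name": "SPAMCOP", "weight": 5}]},
    {"id": "q222.smd", "from": "b@example.com", "hold": False,
     "tests": []},
]

# messages.csv: exactly one narrow row per message.
with open("messages.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["msgid", "from", "hold"])
    for m in messages:
        w.writerow([m["id"], m["from"], int(m["hold"])])

# message_tests.csv: zero or more rows per message, one per test event,
# related back to messages.csv by msgid - no sparse 30-50 column rows.
with open("message_tests.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["msgid", "test", "weight"])
    for m in messages:
        for t in m["tests"]:
            w.writerow([m["id"], t["name"], t["weight"]])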

RES> XML is nice, but what does it really buy you in this case?  If you have a
RES> clean CSV you can wrap a simple schema around it anytime.  Conversely, with
RES> XML, getting back a pure CSV is not easy unless one is already using lots of
RES> XML tools.

XML tools are becoming more available, and so XML is becoming easier
to use. XML's raw capabilities are profoundly better than CSV's - but
I recognize that in practice, if you aren't up the XML curve yet, it
seems impenetrable and therefore useless.

As for wrapping a simple schema around a CSV - the challenge is that
your schema can only be either an exact match for the schema defined
by the developer - or some subset thereof.

That said, of course there are ways to massage a CSV file, or a set of
them, into any schema you want --- but that requires more work.

What XML buys you, if you are up the curve, is a significant reduction
in that work, and the flexibility to more easily change your mind
later about how to interpret the data. That is, since the source data
also contains an organized structure you are free to reinterpret that
structure and reorganize it to fit your needs.

RES> I.E., I can envision lots of people wanting declude logs in CSV, but less
RES> people wanting XML.  CSV only would be fine for everyone, XML only would
RES> not.

Agreed. A set of CSV files is already in the specification for MDLP
for this reason. Those who have an XML capability will appreciate that
XML is available. The majority of folks, at least for now, will only
have interest in the CSV files.

RES> So why do both - spend more effort on the parsing and intelligence part of a
RES> good log tool for declude rather than in the pretty-printing/formatting of
RES> the output beyond what is needed to be useful.

As it turns out, making the log data useful requires that it first be
aggregated and rendered free of corruption. There are a lot of events
that are recorded asynchronously in the log file, and many lines that
are partially overwritten - or missing, etc. Before any reliable data
can be extracted, all of these problems must be overcome. So, the
extra intelligence is already required. It turns out that once that
work is done, it's a very simple exercise to export that data in an
XML format --- easier, in fact, than exporting in CSV since the data
is already structured and there is no need to do the decomposition
work that is required for a good set of CSV files.

The relational schema we are going to support with the first version
of MDLP uses 3 files/tables. These are needed in order to produce a
summary of test performance without heavy database processing.

One table will represent each message in its basic form.
One table will relate each message to the tests that affected it.
One table will summarize the performance of each test within a given
hour.

With these three tables it is possible to reproduce a drillable
version of the Spam Test Quality Analysis using a few basic queries.
This will certainly not be the best schema for everyone, but it's a
pretty good one.
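
To give a flavor of those queries - again with invented table and
column names, so don't hold me to the details - once the three files
are loaded into something like SQLite, a per-test performance summary
is a single small query:

# Sketch only - invented table/column names; the real MDLP schema may differ.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE messages      (msgid TEXT PRIMARY KEY, hold INTEGER);
    CREATE TABLE message_tests (msgid TEXT, test TEXT, weight INTEGER);
    CREATE TABLE test_hourly   (hour TEXT, test TEXT, hits INTEGER);
""")
con.executemany("INSERT INTO messages VALUES (?, ?)",
                [("q111.smd", 1), ("q222.smd", 0)])
con.executemany("INSERT INTO message_tests VALUES (?, ?, ?)",
                [("q111.smd", "SNIFFER", 8), ("q111.smd", "SPAMCOP", 5)])

# test_hourly would hold the pre-summarized per-hour counts for cheap
# reporting; here we just ask: how often does each test fire, and how
# often on messages that were actually held?
for row in con.execute("""
        SELECT t.test,
               COUNT(*)    AS hits,
               SUM(m.hold) AS hits_on_held_messages
        FROM message_tests t
        JOIN messages m ON m.msgid = t.msgid
        GROUP BY t.test
        ORDER BY hits DESC"""):
    print(row)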

There are a few other mechanisms that will be supported by MDLP
eventually... but that's another email ;-)

Hope this helps,
_M


