Re: [SLUG] Re: Why XML bites and why it is NOT a markup language

telford Sat, 11 Jun 2005 07:14:04 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, Jun 11, 2005 at 09:15:27AM +1000, Jamie Honan wrote:
> 
> I can't believe I'm defending xml.


A good mental exercise to remind yourself of why you don't want it,
while providing handy straw men for me to knock down.

> > I'd like to coin the name "RML" which
> > stands for "Robust Markup Language" which should have the following
> 
> Don't be bashful here, Telford. I suggest "TOTRML": Telford's One True
> Robust Markup Language.

The acronym has to remain vaguely pronounceable;
RML can be spoken as "rummel" without too much confusion.

> > desirable properties:
> > 
> >   * stream-oriented construction
> 
> Stream is good, yes. PNG, and JPEG are streamable. ASF is streamable
> and AVI suffers cause it isn't. However, not all data is streamable.

You can imprint a record-oriented structure onto a stream format
by using tags in the stream but trying to support a stream by using
a record format is really ugly (not impossible). It is desirable
to have a format that makes it easy to build higher level formats on
top of rather than a format which is already so high level that it
becomes cumbersome for ordinary tasks.

> Thus xml parsers present data as callbacks (kind of streaming),
> or a walkable tree.

Providing you don't want to seek an XML stream...

> >   * byte-oriented construction (no 16 bit encodings at all)
> 
> You mean no unicode? As opposed to no binary mode?

UTF-8 is byte-oriented as far as the parser is concerned,
provided you use 7-bit markers for critical synchronisation
(like start of tag, end of tag, etc).

> >   * supports arbitrary tags
> 
> Ah. You've just lost validation. You now can't prove your data
> conforms to an agreed DTD.

Not at all. Validation is a higher level function and should
be treated as such. Building a layered architecture is far more
reliable, flexible and maintainable than building a monolithic
architecture. Thus the job of the parser is to: read the raw data
stream; identify the tags; identify the data blocks and provide
an API that gives access to these entities (and nothing else).
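
As a rough sketch of how small that parse layer can be (Python; the
STX/ETX marker bytes and the parse() name are assumptions made up for
illustration, nothing above fixes a concrete syntax):

```python
# Hypothetical marker bytes: ASCII STX and ETX, chosen only for illustration.
TAG_START = 0x02
TAG_END = 0x03


def parse(stream: bytes):
    """Yield ("tag", name) and ("data", payload) items in stream order.

    Nothing is validated here. A tag with no end marker is reported as
    ("broken", rest) instead of raising, so the caller decides what to do.
    """
    pos = 0
    while pos < len(stream):
        if stream[pos] == TAG_START:
            end = stream.find(bytes([TAG_END]), pos + 1)
            if end == -1:
                yield ("broken", stream[pos + 1:])
                return
            yield ("tag", stream[pos + 1:end])
            pos = end + 1
        else:
            nxt = stream.find(bytes([TAG_START]), pos)
            if nxt == -1:
                nxt = len(stream)
            yield ("data", stream[pos:nxt])
            pos = nxt
```

Everything else (templates, checksums, topology) would then consume
the items that parse() yields and never touch the raw bytes.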

Any additional data analysis is another layer on top and there
are all sorts of ways to apply data-specific templates over the
top once you have the parser results (not limited by DTD
technology either). This could include testing for particular
sets of tags, particular sequence of tags, maximum data size,
data formats, data checksums and any other sanity check that 
suits the needs of the day. For example, a DTD won't allow you
to test a CRC against the content of a record so you need a
higher level operation to do that sort of checking anyhow;
XML does not remove the requirement to sanity-check the data
you are given.
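
To make the CRC case concrete, here is a minimal sketch of such a
higher-level check in Python. The item layout (tuples of kind, name,
parameter) and the "crc" tag carrying a hex CRC-32 of the preceding
data block are assumptions invented for this example; only
zlib.crc32 is a real library call.

```python
import zlib


def check_crcs(items):
    """Return a list of problems found; an empty list means all CRCs agree.

    items is assumed to be parser output: ("data", payload) or
    ("tag", name, parameter). A hypothetical "crc" tag carries the
    CRC-32 of the data block just before it, as a hex string.
    """
    problems = []
    last_data = None
    for index, item in enumerate(items):
        if item[0] == "data":
            last_data = item[1]
        elif item[0] == "tag" and item[1] == "crc":
            if last_data is None:
                problems.append(f"item {index}: crc tag with no preceding data")
                continue
            try:
                expected = int(item[2], 16)
            except ValueError:
                problems.append(f"item {index}: crc parameter is not hex")
                continue
            actual = zlib.crc32(last_data)
            if actual != expected:
                problems.append(
                    f"item {index}: crc mismatch, "
                    f"expected {expected:08x}, got {actual:08x}"
                )
    return problems
```

Note that this layer never sees raw bytes from the stream, so a bad
checksum cannot take the parser down with it.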

I have nothing against higher-level libraries offering additional
services, provided they don't crash the basic parse layer.

Think about a bison parser with no error-trapping rules,
what do you get when something goes wrong? An error message that
says "parse error" and that's all you get. How useful is this
when you want to know what went wrong? Not very. Not at all really.
I've seen people using Java XML libraries that allow extensive
DTD and XSLT validation so you can define highly complex data
structures very easily. When it tries to read something that might
have a tag in the wrong place it just returns "false"... 
sorry I won't read that because it doesn't validate, and I won't
tell you why so have fun figuring it out yourself.

The yacc developers and users already went through this exercise
about 15 years before XML was even invented and figured out that
just a yes/no answer isn't good enough.
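
For contrast, a sketch of a template check that says where and why
it failed instead of returning a bare "false" (Python; the function
name and the idea of a template as a flat list of tag names are
assumptions for illustration only):

```python
def check_sequence(found, expected):
    """Compare tag names against an expected sequence.

    Returns a list of diagnostics; an empty list means the sequence matches.
    """
    problems = []
    for position, (got, want) in enumerate(zip(found, expected)):
        if got != want:
            problems.append(f"tag {position}: expected '{want}', found '{got}'")
    if len(found) < len(expected):
        missing = ", ".join(expected[len(found):])
        problems.append(f"missing trailing tags: {missing}")
    elif len(found) > len(expected):
        extra = ", ".join(found[len(expected):])
        problems.append(f"unexpected trailing tags: {extra}")
    return problems
```

The caller still gets a yes/no answer for free (is the list empty?)
but also has something to show the poor soul debugging the data.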

> That's OK, because I suspect Telford is talking Telford protocol
> to Telford at this point.

Maybe yes, but from my point of view XML is a failure, I've tried
it and seen that it is broken and I can explain why it is broken.
I'm just identifying what needs fixing.

> >   * supports parametric tags
> 
> Lot's of people don't like parameters. They think they should be
> in the data part. I don't mind, we are changing the world here.

The parameters are optional.

Data should be in the data part, metadata should be in the tag.
I agree that the boundary is a fuzzy one but in principle...
"if the user of the program can ever see this text under any normal
circumstances then it is data, otherwise it is metadata". You will
note that HTML forms break this rule but I believe the rule is as
solid as any you will find and easier to express than most.

> >   * never allow tags inside a tag definition
> 
> Hmm. You mean no heirarchical tags? Not sure here. Or you mean
> tags are atomic... Fair enough.

Yes tags are atomic, that's a better way to put it.

> >   * NO guarantee of tags making a perfect tree (but parser can provide
> >     information about tree or partial-tree structures if they exist)
> 
> It's rafferty's rules, anyway.

If you want to call it that. I just think that there are useful
applications of non-tree tagged documents (e.g. breakdown by A4 pages
and breakdown by letter pages in the same document, along with the
normal document section markers... can't be done in XML because it
will always break the perfect tree).

Again, just like any other data analysis or template-based validation,
topology-analysis is a higher level function above the basic parser.
There's nothing wrong with allowing a whole host of additional (optional)
restrictions which can apply above and beyond the basic specification.
It might even be useful to classify additional restriction groups and
provide data analysis libraries to figure out which class your document
belongs to... but all that is an optional add-on. As I mentioned earlier,
you can take a robust protocol and deliberately make it brittle if that
is useful for your application but you can't go the other way. So start
with something simple and robust and add the brittle bits if and when
you happen to need them.
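
A sketch of what such an optional topology check might look like
(Python; spans given as (name, start, end) byte offsets and the
function name are assumptions for illustration). It reports which
pairs of spans overlap without nesting, which is exactly the
information a "perfect tree" format cannot even represent:

```python
def crossing_pairs(spans):
    """Return pairs of span names that overlap without one nesting in the other.

    spans is a list of (name, start, end) with start < end.
    An empty result means the spans happen to form a tree.
    """
    crossing = []
    for i, (name_a, start_a, end_a) in enumerate(spans):
        for name_b, start_b, end_b in spans[i + 1:]:
            a_then_b = start_a < start_b < end_a < end_b
            b_then_a = start_b < start_a < end_b < end_a
            if a_then_b or b_then_a:
                crossing.append((name_a, name_b))
    return crossing


# Sections and A4 page breaks marked up over the same 200 bytes of text.
spans = [
    ("section1", 0, 80),
    ("section2", 80, 200),
    ("a4-page1", 0, 120),
    ("a4-page2", 120, 200),
]
print(crossing_pairs(spans))
```

An application that needs a tree can reject documents where the
result is non-empty; one that doesn't care simply never calls it.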

> >   * when tags are all next to one another, ordering is NOT important
> >     (thus italic/bold is the same as bold/italic)
> 
> Order not important. OK, can't marshall arrays.

Not arrays of pure metadata, no. Nor should you be able to.

Arrays of data are perfectly OK because ordering of the data is the
natural stream ordering and all is well in the world. This restricts
users from trying to get too clever with parametric tags and it provides
another good definition of the data / metadata boundary.
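
In code terms, a run of adjacent tags behaves like a set while the
data keeps its stream order. A minimal sketch, assuming parser output
as (kind, value) pairs (that layout and the function name are made up
for this example):

```python
def group_adjacent_tags(items):
    """Collapse each run of adjacent tags into one unordered frozenset.

    Data items pass through untouched and keep their stream order.
    """
    grouped = []
    run = []
    for kind, value in items:
        if kind == "tag":
            run.append(value)
            continue
        if run:
            grouped.append(("tags", frozenset(run)))
            run = []
        grouped.append((kind, value))
    if run:
        grouped.append(("tags", frozenset(run)))
    return grouped
```

With this, italic/bold followed by some text compares equal to
bold/italic followed by the same text, while two data blocks in a
different order do not.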

> >   * at most one parameter per tag and not named parameters
> >     (because named parameters bend your head and get very complex and
> >     require special syntax and further because it is always better to
> >     introduce a new tag than introduce a new named parameter)
> 
> Simplicity is good. Damn parameters. I've always hated them in
> subroutines too.

This is a markup language, not a programming language.
There is no reason for it to support the features of a programming
language and I suspect that there are good reasons why it should not
(possibly related to why javascript in browsers is a security risk).

> >   * supports guarantee of resynchronisation to tag boundary after an
> >     arbitrary seek into the file (scanning forwards or backwards) and
> >     something that "seems to be" a tag boundary always IS a tag boundary
> 
> Ah, we need an escape character mechanism.

Yes, I think that's inevitable... mind you the XML escape character
mechanism is cumbersome and ugly. You only need to escape those small
number of characters that are magic synchronisation markers and quite
frankly using commonly used chars such as '<' and '>' for magic
synchronisation markers is exceptionally dumbass, doubly so when you
consider that ASCII already provides a bunch of low-numbered bytes for
exactly that purpose. Using the ESC char as an escape might even be
sensible. I may be guilty of disrespect to the deep thought that has
gone into the XML standard but the drafters of XML are twice as guilty
for their ignorance of the ASCII standard. I might add that ASCII
would have to be the single most successful standard in the computing
world (other than binary two's complement math).
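
A sketch of one way to get that guarantee (Python). The choice of
STX/ETX as markers and the "ESC, then the byte XOR 0x40" stuffing rule
are assumptions picked for illustration; the point is only that a raw
marker byte never appears inside data, so any marker you find after a
blind seek is a real one.

```python
# Hypothetical magic bytes: ASCII STX, ETX and ESC.
TAG_START, TAG_END, ESC = 0x02, 0x03, 0x1B
MAGIC = {TAG_START, TAG_END, ESC}


def escape(data: bytes) -> bytes:
    """Replace each magic byte with ESC followed by (byte XOR 0x40)."""
    out = bytearray()
    for byte in data:
        if byte in MAGIC:
            out += bytes([ESC, byte ^ 0x40])
        else:
            out.append(byte)
    return bytes(out)


def unescape(data: bytes) -> bytes:
    """Inverse of escape(); a truncated trailing ESC is silently dropped."""
    out = bytearray()
    it = iter(data)
    for byte in it:
        if byte == ESC:
            nxt = next(it, None)
            if nxt is None:
                break
            out.append(nxt ^ 0x40)
        else:
            out.append(byte)
    return bytes(out)


def resync_forward(stream: bytes, offset: int) -> int:
    """Offset of the first tag start at or after offset, or -1."""
    return stream.find(bytes([TAG_START]), offset)


def resync_backward(stream: bytes, offset: int) -> int:
    """Offset of the last tag start at or before offset, or -1."""
    return stream.rfind(bytes([TAG_START]), 0, offset + 1)
```

Only three byte values ever need escaping, and ordinary text
containing '<', '>' or '&' goes through untouched.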

> >   * case insensitive tag matching (for English at least plus any other
> >     language that sensibly defines mixed case)
> 
> Character encoding set.

ASCII is always supported, that covers case conversion for English.
It also probably covers 90% of tags that are ever going to get used.

After that, you can have some special way to bootstrap up more complex
encodings and case conversion tables. If the up-bootstrap fails for
whatever reason then you are guaranteed a fallback of ASCII.
Since data is opaque lumps anyhow, the parser itself doesn't even need
to know the encoding standard, it just recognises data and tags.
The only time it matters is inside a tag, and even then it only matters
because of trying to match up tags of the same name. The worst that
can happen is some tags don't match when they should do... this is
recoverable if the application knows which tags to expect. Moreover,
the application can rapidly apply several different encodings without
needing to re-parse the data because the structural breakdown doesn't
change. This actually applies to XML too... it's just that the
"perfect data or die" mentality coupled with a "do everything in one
big parsing phase" approach leads to an unlikelihood that such a
system will ever be implemented in an XML parser.
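
A sketch of that fallback in Python (the function names are
hypothetical; bytes.lower(), bytes.decode() and str.casefold() are
standard). Tag names stay as raw bytes out of the parser, so the
application can retry the comparison under another encoding without
re-parsing anything:

```python
def fold_tag(raw: bytes, encoding: str = "ascii") -> str:
    """Case-fold a raw tag name under the given encoding.

    If the bytes do not decode (or the encoding is unknown), fall back
    to ASCII-only folding and treat every other byte as opaque.
    """
    try:
        return raw.decode(encoding).casefold()
    except (UnicodeDecodeError, LookupError):
        # bytes.lower() only touches ASCII letters; latin-1 maps bytes 1:1.
        return raw.lower().decode("latin-1")


def tags_match(a: bytes, b: bytes, encoding: str = "ascii") -> bool:
    """True if two raw tag names are equal ignoring case."""
    return fold_tag(a, encoding) == fold_tag(b, encoding)


upper = "\u00c9T\u00c9".encode("utf-8")   # "ÉTÉ"
lower = "\u00e9t\u00e9".encode("utf-8")   # "été"
print(tags_match(b"Bold", b"BOLD"))       # plain ASCII tags
print(tags_match(upper, lower, "utf-8"))  # matched once the encoding is known
print(tags_match(upper, lower))           # ASCII fallback: no match, no crash
```

The failure mode in the last line is the benign one described above:
two tags fail to match when they should, and nothing else breaks.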


> >   * damaged files can be recovered by an automatic process at least to
> >     the extent that lost data is proportional to the amount of damage
> 
> By resyncing. But how far? DVB, have 187 (?) bytes. BUT that is a transport
> protocol. You put your packets together to make a block of data.
> 
> Of course, if all you ever do is have files, complete atomic
> gobs of data, the 187 bytes and resyncing and escape characters
> is all very inefficient.

Well if you use a guaranteed unique "start of tag" marker then spooling
until you hit one of those should guarantee you found a tag (the CDATA
system in XML hoses any attempt to trust "<" as a "start of tag" marker).
If you have a maximum limit on tag length (a good idea for sanity reasons)
then finding an "end of tag" marker within that limit makes you even
more sure you found a tag. Since you aren't depending on a perfect tree,
losing a few tags won't destroy all the other tags that you did not
lose. Higher level data analysis may be able to detect that your data
is damaged by matching against a template and failing to match on the
missing tags, it can then make a decision about the smallest chunk
that can be flagged as invalid in order to recover some sort of document
which will match the template. Such a recovery operation is application
dependent, which is why it belongs in a higher level analysis.
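
The low-level half of that recovery can be sketched as follows
(Python; the STX/ETX markers, the 64-byte limit and the function name
are assumptions for illustration). A candidate tag is accepted only
if its end marker turns up within the length limit and before any
other start marker:

```python
# Hypothetical markers and limit, chosen only for illustration.
TAG_START, TAG_END = 0x02, 0x03
MAX_TAG_LEN = 64


def recover_tags(stream: bytes):
    """Return (offset, name) for every plausible tag in a damaged stream."""
    found = []
    pos = 0
    while True:
        start = stream.find(bytes([TAG_START]), pos)
        if start == -1:
            return found
        # Room for a name of up to MAX_TAG_LEN bytes plus the end marker.
        window = stream[start + 1:start + 2 + MAX_TAG_LEN]
        end = window.find(bytes([TAG_END]))
        restart = window.find(bytes([TAG_START]))
        if end != -1 and (restart == -1 or end < restart):
            found.append((start, window[:end]))
            pos = start + 1 + end + 1
        else:
            # No believable end marker: skip this start and keep scanning.
            pos = start + 1


# Tag "b" has lost its end marker; tags "a" and "c" are still recovered.
damaged = b"\x02a\x03one\x02b!two\x02c\x03"
print(recover_tags(damaged))
```

Deciding what to do about the missing "b" (flag the surrounding chunk,
guess, or give up) is the template layer's problem, not this one's.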

The amount of template checking and data recovery you attempt will
depend on how much you care and that too is application dependent.

Then again, you could look at how Wikipedia keeps metadata and that
also supports high quality resynchronisation because it depends on 
particular patterns being detected and anything that doesn't match a 
known pattern is considered ordinary. No metadata tag depends on the
existence of other metadata tags for its own existence. Depending on
pattern matching also suits humans because human brains use a similar
technique for recognition. For example, something that "looks like" a
pair of eyes staring at you out of the jungle is recognised as a
potential predator because being hunted is a high precedence operator
for a survival machine (slightly mixed metaphor, but you see what I'm
getting at).
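
As a small illustration of the pattern-matching style (Python, using
the double-square-bracket link syntax that MediaWiki-style wikis use;
the regular expression is a simplified assumption of mine, not the
real wiki grammar):

```python
import re

# Simplified: [[target]] or [[target|label]], with no nested brackets.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def find_links(text):
    """Return (target, label) for each link; label is None if absent."""
    return [(m.group(1), m.group(2)) for m in LINK.finditer(text)]


text = "See [[XML]] and [[SGML|its parent]]; a stray [[ is just text."
print(find_links(text))
```

The unmatched "[[" is simply treated as ordinary text, and it does
not stop the two well-formed links from being recognised.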

I mean, if XML is so fantastic, why does every Wiki avoid it
(for the wiki data pages at any rate)?

> That's why you might go for a reliable transport protocol and
> then try to parse known, good data. Whoops, that's the xml premise...

... until you want to seek, or someone manages to introduce a single
byte of bad data into your database through perhaps a bug or simple
user idiocy that no programmer was expecting, or someone wanted to
tweak a record in the data but there was no program function for it
so they just tweaked the record in a text editor (and screwed something
at the same time) or all of those other things that really do happen
especially when everyone is sure they can't happen.

> More power to you Telford. What we need is a special ISO sub comittee
> and some funding to study this problem further, and more in depth.

Do the ISO actually contribute funding to anything? I thought that
their internationally recognised purpose was to soak up funding.
I'm certainly willing to contribute my ability to also soak up funding
if that assists the overall cause.

> Preferrably, somewhere warm at the moment. Let's get our application in
> before those xml b******s cut us off at the knees.

I'm 100% confident that XML will cut itself off at the knees.

Programming fads go in approximately 10 year cycles and I think XML is
about 5 years in and already people are asking "why bother?"
There's always a next big thing that is so much better than the last
big thing. XML largely rode in on the coat-tails of HTML and the WWW
anyhow and enough people are around now who don't feel awed by the
internet that trying to bluff them by saying that they can't transfer
data without a special incantation is getting difficult.

The other cool thing is that since XML is so nicely structured,
any existing documents that DO go through XML parsers can rapidly
be converted to any other tagged format. If a new format comes along
that does everything XML does AND is easier to use AND more robust
AND supports more documents then adoption is relatively painless.

        - Tel  ( http://bespoke.homelinux.net/ )

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iQIVAwUBQqrxvcfOVl0KFTApAQKq2Q//XHCW9Flhz0Qg/CjBuBjVaix4DHW+jIC3
wIdoTpiZAtHy4Z73ic3ERIw+LVujV2nr+uAegeo9nvztROlgtAXsoZmbMzwt7bIb
8g/veWML8qg0n5bSqtww/4FC71//FzZZAOZyuNebF5GjoZ7HM5+DsvDn8OS1zAZp
bOTQ6qbtT14uNjHxNMAn9JBAiJmNC61rcj+bOOLiG+lYNu0i+g5/9CuDWiEp7bX8
NeIA6jhsU6drn3SdOJDjd9dSVa4aO6HyQ3At1a484nY7ck6MM3F0vkyfYXyRbdWY
NShfNn7k6qzyNwfYyLsQJXJhNKu2PowOQnzs1p6BHmcm+j9FC2faNLbjdWiesWta
sUuZb2cfpt5Fg1PIu88cDNmYZKJSGtdApYRkdnMK+iwlo3VHODvtSMovZqf1pYvK
yl+YZRsb5g5jzfthX2ypzarNbro9uVhr00cetjvZwAx2mgAX8klsYAa3ve0Y3u86
a6DHlwo9SBwTdN54HasRhShoGIqXFkIy1g+X11J8nuLG8FudtPA478OLIBlAa1ro
8/hBWbS7ql/fGZU5MYUbr27q/eeR7oIGdx7hKbB4qEuIv1/1aUfKXFK98vJETjs3
8Rxm7QZxHnEz6rCtx/8xWJH+h/Oqjujw4iVgdeDDFXjBkmtkqn/JgpylPyw03xui
PHX1smehBV8=
=UODv
-----END PGP SIGNATURE-----
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
