At 04:58 PM 2/12/02 +1100, you wrote:
>Afternoon!
>I have been using HTML tidy from W3C but the Debian package is way out
>of date.
>
>I was wondering if people could recommend other HTML tidiers and
>validators that they use.

Dear Simon,

Firstly I use the W3C validator at least once a week and think it's the
ant's pants.  Also I send the W3C URL to people who have really sucky web
pages full of coding errors.  It's amazing the number of people who have
so-called "state of the art" web pages.  Then you find they don't sit
properly, or won't print properly, or were designed for 1600 wide screens
and I only have 800 wide on the net.  Guess what - bad pages are full of
coding errors.

Moving right along, I have downloaded and used HTML Tidy.  Actually I have
not found it very satisfactory at all, as it does not fold lines at the
correct place for my needs.  It is not very "intelligent", does not
greatly improve readability and does not remove leading rubbish (tabs and
spaces).  I guess there are lots of HTML-tidy programmes out there, but
possibly my requirements are at one end of the spectrum.

Basically my spec is as follows....

* Fold all text at around column 70 or 75.
* If possible fold text at a full stop or at a comma.

* Fold all tags, but more leniently than text.
* Tags not containing spaces must not be folded, as this could break long
URLs that fetch counters etc.

* All leading tabs and spaces are removed.  I do not support the concept of
indenting at all, it just fills the already congested net with zillions of
spaces and tabs.

* Next, all common <tags> are classified/defined as either "must be moved to
be at the start of a line" or "must come at the end of a line" or "must be
on a line by itself" or (default) can be moved if desired.

* All internal CR/LF characters are then removed, and get replaced by those
inserted by the folding process.  Some html files are only a single line,
these get folded and become readable.

* Excessive blank lines are removed.

* Some intelligent folding algorithms are needed to fold lines sensibly
where a line is part tag and part text, or where a line contains several
tags. The first choice answer is to fold when needed at the >< point between
two tags.  Generally processing from the right side of a long line is the
way to figure out how to fold a line.
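
The folding rules above can be sketched roughly as follows.  This is a
minimal Python illustration, not the VB3 programme described below.  The
names normalise, find_break and fold, the width of 72, and the choice to
search the right-hand half of the window first are all assumptions made for
the example.  It does not treat <pre> blocks specially, so it would reflow
them.

```python
def normalise(html):
    """Drop leading tabs/spaces and blank lines, then join into one long line."""
    lines = [ln.lstrip(" \t") for ln in html.splitlines()]
    # Joining with a space keeps words on adjacent lines apart.
    return " ".join(ln for ln in lines if ln.strip())


def find_break(line, width=72):
    """Return the index at which to cut a line that is longer than width."""
    # Preference order: between two tags, after a full stop, after a comma,
    # then any space.  The number is how far past the match to cut.
    markers = (("><", 1), (". ", 2), (", ", 2), (" ", 1))
    # Search from the right: first the right-hand half of the window
    # (an assumed heuristic to avoid very short lines), then all of it.
    for start in (width // 2, 0):
        for marker, offset in markers:
            pos = line.rfind(marker, start, width)
            if pos != -1:
                return pos + offset
    # Nothing usable inside the window, e.g. a long URL with no spaces.
    # Let the line overrun to the next space or >< point instead of
    # cutting through the middle of it.
    later = [line.find(m, width - 1) for m in (" ", "><")]
    later = [p + 1 for p in later if p != -1]
    return min(later) if later else len(line)


def fold(text, width=72):
    """Fold one long line into a list of shorter lines."""
    out = []
    rest = text.strip()
    while len(rest) > width:
        cut = find_break(rest, width)
        out.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        out.append(rest)
    return out
```

Typical use would be fold(normalise(page)), which turns a single-line HTML
file into readable lines.  Because the only cut points are spaces and ><
boundaries, a tag or URL containing no spaces is never split.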


I have written a first cut of all this in VB3 (of all things) but it only
does the first 6000 characters to date (don't ask, it's part of something
larger).  So far so good, I am much happier with the results than with HTML
Tidy.  I have three web sites waiting to be "tidied", but it is critical to
test thoroughly so that the folding is correct, and that characters are not
being lost, and spurious characters are not being added in error.
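
One possible way to test for lost or spurious characters is to compare the
file before and after tidying with all whitespace stripped out, since
folding should only ever change whitespace.  A minimal Python sketch,
assuming a hypothetical helper called first_difference:

```python
def first_difference(original, tidied):
    """Compare two texts ignoring all whitespace.

    Returns None if they match, otherwise the position (counted in
    non-whitespace characters) of the first mismatch.
    """
    a = "".join(original.split())
    b = "".join(tidied.split())
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One text may simply be a truncated copy of the other.
    return None if len(a) == len(b) else min(len(a), len(b))
```

This check is deliberately loose.  It would not notice a space dropped
between two words, so it catches lost or added visible characters but not
every whitespace mistake.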

The critical part is defining the category for each tag.  For example I want
all <table> and </table> tags to be on a line by themselves.  The aim is
readability and obviousness of what is going on.  I think de-mystifying the
table structures is critical to seeing what is happening on a page.
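
The tag categories could be held in a simple lookup table.  The Python
sketch below is only an illustration.  The <table> and </table> entries
come from the paragraph above, while the other entries, the category names
and the place_tags function are hypothetical.  The tag pattern is also
simplistic and would be confused by a ">" inside an attribute value.

```python
import re

OWN_LINE, LINE_START, LINE_END, FREE = "own", "start", "end", "free"

# Only the table entries are taken from the post; the rest are guesses
# included to show the other categories.
CATEGORY = {
    "table": OWN_LINE,
    "/table": OWN_LINE,
    "tr": LINE_START,
    "/tr": LINE_END,
    "br": LINE_END,
}

# Matches an opening or closing tag and captures "/" (if any) and the name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def place_tags(html):
    """Insert line breaks around tags according to their category."""
    def mark(match):
        key = match.group(1) + match.group(2).lower()
        cat = CATEGORY.get(key, FREE)  # unlisted tags may go anywhere
        before = "\n" if cat in (OWN_LINE, LINE_START) else ""
        after = "\n" if cat in (OWN_LINE, LINE_END) else ""
        return before + match.group(0) + after

    marked = TAG.sub(mark, html)
    # Drop the empty lines created where two forced breaks meet.
    return [ln.strip() for ln in marked.split("\n") if ln.strip()]
```

For example, place_tags('<table border=1><tr><td>a</td></tr></table>')
puts <table border=1> and </table> on lines by themselves with the row
between them.  Each resulting line could then be folded separately if it is
still too long.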

I have a reasonable collection of html test files collected from all over,
and the programme will have to do a reasonable job on all of them.  The aim
is to have a programme which will do an OK job on most files, rather than a
programme with a million options waiting to be set.  Obviously such a
programme could process anything perfectly, but only after the options are
set correctly.

I hope the ideas in the spec above help.  Further contributions and comments
are most welcome.  

Lastly, when "tidying" files one becomes aware of the poor standard of
coding out there, and also the weird code and weird source code layouts
produced by well known html editors. 

Brian

-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug
