Thanks Christopher,
I'm not surprised it doesn't act the same as Netscape - we've all spent
hours on HTML compatibility across Netscape/IE/other browsers (& their
versions).
I've only been using the parser for things of interest to a search
engine - i.e. <TITLE>, <META> tags etc, links & body text.
I'm handling all exceptions thrown by the parser, any dead links etc., by
just marking the page 'bad', and this has resulted in only a handful of
'bad' pages out of thousands copied from the Web. Manually scanning
through the extracted data, I haven't seen any obvious problems.
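A minimal sketch of that 'mark it bad' approach, assuming the page arrives as a Reader. The class and method names (BadPageCheck, parsesCleanly) are made up for illustration and are not from my actual code:

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: decides whether a page should be marked 'bad'.
class BadPageCheck {
    /** Returns true if the page parsed without throwing, false if it is 'bad'. */
    static boolean parsesCleanly(Reader in) {
        try {
            // No-op callback: here we only care whether parsing completes.
            // The final 'true' tells the parser to ignore charset directives.
            new ParserDelegator().parse(in, new HTMLEditorKit.ParserCallback(), true);
            return true;
        } catch (IOException | RuntimeException e) {
            // Dead link, broken stream, or input the parser could not cope with.
            return false;
        }
    }
}
```

A crawler loop would call parsesCleanly() (or wrap its real callback the same way) and flag the page rather than abort the whole run.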
I have customised things a little, to only pull the bits I'm interested
in i.e.:
public class Hparser extends HTMLEditorKit.ParserCallback {...
then using something like :
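ParserDelegator pd = new ParserDelegator();
// in: a Reader over the page; hparse: an Hparser instance;
// true: ignore charset directives in the page
pd.parse(in, hparse, true);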
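For anyone wanting a starting point, here is a rough sketch of that kind of callback. It is not my actual Hparser: the class name SearchCallback, its fields, and the static parse() helper are all assumptions for illustration. It collects the title, META name/content pairs, link targets, and the remaining text:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for Hparser: keeps only what a search engine needs.
class SearchCallback extends HTMLEditorKit.ParserCallback {
    final StringBuilder title = new StringBuilder();
    final StringBuilder text = new StringBuilder();
    final List<String> links = new ArrayList<>();
    final List<String> metas = new ArrayList<>();
    private boolean inTitle;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TITLE) {
            inTitle = true;
        } else if (t == HTML.Tag.A) {
            // Record the link target, if the anchor has one.
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) links.add(href.toString());
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TITLE) inTitle = false;
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // <META> has no end tag, so the parser reports it as a simple tag.
        if (t == HTML.Tag.META) {
            Object name = a.getAttribute(HTML.Attribute.NAME);
            Object content = a.getAttribute(HTML.Attribute.CONTENT);
            if (name != null && content != null) metas.add(name + "=" + content);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Text inside <TITLE> goes to the title; everything else to the text.
        (inTitle ? title : text).append(data).append(' ');
    }

    /** Parses one page and returns the filled-in callback. */
    static SearchCallback parse(Reader in) throws IOException {
        SearchCallback cb = new SearchCallback();
        new ParserDelegator().parse(in, cb, true);
        return cb;
    }
}
```

Overriding only the handlers you care about keeps the callback small; everything else falls through to the empty defaults in ParserCallback.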
Cheers,
Danny.
> Not so serious: The Swing parser handled invalid HTML differently than
> Netscape. Actually, this was serious for me, but it's completely unfair
> to expect anything else. My particular problem was improperly closed
> quotes around attribute values.
> More serious: Some input files (invalid and very bizarre, but definitely
> found in the wild) were giving me exceptions. This isn't really
> a bug in the Swing parser, of course, since the input files weren't
> proper HTML.
>
> If you control the original HTML (like for a template system) none
> of the error recovery stuff really matters, since you can always fix the
> HTML to be valid.
>
> -cks
>
> ___________________________________________________________________________
> To unsubscribe, send email to [EMAIL PROTECTED] and include in the body
> of the message "signoff SERVLET-INTEREST".
>
> Archives: http://archives.java.sun.com/archives/servlet-interest.html
> Resources: http://java.sun.com/products/servlet/external-resources.html
> LISTSERV Help: http://www.lsoft.com/manuals/user/user.html
--
Intermittent Web site :
http://members.xoom.com/dayers/_XOOM/dayers/index.html
Alternate email :
[EMAIL PROTECTED]