On 9/8/06 7:57 PM, "Richard Gaskin" <[EMAIL PROTECTED]> wrote:
> Jim Ault wrote:
>
>> Cubist is correct. Any well-formed page will have balanced tags and only
>> use the < and > chars to mean tag markers.
>
> But can one deliver a product which assumes all the html thrown at it
> will be well-formed?
>
> So I have two questions about the sort of variable-based methods for
> filtering SGML-style tags and using a field object to do the same:
>
> 1. Which is more forgiving of html which may not be well-formed?
>
> 2. Which is faster?

My quick comment is to consider the sources of the data and its intended use.

If you are gathering content for human review, it is possible to build many tools that refine 'raw content' and even index it using a controlled vocabulary.

In my projects, the data needs to be mined and honed without human intervention, so the 'smart' functions need to be applied judiciously. Further, since my text blocks are small, I can afford more elaborate steps that involve RegEx and error checking. Additionally, if any data is suspicious, it can be discarded with little penalty.

> 1. Which is more forgiving of html which may not be well-formed?

I would favor a decision tree that applies specific rules, finds exceptions, and tries to react to them; in other words, variable-based methods with elaborate parsing rules.

So many html sources are generated by database engines these days that errors which make no difference to the viewer will be propagated throughout a site. This means that html tags which look 'bad' to a parser like yours make no difference to a site manager, so there is no reason for anyone to fix them.

A parallel is the use of OCR (optical character recognition) software. How fast do you want to go to get to 85% correct... 95% correct... and then have a reviewer do the final editing?

If you are repeatedly mining the same sites (e.g. news agencies, competitors), then it is easier. Random authors and sites are more difficult.

Hope this gives you a bit of my opinion, but everyone's mileage can and will vary.

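As a rough illustration of the "apply rules, find exceptions, discard what looks suspicious" idea above, here is a minimal sketch. It is written in Python rather than Transcript purely for illustration, the names TAG_RE and strip_tags are hypothetical, and the rule that any leftover < or > marks a block as suspicious is an assumption, not a description of any particular product.

```python
import re
from typing import Optional

# Rule: a complete tag starts with <, ends with >, and has no < or > inside.
TAG_RE = re.compile(r"<[^<>]*>")


def strip_tags(block: str) -> Optional[str]:
    """Return the text of a small html block with its tags removed.

    Returns None when the block looks suspicious, so the caller can
    discard it instead of trusting a bad parse.
    """
    # Step 1: remove everything that looks like a complete tag.
    text = TAG_RE.sub(" ", block)

    # Step 2 (exception check): a leftover < or > means a tag marker
    # was unbalanced, so the block is not well-formed.
    if "<" in text or ">" in text:
        return None

    # Step 3: collapse the whitespace left behind by the removed tags.
    return " ".join(text.split())


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p>"))  # well-formed block
    print(strip_tags("<p>broken <b world</p>"))     # unbalanced tag marker
```

With small text blocks the extra check costs little, and a block that fails it is simply dropped. Note that this sketch also discards legitimate text containing a bare < or >, which is only acceptable when losing a block carries little penalty.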
Jim Ault
Las Vegas
