Lee.M wrote:
>> HTML::Parser can usually handle improper HTML better at the expense
>> of speed.
>
> I think it uses HTML::Truncate under the hood

I think you meant HTML::Truncate uses HTML::Parser under the hood. :P
(I checked and that appears to be true.) HTML::Parser is awesome. I've
used it for all sorts of things.

>> HTML::Strip is written in XS and says it's about 5 times quicker than
>> regexp. Whether that's true or not is up to someone else to test.
>
> I doubt that in this case. Naturally XS is "fast" and regex can be
> considered "slow", but Strip looks to be fairly convoluted: you have to
> create an object, set tags, call the parse method, and tell it you're
> done (why 'eof'? That has nothing to do with what we are doing....). In
> other words 10 pounds of XS is still heavier than an ounce of regex :)
>
> Plus it optionally decodes HTML entities (which *is* a bunch of
> regexes); decoding those is really 'clean up' or 'reformatting', not
> 'stripping'. I dunno, if I just want 100% of all HTML gone I'd almost
> bet that HTML::Obliterate would be faster than HTML::Strip. If I
> wanted to turn certain entities into their regular version I'd use
> HTML::Entities to do it, then rip out the leftover HTML (including
> entities I don't want preserved)

[SNIP]

> HTML::Obliterate:
> real 0m0.031s
> user 0m0.018s
> sys 0m0.008s
>
> HTML::Strip:
> real 0m0.047s
> user 0m0.026s
> sys 0m0.010s
>
> On a side note the command using HTML::Strip uses approx. 1/3 MB more
> memory.
>
> Also I noticed that as I increased the size of the HTML being parsed
> the times remained about the same relatively *but* HTML::Strip's
> memory use grew, HTML::Obliterate's did not. I'd say HTML::Strip
> needs to put some of its XS mojo to better use than making misleading
> claims :)

[SNIP]

That's interesting. However, it would probably be fairer to disable the
HTML entity decoding in HTML::Strip before benchmarking. (Although you
may then need a separate regexp to make sure the entities are removed.)

You also have to keep in mind that certain modules handle certain corner
cases better.
For example, the HTML::Strip docs specifically mention handling this
case: <!-- <a href="old.htm">old page</a> -->. I don't think the
HTML::Obliterate code will handle that correctly.

If HTML::Obliterate were to handle all these corner cases using regexps,
then HTML::Strip's speed claims might actually be correct. The code in
HTML::Obliterate appears to be a couple of very simple regexps that
won't handle all cases. After looking at the source, I wouldn't even
bother using the module, since you could just use the two simple regexps
directly.

--
Josh
_______________________________________________
templates mailing list
[email protected]
http://mail.template-toolkit.org/mailman/listinfo/templates
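
To illustrate the comment corner case raised in the message above, here is a minimal sketch. It is written in Python rather than Perl purely as a neutral illustration, and it is not the actual HTML::Obliterate or HTML::Strip source. The patterns `TAG_RE`, `ENTITY_RE` and `COMMENT_RE` and both function names are hypothetical stand-ins for "two simple regexps" and for one possible fix.

```python
import re

# Hypothetical stand-ins for a "couple of simple regexps" stripper.
# These are assumptions for illustration, not any module's real patterns.
TAG_RE = re.compile(r"<[^>]*>")        # anything from '<' up to the next '>'
ENTITY_RE = re.compile(r"&#?\w+;")     # named or numeric entities, e.g. &amp; or &#160;
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # a whole HTML comment


def strip_naive(html):
    """Remove tags, then entities, with no special handling of comments."""
    return ENTITY_RE.sub("", TAG_RE.sub("", html))


def strip_comment_aware(html):
    """Remove whole comments first, then fall back to the naive stripper."""
    return strip_naive(COMMENT_RE.sub("", html))


sample = 'new <!-- <a href="old.htm">old page</a> --> text'

# The tag regexp treats the first '>' inside the comment as the end of a
# tag, so the commented-out link text and the '-->' leak into the output.
print(repr(strip_naive(sample)))          # 'new old page --> text'

# Dropping the comment as a unit first avoids the leak.
print(repr(strip_comment_aware(sample)))  # 'new  text'
```

Each extra corner case handled this way adds another pass over the input, which is the trade-off the message points at: a regexp approach that covers every case may no longer be the faster option.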
