Re: [Templates] Truncate that preserves HTML tags ?

Josh Rosenbaum Fri, 10 Apr 2009 14:47:14 -0700

Lee.M wrote:
>> HTML::Parser can usually handle improper HTML better at the expense  
>> of speed.
> 
> I think it uses HTML::Truncate under the hood


I think you meant HTML::Truncate uses HTML::Parser under the hood. :P (I 
checked and that appears to be true.) HTML::Parser is awesome. I've used it for 
all sorts of things.


>> HTML::Strip is written in XS and says it's about 5 times quicker than
>> regexp. Whether that's true or not is up to someone else to test.
> 
> I doubt that in this case. Naturally XS is "fast" and regex can be
> considered "slow", but Strip looks to be fairly convoluted: you have to
> create an object, set tags, call the parse method, and tell it you're done
> (why 'eof'? That has nothing to do with what we are doing...). In
> other words 10 pounds of XS is still heavier than an ounce of regex :)
> 
> Plus it optionally decodes HTML entities (which *is* a bunch of
> regexes); decoding those is really 'clean up' or 'reformatting', not
> 'stripping'. I dunno, if I just want 100% of all HTML gone I'd almost
> bet that HTML::Obliterate would be faster than HTML::Strip. If I
> wanted to turn certain entities into their regular version I'd use
> HTML::Entities to do it, then rip out the leftover HTML (including
> entities I don't want preserved).
[SNIP]
> HTML::Obliterate:
>    real       0m0.031s
>    user       0m0.018s
>    sys        0m0.008s
> 
> HTML::Strip:
>    real       0m0.047s
>    user       0m0.026s
>    sys        0m0.010s
> 
> On a side note, the command using HTML::Strip uses approx. 1/3 MB more
> memory.
>
> Also I noticed that as I increased the size of the HTML being parsed
> the times remained about the same relatively, *but* HTML::Strip's
> memory use grew; HTML::Obliterate's did not. I'd say HTML::Strip
> needs to put some of its XS mojo to better use than making misleading
> claims :)

[SNIP]

That's interesting. However, it'd probably be fairer to disable the HTML
entity decoding in HTML::Strip before benchmarking. (Although you may then
need a separate regexp to make sure the entities are removed.)

You also have to keep in mind that certain modules handle certain corner cases
better. For example, the HTML::Strip docs specifically mention handling this
case:
<!-- <a href="old.htm">old page</a> -->.

I don't think the HTML::Obliterate code will handle that correctly, since a
simple tag-matching regexp would most likely leave the commented-out text
behind.
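
To make the comment corner case concrete, here is a minimal sketch in Python
(not Perl, and not the actual source of either module). The patterns TAG_RE,
ENTITY_RE and COMMENT_RE and both function names are assumptions made up for
illustration; they only approximate what a "couple of simple regexps" approach
might look like.

```python
import re

# Hypothetical patterns, NOT taken from HTML::Obliterate or HTML::Strip.
TAG_RE = re.compile(r"<[^>]*>")            # naive "anything between < and >"
ENTITY_RE = re.compile(r"&[^;\s]+;")       # naive "anything between & and ;"
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # whole HTML comment


def strip_naive(html):
    """Remove tags, then entities, with two simple regexps."""
    return ENTITY_RE.sub("", TAG_RE.sub("", html))


def strip_comment_aware(html):
    """Drop whole comments first, then fall back to the naive stripping."""
    return strip_naive(COMMENT_RE.sub("", html))


sample = '<p>new page</p><!-- <a href="old.htm">old page</a> -->'

# The naive tag pattern stops at the first ">", which sits inside the comment,
# so the commented-out link text and the closing "-->" are left behind.
print(repr(strip_naive(sample)))
print(repr(strip_comment_aware(sample)))
```

With these assumed patterns, the naive version leaks "old page" and a stray
"-->" into the output, while removing comments first gives just "new page".
Even the comment-aware version is still only a sketch; it does nothing for
script or style blocks, or for a ">" inside an attribute value.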

If HTML::Obliterate were to handle all these corner cases using regexps, then
HTML::Strip's claims may actually be correct. It appears the code in
HTML::Obliterate is a couple of very simple regexps that won't handle all
cases. After looking at the source, I wouldn't even bother using the module,
since you could just use the two simple regexps yourself.

-- Josh

_______________________________________________
templates mailing list
[email protected]
http://mail.template-toolkit.org/mailman/listinfo/templates