[Repoze-dev] [issue103] repoze.bitblt removes doctype

2010-01-28 Thread Brian Sutherland
Brian Sutherland added the comment: Fixed in revision 8095 by using regexes instead of lxml to parse img tags. -- status: chatting -> resolved
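For readers following the thread, the regex-based approach mentioned in this comment can be illustrated roughly as follows. This is a minimal sketch, not the code from revision 8095: the patterns and the `rewrite_img_src` helper are hypothetical names invented here. The point is only that a regex substitution touches the img tags and leaves everything else, including the doctype, byte-for-byte unchanged.

```python
import re

# Hypothetical patterns (assumptions of this sketch, not repoze.bitblt's own):
# a whole <img ...> start tag, and a quoted src attribute inside it.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(r"""(\ssrc\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)


def rewrite_img_src(html, rewrite):
    """Apply rewrite(src) to the src of each img tag; leave the rest untouched."""

    def fix_src(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        return prefix + quote + rewrite(value) + quote

    def fix_tag(match):
        # Only the first src attribute inside the matched tag is rewritten.
        return SRC_ATTR.sub(fix_src, match.group(0), count=1)

    return IMG_TAG.sub(fix_tag, html)


doc = '<!DOCTYPE html>\n<html><body><img src="a.png" width="10"></body></html>'
print(rewrite_img_src(doc, lambda src: "/scaled/" + src))
```

The usual caveats about regexes and HTML apply: this sketch would be fooled by a `>` inside an attribute value, by unquoted attribute values, or by img-like text inside comments and scripts.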

Re: [Repoze-dev] [issue103] repoze.bitblt removes doctype

2010-01-27 Thread Malthe Borch
2010/1/27 Brian Sutherland:
> So in the next week or so, I will try to re-implement regular expressions to find and replace the tags. Given that malthe seems to think it's a reasonable idea I'll do it inside repoze.bitblt on a branch first.
Please do it on trunk, and do it well. I've want

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2010-01-27 Thread Brian Sutherland
Brian Sutherland added the comment: I've just discovered yet another way in which lxml is mangling my HTML. I'm fed up with fixing around the edges. So in the next week or so, I will try to re-implement regular expressions to find and replace the tags. Given that malthe seems to think it's a

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-13 Thread Brian Sutherland
Brian Sutherland added the comment: You're right on the slowness of beautifulsoup compared to lxml: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-13 Thread Malthe Borch
Malthe Borch added the comment: Speed matters; ``lxml`` is very fast and does not incur any significant overhead. I feel that BeautifulSoup would (this might not be correct). Regular expressions are very fast, and sometimes brittle. Everything's a compromise. However, image-tags are usually

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-13 Thread Brian Sutherland
Brian Sutherland added the comment: As an alternative to regexes, there's always BeautifulSoup http://www.crummy.com/software/BeautifulSoup/documentation.html -- nosy: +jinty

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-13 Thread Brian Sutherland
Brian Sutherland added the comment: I was also bitten by this. Attached is the patch I am using; it includes and expands on the originally posted patches, using dbaty's "more complex than it should be" method to keep the doctype out of HTML that didn't already have it. Using regexes does sta

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-11 Thread Malthe Borch
Malthe Borch added the comment: Perhaps we can use ``lxml`` to extract the locations (string start- and end- ranges) for the tags and then simply use regex matching on those. This way, the original document isn't changed, but we don't have the pitfalls of heuristics.
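The "locate with a parser, edit the original string" idea in this comment can be sketched as below. Note the substitution: this sketch uses the standard library's `html.parser` as a stand-in for ``lxml``, which is an assumption made here and not what the thread proposes. `ImgLocator` and `find_img_spans` are hypothetical names. The parser is used only to find where each img start tag begins and ends; the document itself is never re-serialized.

```python
from html.parser import HTMLParser


class ImgLocator(HTMLParser):
    """Record (start, end) string offsets of every img start tag."""

    def __init__(self, line_starts):
        super().__init__()
        self.line_starts = line_starts  # index of the first character of each line
        self.spans = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            line, col = self.getpos()  # 1-based line, 0-based column of the tag
            start = self.line_starts[line - 1] + col
            self.spans.append((start, start + len(self.get_starttag_text())))


def find_img_spans(text):
    """Return offsets of img start tags in text, without modifying text."""
    line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
    locator = ImgLocator(line_starts)
    locator.feed(text)
    locator.close()
    return locator.spans


doc = '<!DOCTYPE html>\n<p>x</p>\n<img src="a.png">\n<IMG src="b.png"/>'
for start, end in find_img_spans(doc):
    print(start, end, doc[start:end])
```

Each returned span can then be handed to a regex that rewrites only that slice, so the surrounding markup (doctype included) is copied through unchanged.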

Re: [Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-11 Thread Malthe Borch
2009/11/11 Damien Baty:
> Malthe, I think you have replied to the wrong ticket. The patch I described has not been applied (and regular expressions, well, we can use them everywhere, of course, but... ;) )
Perhaps we can use ``lxml`` to extract the locations (string start- and end- rang

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-11 Thread Damien Baty
Damien Baty added the comment: Malthe, I think you have replied to the wrong ticket. The patch I described has not been applied (and regular expressions, well, we can use them everywhere, of course, but... ;) ) (Note: I plan to commit the patch next Friday when I have time.) -- topic:

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-11 Thread Malthe Borch
Malthe Borch added the comment: I see this patch has already been applied. Perhaps we should consider using regular expressions to do this. Chances are that it'll be a) faster, b) less intrusive. -- status: unread -> chatting

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-03 Thread admin
System message: __ Repoze Bugs __ ___ Repoze-dev mailing list Repoze-dev@lists.repoze.org http://lists.repoze.org/listinfo/repoze-dev

[Repoze-dev] [issue103] repoze.bitblt removes doctype

2009-11-03 Thread Damien Baty
New submission from Damien Baty: When rewriting image tags, repoze.bitblt removes the doctype of any (X)HTML content (cf. attached test). It should not. I have found a fix for XHTML code (cf. attached patch) by changing how the content is parsed. However, the bug persists for HTML content (wh