Eric Brunson wrote:
> Sebastien Noel wrote:
>
>> Hi,
>>
>> I'm writing a little script with the help of the BeautifulSoup HTML parser
>> and uTidyLib (an HTML Tidy wrapper for Python).
>>
>> Essentially, it fetches all the HTML files in a given directory (and
>> its subdirectories) and cleans the code with Tidy (removes deprecated
>> tags, changes the output to XHTML), and then BeautifulSoup removes a
>> couple of things that I don't want in the files (because I'm stripping
>> the files to the bare bones, just keeping layout information).
>>
>> Finally, I want to remove all traces of layout tables (because the new
>> layout will use CSS for positioning). Now, there are tables that lay out
>> things on the page and tables that represent tabular data, but I think it
>> would be too hard to write a script that tells the difference.
>>
>> My question, since I'm quite new to Python, is what tool I should
>> use to remove the table, tr and td tags, but not what's enclosed in them.
>> I think BeautifulSoup isn't good for that because it removes what's
>> enclosed as well.
>>
>
> You want to look at htmllib: http://docs.python.org/lib/module-htmllib.html
>
I'm sorry, I should have pointed you to HTMLParser:
http://docs.python.org/lib/module-HTMLParser.html

It's a bit more straightforward than the HTMLParser defined in htmllib.
Everything I was talking about below pertains to the HTMLParser module and
not the htmllib module.

> If you've used a SAX parser for XML, it's similar. Your parser parses
> the file, and every time it hits a tag, it runs a callback that you've
> defined. You can assign a default callback that simply prints out the
> tag as parsed, then a custom callback for each tag you want to clean up.
>
> It took me a little time to wrap my head around it the first time I used
> it, but once you "get it" it's *really* powerful and really easy to
> implement.
>
> Read the docs and play around a little bit, then if you have questions,
> post back and I'll see if I can dig up some examples I've written.
>
> e.
>
>> Is re a good module for that? Basically, if I write a loop that
>> scans the text and tries to match every occurrence of a given regular
>> expression, would that be a good idea?
>>
>> Now, I'm quite new to the concept of regular expressions, but would it
>> resemble something like this: re.compile("<table.*>")?
>>
>> Thanks for the help.
>> _______________________________________________
>> Tutor maillist - Tutor@python.org
>> http://mail.python.org/mailman/listinfo/tutor
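
For what it's worth, here is a minimal sketch of the callback approach
described above. It assumes Python 3, where the HTMLParser module lives at
html.parser; the names TableStripper, STRIP_TAGS and strip_tables are made up
for illustration and are not part of any library. The "default" callbacks
write everything back out as parsed, and the tag callbacks skip table, tr and
td while leaving their contents alone.

```python
from html.parser import HTMLParser

# Tags whose markup is dropped while their contents are kept
# (taken from the question: table, tr and td).
STRIP_TAGS = {"table", "tr", "td"}


class TableStripper(HTMLParser):
    """Re-emit the input as parsed, except for the tags in STRIP_TAGS."""

    def __init__(self):
        # convert_charrefs=False so entities reach the callbacks below
        # and can be written back out the way they appeared.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in STRIP_TAGS:
            # get_starttag_text() returns the tag exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> arrive here.
        if tag not in STRIP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in STRIP_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_tables(html):
    """Return html with table/tr/td tags removed but their contents kept."""
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_tables() on the text of each file after the Tidy pass should
leave everything else in place. As for the regular expression: the .* in
"<table.*>" is greedy, so it matches through to the last ">" on the line and
would take content along with the tag, which is one reason a parser tends to
be the safer tool here.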