Hello,
I'm getting erratic results from a regex. I'm hoping someone can
pinpoint the problem.
This function removes HTML formatting codes from a text email that is
poorly exported -- it is supposed to be a text version of an HTML
mailing, but it's basically just a text version of the HTML page. I'm
not after anything elaborate, but it has gotten to be a bit of an
itch. ;-)
def parseFile(inFile) :
    import re
    bSpace = re.compile("^ ")
    multiSpace = re.compile(r"\s\s+")
    nbsp = re.compile(r"&nbsp;")
    HTMLRegEx = re.compile(
        r"(<|&lt;)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(>|&gt;)", re.I)
    f = open(inFile,"r")
    lines = f.readlines()
    newLines = []
    for line in lines :
        line = HTMLRegEx.sub(' ',line)
        line = bSpace.sub('',line)
        line = nbsp.sub(' ',line)
        line = multiSpace.sub(' ',line)
        newLines.append(line)
    f.close()
    return newLines
Now, the main issue I'm looking at is with the multiSpace regex. When
applied, this removes some blank lines but not others. I don't want
it to remove any blank lines, just contiguous multiple spaces in a
line.
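A minimal sketch of what may be going on, assuming some lines reach
multiSpace with a space or tab just before the newline. The sample
strings and the spaces_only pattern are made up for illustration and
are not part of the function above. Since \s also matches "\n", a run
such as " \n" is replaced by a single space and the line break is
lost, while a bare "\n" is only one character and survives:

```python
import re

# Same pattern as in parseFile: \s matches spaces, tabs AND newlines.
multi_space = re.compile(r"\s\s+")

# Hypothetical alternative: only runs of two or more spaces or tabs.
spaces_only = re.compile(r"[ \t]{2,}")

# Made-up sample lines, as readlines() would return them.
samples = ["a  b\n", "\n", " \n", "text  \n"]

for s in samples:
    # Show the input, the result of the original pattern, and the
    # result of the spaces-and-tabs-only pattern side by side.
    print(repr(s), repr(multi_space.sub(" ", s)), repr(spaces_only.sub(" ", s)))
```

Restricting the pattern to spaces and tabs is one option; whether
other whitespace should be collapsed as well depends on the input.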
BTW, this also illustrates a difference between Python and Perl -- in
Perl, I can change "line" and it automatically changes the entry in
the array; this doesn't work in Python. A bit annoying, actually.
;-)
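A small sketch of that difference, using a made-up list rather than
the file contents: assigning to the loop variable only rebinds the
name, whereas assigning through an index (here via enumerate) updates
the list itself.

```python
lines = ["a  b\n", "c  d\n"]

# Rebinding the loop variable leaves the list untouched.
for line in lines:
    line = line.replace("  ", " ")
unchanged = list(lines)  # snapshot: still the original strings

# Assigning through the index updates the list in place.
for i, line in enumerate(lines):
    lines[i] = line.replace("  ", " ")
```

Building a new list, as parseFile does with newLines, is the other
common approach.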
Thanks for any help. If there's a better way to do this, I'm open to
suggestions in that regard, too.
mp
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor