[Tutor] regex problem

Michael Powe Tue, 04 Jan 2005 15:41:07 -0800

Hello,

I'm having erratic results with a regex.  I'm hoping someone can
pinpoint the problem.


This function removes HTML formatting codes from a text email that is
poorly exported -- it is supposed to be a text version of an HTML
mailing, but it's basically just a text version of the HTML page.  I'm
not after anything elaborate, but it has gotten to be a bit of an
itch.  ;-)

def parseFile(inFile) :
    import re
    bSpace = re.compile("^ ")
    multiSpace = re.compile(r"\s\s+")
    nbsp = re.compile(r"&nbsp;")
    HTMLRegEx = re.compile(
        r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)",
        re.I)

    f = open(inFile,"r")
    lines = f.readlines()
    newLines = []
    for line in lines :
        line = HTMLRegEx.sub(' ',line)
        line = bSpace.sub('',line)
        line = nbsp.sub(' ',line)
        line = multiSpace.sub(' ',line)
        newLines.append(line)
    f.close()
    return newLines

Now, the main issue I'm looking at is with the multiSpace regex.  When
applied, this removes some blank lines but not others.  I don't want
it to remove any blank lines, just contiguous multiple spaces in a
line.
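A minimal sketch of what may be going on, assuming each line from
readlines() still carries its trailing "\n": \s matches newlines as
well as spaces, so any run of whitespace that reaches the end of the
line gets swallowed together with the newline.  The spaces_only
pattern below is a hypothetical alternative, not part of the function
above; it restricts the match to spaces and tabs.

```python
import re

# Original pattern: \s matches spaces, tabs AND newlines.
multi_space = re.compile(r"\s\s+")
# Hypothetical alternative: only runs of two or more spaces/tabs.
spaces_only = re.compile(r"[ \t]{2,}")

# A bare blank line, a blank line holding two spaces, a normal line,
# and a line with trailing spaces.
samples = ["\n", "  \n", "a  b\n", "text  \n"]

for s in samples:
    # Show the input, the result of the original pattern, and the
    # result of the space-and-tab-only pattern side by side.
    print(repr(s),
          repr(multi_space.sub(" ", s)),
          repr(spaces_only.sub(" ", s)))
```

A bare "\n" is a single whitespace character, so the original pattern
leaves it alone, while a "blank" line that still holds spaces (or had
tags or &nbsp; replaced by spaces) loses its newline.  That would
explain why only some blank lines vanish.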

BTB, this also illustrates a difference between Python and Perl -- in
Perl, I can change "line" and it automatically changes the entry in
the array; this doesn't work in Python.  A bit annoying, actually.
;-)
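For comparison, a small sketch of two ways to get a similar effect in
Python, assuming a plain list of strings.  Strings are immutable and
the loop variable is only a name bound to each item, so the list
entry has to be reassigned by index or the list rebuilt.  The sample
data and the collapse() helper are made up for illustration.

```python
import re

# Hypothetical helper: collapse runs of spaces/tabs into one space.
spaces_only = re.compile(r"[ \t]{2,}")

def collapse(line):
    return spaces_only.sub(" ", line)

lines = ["a  b\n", "c\n", "d   e\n"]

# Option 1: update the list in place by index.
for i, line in enumerate(lines):
    lines[i] = collapse(line)

# Option 2: build a new list with a list comprehension.
rebuilt = [collapse(line) for line in ["a  b\n", "c\n", "d   e\n"]]
```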

Thanks for any help.  If there's a better way to do this, I'm open to
suggestions in that regard, too.

mp