I wonder if anyone can help me with an RE. I also wonder if there is an RE mailing list anywhere - I haven't managed to find one.
I'm trying to use this regular expression to delete particular strings from a file before tokenising it.
I want to delete every string that contains a full stop (period), unless the full stop is at the beginning or end of a word or is followed by a closing bracket. That means deleting file names (e.g. fileX.doc) and websites (when www/http is not given), but not file extensions (e.g. "this is in .jpg format"). I also don't want to delete the last word of a sentence just because it precedes a full stop, or a word whose full stop is followed by a closing bracket.
fullstopRe = re.compile (r'\S+\.[^)}]]+')
There are two problems with this:
- The ] inside the [] group must be escaped, like this: [^)}\]]
- [^)}\]] also matches whitespace, so the pattern will match a full stop at the end of a word (the space after it satisfies the class)
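To see both problems in action, here is a small sketch. The sample strings are made up for illustration; only the two patterns come from this thread.

```python
import re

# Pattern from the question: the first ] closes the class early, so this
# reads as \S+ \. [^)}] followed by one or more literal ] characters.
original = re.compile(r'\S+\.[^)}]]+')

# Same pattern with the ] escaped inside the class.
escaped = re.compile(r'\S+\.[^)}\]]+')

text = "see fileX.doc now. Then stop."   # made-up sample

# No literal ] in the text, so the original pattern finds nothing.
print(original.findall(text))

# The escaped class still accepts whitespace, so the match runs on
# across the following words.
print(escaped.findall(text))
```

The second result is one long match starting at fileX.doc, which is why whitespace has to be excluded from the class as well.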
It's not clear from your description whether the closing bracket must immediately follow the full stop or can be anywhere after it. If it must follow immediately, then use
\S+\.[^)}\]\s]\S*
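A quick way to try this one out, assuming a made-up sample sentence (the variable names are just for illustration):

```python
import re

# Full stop must be followed by something other than ), }, ] or whitespace.
pattern = re.compile(r'\S+\.[^)}\]\s]\S*')

sample = "Open fileX.doc or visit example.com (see notes.) It is in .jpg format."

# File name and website are found; "notes.)", ".jpg" and "format." are not.
print(pattern.findall(sample))

# Deleting the matches leaves the surrounding spaces behind.
print(pattern.sub('', sample))
```

Note that the removed words leave double spaces, which probably doesn't matter if the text is tokenised on whitespace afterwards.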
If you want to allow the bracket anywhere after the stop, you must force the match to run to a word boundary; otherwise you will match foo.bar when the word is foo.bar]. I think this works:
(\S+\.[^)}\]\s]+)(\s)
but you have to include the second group in your substitution string (for example r'\2'), otherwise the whitespace after the word is deleted too.
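Something like the following should show the substitution, again with a made-up sample string. Using r'\2' as the replacement is one way to put the captured whitespace back.

```python
import re

# Group 1 is the word to delete, group 2 the whitespace that ends it.
pattern = re.compile(r'(\S+\.[^)}\]\s]+)(\s)')

sample = "Open fileX.doc now, not [foo.bar] here.\n"

# Replace each match with just its trailing whitespace (group 2).
result = pattern.sub(r'\2', sample)
print(repr(result))

# A word at the very end of the string has no whitespace after it,
# so this pattern does not match it.
print(pattern.search("fileX.doc"))
```

Here fileX.doc is removed while [foo.bar] and the final "here." are left alone. The last line is a caveat I noticed while testing: because the pattern requires trailing whitespace, a file name with nothing after it is not matched, so you may want to make sure the text ends with a newline.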
BTW C:\Python23\pythonw.exe C:\Python24\Tools\Scripts\redemo.py is very helpful with questions like this...
Kent
_______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor