On 08Jun2019 22:27, Sean Murphy <mhysnm1...@gmail.com> wrote:
Windows 10 OS, Python 3.6
Thanks for this.
I have a couple of queries in relation to extracting content using
regular expressions. I understand [...the regexp syntax...]
The challenge I am finding is getting a pattern to
extract specific word(s). Trying to identify the best method to use and how
to use the \1 when using forward and backward search pattern (Hoping I am
using the right term). Basically I am trying to extract specific phrases or
digits to place in a dictionary within categories. Thus if "ROYaL_BANK
123123123" is found, it is placed in a category called transfer funds. Other
might be a store name which likewise is placed in the store category.
I'll tackle your specific examples lower down, and make some
suggestions.
Note, I have found a logic error with "ROYAL_BANK 123123123", but that
isn't a concern. The extraction of the text is.
Line examples:
Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299
Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299
PAYMENT TO SARWARS-123123123
ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}
EFTPOS Amazon
PAY/SALARY FROM foo bar 123123123
PAYMENT TO Tax Man 666
Thanks.
Assuming the below is a cut/paste accident from some code:
result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ',
'ROYAL_BANK ', line)
r'ROYAL_BANK INTERNET BANKING TRANSFER Mouth in foot
And other similar structures. Below is the function I am currently using.
Not sure if the sub, match or search is going to be the best method. The
reason why I am using a sub is to delete the unwanted text. The
search/match/findall could do the same if I use a group. Also I have not
used any tests in the below and logically I think I should. As the code will
override the results if not found in the later tests. If there is a more
elegant way to do it than having:
if line.startswith('text string to match'):
    regular expression
elif line.startswith('text string to match'):
    regular expression
return result
There is. How far you take it depends on how variable your input is.
Banking statement data I would expect to have relatively few formats
(unless the banking/finance industry is every bit as fragmented as I
sometimes believe, in which case the structure might be less driven by
_your_ bank and instead arbitrarily garbled according to the various
other entities, due to getting ad hoc junk as the description).
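One way to avoid a long if/elif chain is a table of patterns tried in order. This is only a sketch against your example lines: the RULES table, the classify() function, the category names and the "detail" group name are all made up for illustration, and the patterns will need adjusting to your real data.

```python
import re

# (compiled pattern, category) pairs, tried in order.
# Categories and the "detail" group name are illustrative assumptions.
RULES = [
    (re.compile(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (?:TO|FROM) (?P<detail>\d+)', re.I),
     'transfer funds'),
    (re.compile(r'ROYAL_BANK INTERNET BANKING BPAY (?P<detail>.*) [{]', re.I),
     'bill payment'),
    (re.compile(r'EFTPOS (?P<detail>.+)', re.I),
     'store'),
    (re.compile(r'PAY/SALARY FROM (?P<detail>.*) \d+$', re.I),
     'income'),
    (re.compile(r'PAYMENT TO (?P<detail>.*?)[- ]\d+$', re.I),
     'payment'),
]

def classify(line):
    """Return (category, detail) for the first rule that matches, else None."""
    for pattern, category in RULES:
        m = pattern.match(line)
        if m:
            return category, m.group('detail')
    return None   # caller should report the unmatched line

print(classify("EFTPOS Amazon"))
print(classify("PAYMENT TO SARWARS-123123123"))
```

Adding a new statement format then means adding one row to the table rather than another elif branch.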
I would like to know. The different regular expressions I have used
are:
# this sometimes matches and sometimes does not. I want all the text up to
the from or to, to be replaced with "ROYAL_BANK". Ending up with ROYAL_BANK
123123123
result= re.sub(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ',
'ROYAL_BANK ', line)
Looks superficially ok. Got an example input line where it fails? Note
that the above is case sensitive, so if "to" etc can be in lower case
(as in your example text earlier) this will fail. See the re.I modifier.
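As a small sketch of the re.I suggestion, using one of your example lines (the variable names are mine):

```python
import re

line = "Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299"
pattern = r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) '

# Case sensitive: nothing matches, so re.sub returns the line unchanged.
unchanged = re.sub(pattern, 'ROYAL_BANK ', line)

# Case insensitive: the leading text is replaced.
result = re.sub(pattern, 'ROYAL_BANK ', line, flags=re.I)
print(result)
```

Note that re.sub gives no sign that nothing matched; you just get the original string back, which may be why it looks like it "sometimes matches".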
# the below returns from STARWARS and it shouldn't. I should just get
STARWARS.
result = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)
Well, STARWARS seems misseplt above. And you should get a "match"
object, with "STARWARS" in .group(1).
So earlier you're getting a str in result, and here you're getting an
re.match object (or None for a failed match).
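To make the difference concrete, here is a sketch using your "PAYMENT TO" example line as posted (so with the SARWARS spelling). The variable names are mine, and I am guessing that what you saw was the whole match rather than group 1.

```python
import re

line = "PAYMENT TO SARWARS-123123123"

# re.sub always returns a str (the input itself if nothing matched).
as_str = re.sub(r'PAYMENT TO ', '', line)

# re.match returns a match object, or None when the pattern does not match.
m = re.match(r'PAYMENT TO (SARWARS)-\d+', line)
if m:
    whole = m.group(0)   # everything the pattern matched
    store = m.group(1)   # only the text captured by the first (...) group
else:
    whole = store = None

# With the trailing space from your pattern this line does not match at
# all, because nothing follows the digits.
m_trailing = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)
```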
# the below should (doesn't work the last time I tested it) should
return the words between the (.)
result = re.match(r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$', '\1',
line)
"should" what? It would help to see the input line you expect this to
match. And re.match is not an re.sub - it looks like you have these
confused here, based on the following '\1', line parameters.
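For reference, re.match takes (pattern, string) while re.sub takes (pattern, replacement, string). A sketch of both forms against your BPAY example line, assuming that is the kind of line you meant (variable names are mine):

```python
import re

line = "ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}"
pattern = r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$'

# re.match(pattern, string): read the captured text from the match object.
m = re.match(pattern, line)
biller = m.group(1) if m else None

# re.sub(pattern, repl, string): replace the whole match with group 1.
# The replacement is a raw string so that \1 is a backreference.
same_biller = re.sub(pattern, r'\1', line)
```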
# the below patterns should remove the text at the beginning of the string
result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ',
'ROYAL_BANK ', line)
result = re.sub(r'ROYAL_BANK INTERNET BANKING TRANSFER ', '', line)
result = re.sub(r'EFTPOS ', '', line)
Sure. Got an example line where this does not happen?
# The below does not work and I am trying to use the back or forward
search feature. Is this syntax wrong or the pattern wrong? I cannot work it out
from the information I have read.
result = re.sub(r'PAY/SALARY FROM (*.) \d+$', '\1', line)
result = re.sub(r'PAYMENT TO (*.) \d+', '\1', line)
You've got "*." You probably mean ".*"
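A sketch of the corrected calls against your example lines. One extra thing to watch, which I have added here: the replacement should be the raw string r'\1', because a plain '\1' is the single character chr(1) rather than a backreference.

```python
import re

line = "PAY/SALARY FROM foo bar 123123123"

# ".*" (any run of characters) rather than "*.".
# r'\1' in the replacement refers to the first (...) group.
payer = re.sub(r'PAY/SALARY FROM (.*) \d+$', r'\1', line)

payee = re.sub(r'PAYMENT TO (.*) \d+', r'\1', "PAYMENT TO Tax Man 666")

# Without the r prefix the replacement is a control character.
plain_is_control_char = ('\1' == '\x01')
```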
Main issues:
1: Your input data seems to be mixed case, but all your regexps are case
sensitive. They will not match if the case is different, e.g. "Royal_Bank"
vs "ROYAL_BANK", "to" vs "TO", etc. Use the re.I modifier to make your
regexps case insensitive.
2: You're using re.sub a lot. I'd be inclined to always use re.match and
to pull information from the match object you get back. Untested example
sketch:
m = re.match(r'(ROYAL_BANK|COMMONER_CREDIT_UNION) INTERNET BANKING FUNDS TFER '
             r'TRANSFER (\d+) TO (.*)', line)
if m:
    category = m.group(1)
    id_number = m.group(2)
    recipient = m.group(3)
else:
    m = re.match(.......)
    ... more tests here ...
        ...
        ...
    else:
        ... report unmatched line for further consideration ...
3: You use ".*" a lot. This is quite prone to matching too much. You
might find things like "\S+" better, which matches a single
nonwhitespace "word". It depends a bit on your input.
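To illustrate point 3, here is a sketch with a made-up line that has extra text after the number (the line and the variable names are hypothetical):

```python
import re

line = "PAYMENT TO Tax Man 666 ref 42"   # hypothetical input

# ".*" is greedy: it runs on to the last place where " \d+" can still match.
greedy = re.match(r'PAYMENT TO (.*) \d+', line).group(1)

# "\S+" stops at the first whitespace, so it captures exactly one word.
one_word = re.match(r'PAYMENT TO (\S+)', line).group(1)

print(repr(greedy))
print(repr(one_word))
```

The flip side is that "\S+" will only ever give you "Tax" for a two word name like "Tax Man", which is why the right choice depends on your input.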
Cheers,
Cameron Simpson <c...@cskk.id.au>