Re: [Tutor] regular expression query

Cameron Simpson Sun, 09 Jun 2019 16:38:41 -0700

On 08Jun2019 22:27, Sean Murphy <[email protected]> wrote:

Windows 10 OS, Python 3.6


Thanks for this.

I have a couple of queries in relation to extracting content using regular expressions. I understand [...the regexp syntax...]

The challenge I am finding is getting a pattern to
extract specific word(s). Trying to identify the best method to use and how
to use the \1 when using forward and backward search pattern (Hoping I am
using the right term). Basically I am trying to extract specific phrases or
digits to place in a dictionary within categories. Thus if "ROYaL_BANK
123123123" is found, it is placed in a category called transfer funds. Other
might be a store name which likewise is placed in the store category.

I'll tackle your specific examples lower down, and make some suggestions.

Note, I have found a logic error with "ROYAL_BANK 123123123", but that isn't a concern. The extraction of the text is.


Line examples:
Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299
Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299
PAYMENT TO SARWARS-123123123
ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}
EFTPOS Amazon
PAY/SALARY FROM foo bar 123123123
PAYMENT TO Tax Man  666


Thanks.

Assuming the below is a cut/paste accident from some code:

 result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ', 
'ROYAL_BANK ', line)
 r'ROYAL_BANK INTERNET BANKING TRANSFER Mouth in foot

And other similar structures. Below is the function I am currently using.
Not sure if the sub, match or search is going to be the best method. The
reason why I am using a sub is to delete the unwanted text. The
search/match/findall could do the same if I use a group. Also I have not
used any tests in the below and logically I think I should. As the code will
override the results if not found in the later tests. If there is a more
elegant  way to do it then having:

if line.startswith('text string to match'):
   regular expression
elif line.startswith('text string to match'):
   regular expression
return result

There is. How far you take it depends on how variable your input is. Banking statement data I would expect to have relatively few formats (unless the banking/finance industry is every bit as fragmented as I sometimes believe, in which case the structure might be less driven by _your_ bank and instead arbitrarily garbled according to the various other entities, due to getting ad hoc junk as the description).

I would like to know. The different regular expressions I have used are:


# this sometimes matches and sometimes does not. I want all the text up to
the from or to, to be replaced with "ROYAL_BANK". Ending up with ROYAL_BANK
123123123

   result= re.sub(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ',
'ROYAL_BANK ', line)

Looks superficially ok. Got an example input line where it fails? Note that the above is case sensitive, so if "to" etc can be in lower case (as in your example text earlier) this will fail. See the re.I modifier.
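
To illustrate, here is a minimal sketch of that same substitution compiled with re.I, run against the two M-BANKING sample lines from your message. The names PREFIX_RE, lines and results are just mine for the example.

```python
import re

# Same pattern as above, compiled with re.I so that "Royal_bank",
# "to" and "FROM" all match regardless of case.
PREFIX_RE = re.compile(
    r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ', re.I)

lines = [
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299',
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299',
]

# Replace the leading text with a normalised bank name.
results = [PREFIX_RE.sub('ROYAL_BANK ', line) for line in lines]
for result in results:
    print(result)
```

Without re.I neither sample line matches, and re.sub quietly hands back the line unchanged, which may be why it looks like it "sometimes matches".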

# the below  returns from STARWARS and it shouldn't. I should just get
STARWARS.

   result = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)

Well, STARWARS seems misspelt above. And you should get a "match" object, with "STARWARS" in .group(1).

So earlier you're getting a str in result, and here you're getting an re.match object (or None for a failed match).
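
A small sketch of the difference, using your "PAYMENT TO" sample line with the spelling exactly as it appears there. I have assumed the line ends at the digits, as in the sample, so I dropped the trailing space after \d+; with that space the pattern would not match this line at all.

```python
import re

line = 'PAYMENT TO SARWARS-123123123'

# re.sub always returns a str (the unchanged input if nothing matched).
as_str = re.sub(r'PAYMENT TO ', '', line)

# re.match returns a match object, or None if the pattern does not match.
m = re.match(r'PAYMENT TO (SARWARS)-\d+', line)
name = m.group(1) if m else None

# The original pattern ends in a space, which this sample line lacks.
with_trailing_space = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)

print(repr(as_str))
print(repr(name))
print(with_trailing_space)
```

So the thing to print or store is m.group(1), not the match object itself.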

# the below should (doesn't work the last time I tested it) should return the words between the (.)
   result = re.match(r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$', '\1', 
line)

"should" what? It would help to see the input line you expect this to match. And re.match is not an re.sub - it looks like you have these confused here, based on the following '\1', line parameters.

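As a sketch of the two call shapes, assuming the BPAY sample line from earlier is the kind of input meant: re.match takes (pattern, string) and you read the group from the match object, while re.sub takes (pattern, replacement, string). The variable names are only for illustration.

```python
import re

line = 'ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}'
pattern = r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$'

# re.match(pattern, string): pull the captured text out of the match object.
m = re.match(pattern, line)
store_from_match = m.group(1) if m else None

# re.sub(pattern, repl, string): replace the whole match with group 1.
# The replacement is a raw string so that \1 reaches the re module intact.
store_from_sub = re.sub(pattern, r'\1', line)

print(store_from_match)
print(store_from_sub)
```

Both routes give the store name here; the match form has the advantage that a failed match shows up as None instead of an unchanged line.
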
# the below patterns should remove the text at the beginning of the string
   result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ', 
'ROYAL_BANK ', line)
   result = re.sub(r'ROYAL_BANK INTERNET BANKING TRANSFER ', '', line)
   result = re.sub(r'EFTPOS ', '', line)


Sure. Got an example line where this does not happen?

# The below does not work and I am trying to use the back or forward search feature. Is this syntax wrong or the pattern wrong? I cannot work it out
from the information I have read.

    result = re.sub(r'PAY/SALARY FROM (*.) \d+$', '\1', line)
   result = re.sub(r'PAYMENT TO (*.) \d+', '\1', line)


You've got "*." You probably mean ".*"
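
A quick sketch of both problems on your PAY/SALARY sample line. "(*.)" does not compile at all, and separately a plain '\1' string literal is the single control character chr(1) in Python, so the replacement wants to be a raw string. The variable names are mine.

```python
import re

line = 'PAY/SALARY FROM foo bar 123123123'

# "(*.)" is rejected when the pattern is compiled: "*" has nothing to repeat.
try:
    re.compile(r'PAY/SALARY FROM (*.) \d+$')
    broken_compiles = True
except re.error:
    broken_compiles = False

# In a non-raw string literal, '\1' is an octal escape for chr(1),
# not a backreference. r'\1' keeps the backslash for the re module.
octal_escape = '\1'

# Corrected pattern: ".*" inside the group, raw-string replacement.
payer = re.sub(r'PAY/SALARY FROM (.*) \d+$', r'\1', line)

print(broken_compiles, repr(octal_escape), repr(payer))
```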

Main issues:

1: Your input data seems to be mixed case, but all your regexps are case sensitive. They will not match if the case is different eg "Royal_Bank" vs "ROYAL_BANK", "to" vs "TO", etc. Use the re.I modifier to make your regexps case insensitive.

2: You're using re.sub a lot. I'd be inclined to always use re.match and to pull information from the match object you get back. Untested example sketch:


 m = re.match(
   r'(ROYAL_BANK|COMMONER_CREDIT_UNION) INTERNET BANKING FUNDS TFER '
   r'TRANSFER (\d+) TO (.*)', line)
 if m:
   # groups are numbered by their opening parenthesis
   category = m.group(1)
   id_number = m.group(2)
   recipient = m.group(3)
 else:
   m = re.match(.......)
   if m:
     ... more tests here ...
     ...
     ...
   else:
     ... report unmatched line for further consideration ...

3: You use ".*" a lot. This is quite prone to matching too much. You might find things like "\S+" better, which matches a single nonwhitespace "word". It depends a bit on your input.
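
Pulling the three points together, one possible shape (a sketch only, not the one true way) is a table of (category, compiled regexp) pairs tried in order, so there is no long if/elif chain. Everything here is an assumption for illustration: the RULES table, the classify function, the "detail" group name, and the "salary" and "payment" categories; only "transfer funds" and "store" come from your description. Because names like "Kangaroo Store" contain spaces I have used a non-greedy ".+?" anchored by what follows instead of "\S+"; for single-word fields "\S+" would be tighter.

```python
import re

# Hypothetical (category, compiled pattern) table; the first match wins.
# Each pattern captures the text worth keeping in a group named "detail".
RULES = [
    ('transfer funds', re.compile(
        r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (?:TO|FROM) (?P<detail>\d+)$',
        re.I)),
    ('store', re.compile(
        r'ROYAL_BANK INTERNET BANKING BPAY (?P<detail>.+?) \{\d+\}$', re.I)),
    ('store', re.compile(r'EFTPOS (?P<detail>.+)$', re.I)),
    ('salary', re.compile(r'PAY/SALARY FROM (?P<detail>.+?) \d+$', re.I)),
    ('payment', re.compile(r'PAYMENT TO (?P<detail>.+?)[- ]+\d+$', re.I)),
]


def classify(line):
    """Return (category, detail) for the first matching rule, else (None, line)."""
    line = line.strip()
    for category, regexp in RULES:
        m = regexp.match(line)
        if m:
            return category, m.group('detail')
    # Unmatched: hand the line back so it can be reported and looked at.
    return None, line


if __name__ == '__main__':
    for sample in (
        'Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299',
        'ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}',
        'EFTPOS Amazon',
        'PAY/SALARY FROM foo bar 123123123',
        'PAYMENT TO Tax Man  666',
    ):
        print(classify(sample))
```

Adding a new statement format then means adding one row to the table, and anything that falls through comes back with a category of None for further consideration.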


Cheers,
Cameron Simpson <[email protected]>
_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor