Alex Feddor wrote:
Hi
I am looking for a method that enables advanced text string search. The
string.find() method and the re module do not seem to support what I am
looking for.
The idea is as follows:
Text ="FDA meeting was successful. New drug is approved for whole sale
distribution!"
I would like to scan the text using AND and OR operators and get -1 or
another value if the search terms are not found in the text.
Example 01:
search criteria: "FDA" AND ( "approve*" OR "supported")
The catch is that in the Text variable the words FDA and approve are not
adjacent (other words come in between).
Bring on your hardest searches...
class Pattern(object): pass
class Logical(Pattern):
    def __init__(self, pat1, pat2):
        self.pat1 = pat1
        self.pat2 = pat2
    def __call__(self, text):
        a, b = self.pat1(text), self.pat2(text)
        if self.op(a != len(text), b != len(text)):
            return min((a, b))
        return len(text)
    def __str__(self):
        return '(%s %s %s)' % (self.pat1, self.op_name, self.pat2)
class P(Pattern):
    def __init__(self, pat):
        self.pat = pat
    def __call__(self, text):
        ret = text.find(self.pat)
        return ret if ret != -1 else len(text)
    def __str__(self):
        return '"%s"' % self.pat
class NOT(Pattern):
    def __init__(self, pat):
        self.op_name = 'NOT'
        self.pat = pat
    def __call__(self, text):
        ret = self.pat(text)
        return ret - 1 if ret == len(text) else len(text)
    def __str__(self):
        return '%s (%s)' % (self.op_name, self.pat)
class XOR(Logical):
    def __init__(self, pat1, pat2):
        self.op_name = 'XOR'
        self.op = lambda a, b: not(a and b) and (a or b)
        super().__init__(pat1, pat2)
class OR(Logical):
    def __init__(self, pat1, pat2):
        self.op_name = 'OR'
        self.op = lambda a, b: a or b
        super().__init__(pat1, pat2)
class AND(Logical):
    def __init__(self, pat1, pat2):
        self.op_name = 'AND'
        self.op = lambda a, b: a and b
        super().__init__(pat1, pat2)
class Suite(object):
    def __init__(self, pat):
        self.pat = pat
    def __call__(self, text):
        ret = self.pat(text)
        return ret if ret != len(text) else -1
    def __str__(self):
        return '[%s]' % self.pat
pat1 = P('FDA')
pat2 = P('approve')  # str.find() takes '*' literally; 'approve' also matches 'approved'
pat3 = P('supported')
p = Suite(AND(pat1, OR(pat2, pat3)))
print(p(''))
print(p('FDA'))
print(p('FDA supported'))
print(p('supported FDA'))
print(p('blah FDA bloh supported blih'))
print(p('blah FDA bleh supported bloh supported blih '))
p = Suite(AND(OR(pat1, pat2), XOR(pat2, NOT(pat3))))
print(p)
print(p(''))
print(p('FDA'))
print(p('FDA supported'))
print(p('supported sdc FDA sd'))
print(p('blah blih FDA bluh'))
print(p('blah blif supported blog'))
#################
I guess I went a bit overboard here (had too much time on hand). It
works by function composition: instead of evaluating the expression
directly, you compose a function (or more accurately, a callable object)
that evaluates the logical value and returns the index of the first item
that matches the logical expression. It currently uses str's builtin
find(), but I guess it wouldn't be very hard to adapt it to use the
re-based myfind() below (only the P class will need to change).
The Suite class is only there to turn the NotFound sentinel from
len(text) to -1 (I used len(text) since it simplifies the code a lot...)
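As a rough sketch of that adaptation (RegexP is a hypothetical name, not
part of the code above), a regex-based leaf could keep the same
convention of returning len(text) when nothing matches. In the code
above it would subclass Pattern; here it stands alone so the snippet
runs by itself:

```python
import re

class RegexP:
    """Hypothetical regex-based replacement for P (an assumption, not the original code)."""
    def __init__(self, pat):
        self.pat = pat
        self.regex = re.compile(pat)
    def __call__(self, text):
        m = self.regex.search(text)
        # Same convention as P: index of the match, or len(text) if not found.
        return m.start() if m else len(text)
    def __str__(self):
        return '/%s/' % self.pat

text = ("FDA meeting was successful. New drug is approved for whole sale "
        "distribution!")
# r'\bapprove\w*' plays the role of the "approve*" wildcard from the question.
approve = RegexP(r'\bapprove\w*')
print(approve(text))                     # index where "approved" starts
print(approve('nothing relevant here'))  # len(text), i.e. the NotFound sentinel
```

Because the return convention is unchanged, such a class should drop into
AND/OR/XOR and Suite without touching them.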
Caveat: The NOT class cannot reliably convert a False to True because I
don't know what index number to use.
The code is written to save vertical space, so it's not the most
readable in the world.
No guarantee that it is bug-free.
Idea:
Override the operators on the Pattern class so we could write it like:
P("Hello") & P("World") instead of AND(P("Hello"), P("World"))
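A minimal sketch of that idea, assuming a slightly reshaped Logical that
takes the operator as an argument (that reshaping is mine, only to keep
the example short and self-contained). NOT is left out because of the
caveat mentioned earlier:

```python
class Pattern:
    # Each operator hook builds the same kind of composite object
    # that an explicit AND(...) / OR(...) / XOR(...) call would build.
    def __and__(self, other):
        return Logical('AND', lambda a, b: a and b, self, other)
    def __or__(self, other):
        return Logical('OR', lambda a, b: a or b, self, other)
    def __xor__(self, other):
        return Logical('XOR', lambda a, b: a != b, self, other)

class Logical(Pattern):
    def __init__(self, op_name, op, pat1, pat2):
        self.op_name, self.op = op_name, op
        self.pat1, self.pat2 = pat1, pat2
    def __call__(self, text):
        a, b = self.pat1(text), self.pat2(text)
        # len(text) is the NotFound sentinel, as in the original code.
        if self.op(a != len(text), b != len(text)):
            return min(a, b)
        return len(text)
    def __str__(self):
        return '(%s %s %s)' % (self.pat1, self.op_name, self.pat2)

class P(Pattern):
    def __init__(self, pat):
        self.pat = pat
    def __call__(self, text):
        ret = text.find(self.pat)
        return ret if ret != -1 else len(text)
    def __str__(self):
        return '"%s"' % self.pat

expr = P('FDA') & (P('approve') | P('supported'))
text = 'blah FDA bloh supported blih'
print(expr)        # ("FDA" AND ("approve" OR "supported"))
print(expr(text))  # 5
```

Note that & binds tighter than | in Python, so the parentheses around the
OR part are still needed.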
Example 02:
search criteria: "Ben"
The catch is that the code should find only the exact word Ben, not
words whose first three letters are Ben, such as Benquick, Benseek etc.
Only Ben is the word we are looking for.
The second one was easier...
import re
def myfind(pattern, text):
    pattern = r'(.*?)\b(%s)\b(.*)' % pattern
    m = re.match(pattern, text)
    if m:
        return len(m.group(1))
textfound = 'This is a Ben test string'
texttrick = 'This is a Benquick Benseek McBen QuickBenSeek string'
textnotfound = 'He is away'
textmulti = 'Our Ben found another Ben which is quite odd'
pat = 'Ben'
print(myfind(pat, textfound)) # 10
print(myfind(pat, texttrick)) # None
print(myfind(pat, textnotfound)) # None
print(myfind(pat, textmulti)) # 4
If you only want to test for existence, simply:
pattern = 'Ben'
if re.match(r'(.*?)\b(%s)\b(.*)' % pattern, text):
    pass
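A slightly more defensive variant of the existence test, as a sketch
(contains_word is a hypothetical helper name). It uses re.search(), so
the pattern does not need the leading (.*?), and re.escape() so that
characters like '.' or '*' in the word are taken literally:

```python
import re

def contains_word(word, text):
    """Hypothetical helper: True if `word` occurs as a whole word in `text`."""
    # re.escape() keeps regex metacharacters in `word` literal.
    return re.search(r'\b%s\b' % re.escape(word), text) is not None

print(contains_word('Ben', 'This is a Ben test string'))  # True
print(contains_word('Ben', 'This is a Benquick Benseek McBen QuickBenSeek string'))  # False
```

One practical difference: '.' does not match a newline by default, so the
re.match() version misses a word that only appears after the first line
of a multi-line text, while re.search() scans the whole string.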
I would really appreciate your advice (code samples or links) on how the
above can be achieved! If possible, I would appreciate a solution that
uses a free-of-charge module.
Standard library is free of charge, no?
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor