On Tue, Aug 10, 2010 at 02:08:28AM +0100, Martin Gregorie wrote:
> On Mon, 2010-08-09 at 17:42 -0700, jdow wrote:
> > From: "Martin Gregorie" <mar...@gregorie.org>
> > > Something like this will match a sequence of two capitalised name words,
> > > including hyphenated ones, and extract the name words:
> > >
> > > /([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/
> > >
> > > and should be fairly easy to extend to deal with initials and/or more
> > > than one forename. Tested in Python and should also work in Perl.
> >
> > That solves the Reginald Slovotniksky type names. But, "John Smith"? Dunno.
> >
> The regex I showed will return 'John' and 'Smith' so the combo can be
> queried in the database, which is all I set out to try. However, I was
> trying to generalise as a regex that would match two or more Capitalised
> Names and return them as an array of group values but I couldn't work
> out how to do that without writing a rather tedious set of ever longer
> alternates. If anybody knows how to do that without resorting to
> alternatives I'd be fascinated to know how you do that.
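On the "two or more names" question, one possible workaround (only a sketch, not something I benchmarked) is to stop trying to get one group per name and instead match the whole run of capitalised words as a single string, then split it. The names NAME_PAIR, NAME_RUN, name_pair and name_runs below are made up for illustration; the first pattern is the quoted one, the second is my assumption about how it could be extended.

```python
import re

# The two-word pattern quoted above, unchanged.
NAME_PAIR = re.compile(r"([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})")

# Assumption: handle "two or more" names by matching the whole run of
# capitalised words as one string and splitting it afterwards, instead of
# writing ever longer alternates with one group per name.
NAME_RUN = re.compile(r"[A-Z][-a-zA-Z]{1,20}(?:\s[A-Z][-a-zA-Z]{1,20})+")


def name_pair(text):
    """Return the first (forename, surname) pair found, or None."""
    m = NAME_PAIR.search(text)
    return m.groups() if m else None


def name_runs(text):
    """Return each run of two or more capitalised words as a list of words."""
    return [m.group(0).split() for m in NAME_RUN.finditer(text)]


if __name__ == "__main__":
    print(name_pair("mail from John Smith today"))
    print(name_runs("mail from Mary Anne Lloyd-Jones today"))
```

Note that any capitalised word counts, so "Dear John Smith" comes back as one three-word run; the caller still has to decide which words are really names.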
OK, I did some more testing, since this is an interesting experiment.

I dumped 15000 mail bodies into a file, as SA sees them, and fed it to a simple Perl script. Runtime for the different methods (memory used, including Perl itself):

- Single 70000-name regex: 20s (8MB)
- 7 regexes of 10000 names each: 141s (9MB)
- "Martin style", lookups from a Perl hash: 8s (12MB)

So it seems a single regex is much preferable to a few smaller ones, though creating it with Regexp::Assemble required 250MB of memory.

Looking at this, I would go for the generic regex and test all matches against names stored in a Perl hash. The average count of "names" to check per message was around 100, so using SQL directly would be inefficient, though possible.

Anyway, I concur that with so many names you would probably get lots of FPs: identical doctors, friends, "john doe wrote:" etc.
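For anyone who wants to try the "generic regex plus hash lookup" idea, here is a rough sketch in Python (since the regex was tested there); a Perl version would use a hash in the same way. This is not the script I timed. KNOWN_NAMES, CANDIDATE and known_names_in are hypothetical names, the sample data is invented, and wrapping the pattern in a lookahead so that overlapping pairs are also checked is my own assumption.

```python
import re

# Generic candidate pattern: two capitalised words, as in the quoted regex.
# Assumption: it is wrapped in a lookahead so overlapping pairs are tried
# too ("Dear John" and "John Smith"); a plain scan would consume "Dear John"
# and never test "John Smith".
CANDIDATE = re.compile(
    r"(?=\b([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20}))"
)

# Hypothetical sample data standing in for the real name list.
KNOWN_NAMES = {("John", "Smith"), ("Reginald", "Slovotniksky")}


def known_names_in(body, known=KNOWN_NAMES):
    """Return the candidate (forename, surname) pairs that are in `known`."""
    hits = []
    for m in CANDIDATE.finditer(body):
        pair = m.groups()
        if pair in known:  # set membership, like a Perl hash lookup
            hits.append(pair)
    return hits


if __name__ == "__main__":
    text = "Dear John Smith, your doctor Reginald Slovotniksky wrote"
    print(known_names_in(text))
```

The regex stays small and fixed however long the name list gets; only the lookup table grows, which fits the memory and runtime pattern in the numbers above.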