Lango, Trevor M. wrote: > I have two lists, not necessarily of the same length. List #1 has two > columns. List #2 has one column. I would like to do the following: > > Scan list #1 line by line. If a match for column #1 in list #1 is found > in list #2, extract the matching lines and put them in a new list (#3). > Otherwise, leave the contents of lists #1 and #2 as they are. > > If I expected the contents of the first column of each list to match > exactly (character for character) - this would be a simple task with C++ > or the like. However, the contents will not necessarily be perfectly > identical. I do believe they are nearly identical enough though to use > pattern matching via Perl or the like. Personally this is difficult for > me (as a Perl noob), I know how to scan through a file for a > pre-determined pattern - I don't understand how to scan through a file > for a pattern that is essentially given by a line in another file...? I > have not found anything in my reading of Perl documentation that > explains how to read a file and use its contents as an argument for the > pattern to search for in another file (suggestions on excellent Perl doc > sources appreciated also!). > > This is what the contents of the lists may look like: > > TALL0047A > TAL0047A > TAL047A > TAL47A > TA0047A > TA047A > TA47A > T0047A > T047A > T47A > T0047 > T047 > T47 > > Examples of matching: > > TALL0047A TALL047A match > TALL0047A TAL0047A not a match > TALL0047A TAL0470A not a match > > > The contents will always be one to four alpha characters followed by one > to four numeric characters possibly followed by one or two alpha > characters. > > A match would be defined as the following criteria being met: > > - The last one to four digits being identical (excluding leading zeroes) > - The first one to four letters being identical > > > It is absolutely imperative that any algorithm used does not produce > false positives - if a line is extracted as a match - it must without a > doubt actually be a match. It is not so critical if a possible match is > passed up. The lists will contain thousands or tens of thousands of > entries - just looking for a clever way to automate as much of the > process as possible. I expect to have to check a portion of the lists > by hand - I would simply appreciate reducing the number of lines that > have to be checked manually. Perl seems ideal - I'm just not savvy > enough with it (yet!) to make it work. I know there are some Perl gurus > lurking about in LUGOD so if any of you have a spare moment to lend this > some thought - thanks! > > > > Thank you in advance for any suggestions! > > - Trevor
Sounds like a good job for a SQL database. this algorithm is in a
language resembling Oracle PL/SQL.
create table one (col1 varchar(10));
create table two (col1 varchar(10), col2 varchar(100));
create or replace function matches(one varchar, other varchar) returns
boolean
begin
if regexp_replace(one,'^([A-Z])*0*([1-9][0-9]*)','\1') !=
regexp_replace(other,'^([A-Z])*0*([1-9][0-9]*)','\1') then
begin
return false;
end;
if regexp_replace(one,'^([A-Z])*0*([1-9][0-9]*)','\2') !=
regexp_replace(other,'^([A-Z])*0*([1-9][0-9]*)','\2') then
begin
return false;
end;
return true;
end;
/
select two.col2 from one,two where matches(one.col1,two.col1);
--
I usually have a GPG digital signature included as an attachment.
See http://www.gnupg.org/ for info about these digital signatures.
signature.asc
Description: OpenPGP digital signature
_______________________________________________ vox-tech mailing list [email protected] http://lists.lugod.org/mailman/listinfo/vox-tech
