RE: looking for faster Ideas...

Ian McGowan Tue, 27 Jan 2004 15:55:32 -0800

http://aspell.sourceforge.net/metaphone/metaphone.basic


soundex is pathetic - nowadays, metaphone is much better.

if you're feeling perl'ish

http://www.foo.be/docs/tpj/issues/vol5_3/tpj0503-0009.html

has an interesting discussion of using several approximate methods for
identifying records by name.  it even discusses the betty/elizabeth,
jack/john problem...  looks slow so you would probably have to cache the
results. c'mon there must be *something* unique in the file they send!
:-)
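
For reference, here is a minimal sketch of classic American Soundex, written in Python rather than the BASIC used elsewhere in this thread. It keeps the first letter and encodes up to three following consonant sounds, which is why it is so coarse. The function name and the table layout are illustrative choices, not taken from any U2 library.

```python
# Letter groups for classic American Soundex: group n is coded as digit n.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex key for name ('' if it has no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(ch, "")        # vowels (and Y) have no code
        if code and code != prev:
            out.append(code)
        prev = code                     # a vowel resets prev, allowing a repeat
    return ("".join(out) + "000")[:4]   # pad or truncate to letter + 3 digits


print(soundex("Johnson"), soundex("Jonsson"))
```

Both names come out with the same key, so a key like this would catch the Johnson/Jonsson misspelling mentioned further down the thread, but it does nothing for betty/elizabeth.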

On Tue, 2004-01-27 at 14:32, George Gallen wrote:
> I thought of that, but soundex only works on the first three letters, if
> I remember correctly.
> or it only encodes the first three letters, then remaining are
> unchanged.
>  
> The main problem is I can't isolate a last name from the source, it
> comes in as a full name,
> and if I use the full name as given to us by the consumer, there is a
> chance it won't be in
> the same exact format as in the file from the rental, might be missing
> the middle initial
> one may have a married hyphenated name, one could be a shortened or
> different first name
> (e.g. betty instead of elizabeth, or jack instead of john, etc.).
>  
> Since my original was a list of if/thens, looks like I'm not going
> to be able to gain much
> in speed any other way with straight programming (that is no temp files,
> or files to bounce off).
>  
> George
> 
> -----Original Message-----
> From: Jeff Schasny [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 27, 2004 5:12 PM
> To: U2 Users Discussion List
> Subject: RE: looking for faster Ideas...
> 
> 
> I suppose you could soundex the whole thing
> 
> -----Original Message-----
> From: Geoffrey Mitchell [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 27, 2004 2:59 PM
> To: U2 Users Discussion List
> Subject: RE: looking for faster Ideas...
> 
> 
> We do something like this, using a "match code" composed of fragments of
> data concatenated together.  I think we use a delimiter, but you
> wouldn't need to.
> 
> So, if you want to match Johnson in zipcode 12345 on Maple street, you
> might have a matchcode of "JOHNSON*12345*MAPLE", so you would extract
> the relevant fields, build the matchcode and check it against a list or
> file.  Actually, we use an I-type dictionary to generate the matchcode,
> and have an index built on it.  For small datasets this may be *slower*
> than your case statement, but I would think that it would be easier to
> maintain, and for large datasets it should be quicker since the time to
> construct the matchcode and do a read, selectindex, or whatever would be
> constant.  Of course, if you have a Jonsson that gets spelled Johnson,
> you're going to have problems no matter how you approach it.
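
A minimal Python sketch of the matchcode idea described above: build one key from the extracted fragments and test it against a set, so the cost per record stays roughly constant as the exclusion list grows. The function names, the field order and the sample entry are assumptions for illustration; the I-type dictionary and index mentioned above are not modelled here.

```python
def matchcode(last_name, zip_code, street):
    """Join normalised fragments with '*', e.g. "JOHNSON*12345*MAPLE"."""
    parts = (last_name, zip_code, street)
    return "*".join(p.strip().upper() for p in parts)


# Hypothetical exclusion list; in practice it would be loaded from a file.
EXCLUDED = {
    matchcode("Johnson", "12345", "Maple"),
}


def is_excluded(last_name, zip_code, street):
    """True when the record's matchcode is on the exclusion list."""
    return matchcode(last_name, zip_code, street) in EXCLUDED


print(is_excluded("johnson ", "12345", "maple"))
```

As noted above, an exact key like this still misses a Jonsson spelled Johnson.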
> 
> On Tue, 2004-01-27 at 13:05, George Gallen wrote: 
> 
> I can't just check for names, it has to be a name with a specific zip code
> and if the name is fairly common, we also add in part of the address to
> make sure no one else is weeded out that shouldn't be.
> 
> I suppose I could keep two or three arrays, do a specific lookup in each
> saving the position, and if all three positions are identical (assuming
> all three arrays have the name, address, zip in the same order) then that
> would be a match....Thanks
> 
> George
> 
> >-----Original Message-----
> >From: Jeff Schasny [mailto:[EMAIL PROTECTED]
> >Sent: Tuesday, January 27, 2004 1:51 PM
> >To: U2 Users Discussion List
> >Subject: RE: looking for faster Ideas...
> >
> >
> >how about keeping a list of excluded names as a record in a file (or as a
> >flat file in a directory with each name/item/whatever on a line) and
> >reading it into the program as a dynamic array then doing a locate on the
> >string in question.  Something like this:
> >
> >
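> >* ALIST: exclusion list, one name per attribute (SOME.ID is a placeholder id)
> >READ ALIST FROM AFILE, SOME.ID ELSE STOP
> >* INLIST: the names being checked, assumed already loaded, one per attribute
> >X = 0
> >LOOP
> >   X += 1
> >   ASTRING = INLIST<X>
> >UNTIL ASTRING = ''
> >   LOCATE ASTRING IN ALIST SETTING POS THEN
> >      * found at POS: the name is on the exclusion list, handle it here
> >      NULL
> >   END ELSE
> >      * not found: keep the record
> >      NULL
> >   END
> >REPEAT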
> >
> >Of course, if you really want speed, then sort the list and use a BY
> >clause in the LOCATE
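
The sorted-list suggestion above has a direct Python analogue: keep the exclusion list sorted and binary-search it with the standard `bisect` module. This is only a sketch of the general technique; it makes no claim about how LOCATE with a BY clause searches internally, and the names in the list are made up.

```python
from bisect import bisect_left


def in_sorted(sorted_names, name):
    """Binary search: True when name is present in the sorted list."""
    pos = bisect_left(sorted_names, name)
    return pos < len(sorted_names) and sorted_names[pos] == name


# Hypothetical exclusion list, sorted once up front.
excluded = sorted(["SMITH", "JOHNSON", "BAKER"])

for candidate in ("JOHNSON", "JONES"):
    print(candidate, in_sorted(excluded, candidate))
```

Sorting costs something once, but each lookup afterwards takes a number of comparisons that grows with the logarithm of the list length rather than with the length itself.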
> >
> >-----Original Message-----
> >From: George Gallen [mailto:[EMAIL PROTECTED]
> >Sent: Tuesday, January 27, 2004 11:33 AM
> >To: 'Ardent List'
> >Subject: looking for faster Ideas...
> >
> >
> >I can't set up any indexes to speed this up. Basically I'm scanning a CSV
> >file for names to remove and set the flag of KICK=1 to remove it (creating
> >a new CSV file at the same time).
> >
> >Keep in mind the ".." are people's last names, or zip codes, or part of
> >their address; I changed them to ".." to protect the unwanting...
> >
> >Right now, I do a series of CASE's ...
> >Now, it's not a major problem as I'm only checking for 20 or so names, but
> >as more and more people request to be removed (and we don't have access to
> >the creation of the list), this could get quite slow over 50 or 60 thousand
> >lines of checking.
> >
> >LIN is one line of the CSV file, the INDEX is checking for a last name & a
> >zip code and sometimes part of the address line.
> >
> >Any Ideas?
> >
> >Remember, we can't change the source of the file, it will always be a CSV,
> >being read line by line
> >
> >   KICK=0
> >   BEGIN CASE
> >      CASE -1
> >         KICK=1
> >         BEGIN CASE
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
> >            CASE -1
> >               KICK=0
> >         END CASE
> >   END CASE
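
One way to keep this maintainable as the list grows is to move the fragments out of the CASE chain and into data. The Python sketch below shows the shape of that: each rule is a tuple of fragments that must all appear in the line, which is the same substring test as the INDEX(...)#0 checks above. The rule contents and function names are invented for illustration, and in practice the rules would be read from a file.

```python
# Each rule lists fragments that must ALL occur in the CSV line.
# Hypothetical values standing in for the ".." placeholders above.
RULES = [
    ("JOHNSON", "12345"),
    ("SMITH", "54321", "MAPLE"),
]


def should_kick(line, rules=RULES):
    """True when every fragment of at least one rule occurs in line."""
    return any(all(fragment in line for fragment in rule) for rule in rules)


def filter_lines(lines, rules=RULES):
    """Return only the lines that no rule matches."""
    return [line for line in lines if not should_kick(line, rules)]


sample = [
    '"BETTY JOHNSON","1 OAK ST","12345"',
    '"JACK SMITH","9 ELM ST","54321"',
]
print(filter_lines(sample))
```

This still checks every rule against every line, so it is no faster than the CASE chain; what it buys is that adding a name means adding a line of data rather than editing the program.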
> >
> >George Gallen
> >Senior Programmer/Analyst
> >Accounting/Data Division
> >[EMAIL PROTECTED]
> >ph:856.848.1000 Ext 220
> >
> >SLACK Incorporated - An innovative information, education and management
> >company
> >http://www.slackinc.com
> >
> >_______________________________________________
> >u2-users mailing list
> >[EMAIL PROTECTED]
> >http://www.oliver.com/mailman/listinfo/u2-users
-- 
Ian McGowan <[EMAIL PROTECTED]>
