RE: looking for faster Ideas...
I tried to compile this and found errors:
1. Line 121 is missing a ")" - I stuck it in just before the THEN.
2. Line 68 is truncated - I added: E[N +1,1] = "N" THEN SILENT = 1

I moved back to the lesser URL and found a description in "C" code to help with the above. The writers are aware of the truncations etc.; the BASIC code was put up UNCHANGED, then they worked the "C" code with the described algorithms.

I then switched the variables in the subroutine call statement at line 1 to (METAPH, NAME), created an "I" descriptor SUBR("MTAPHON", LNAME), viewed the items it created, and built an index on the MTAPHON field. It seems to work even with my "bad fix". I then experimented by changing "4" to "6" in line 23 -- FOR N = 1 TO L WHILE LEN(METAPH) < 6

Rich Sias, DBA
Keystone Mercy Health Plan
215-937-8860

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ian McGowan
Sent: Tuesday, January 27, 2004 5:54 PM
To: U2 Users Discussion List
Subject: RE: looking for faster Ideas...

http://aspell.sourceforge.net/metaphone/metaphone.basic

soundex is pathetic - nowadays, metaphone is much better. if you're feeling perl'ish, http://www.foo.be/docs/tpj/issues/vol5_3/tpj0503-0009.html has an interesting discussion of using several approximate methods for identifying records by name. it even discusses the betty/elizabeth, jack/john problem... looks slow, so you would probably have to cache the results.

c'mon, there must be *something* unique in the file they send! :-)

On Tue, 2004-01-27 at 14:32, George Gallen wrote:
> I thought of that, but soundex only works on the first three letters, if
> I remember correctly.
> Or it only encodes the first three letters, then the remaining are
> unchanged.
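For anyone wanting to try the same thing, here is roughly what the dictionary item and the index build look like. This is only a sketch: it assumes MTAPHON is the globally cataloged subroutine compiled from metaphone.basic, returning its result through the first argument, and that LNAME is the dictionary field holding the last name; the format and heading are placeholders.

MTAPHON (I-type dictionary item in the customer file):
0001: I
0002: SUBR("MTAPHON", LNAME)
0003:
0004: Metaphone
0005: 10L
0006: S

At TCL:
CREATE.INDEX CUSTFILE MTAPHON
BUILD.INDEX CUSTFILE MTAPHON

Then in BASIC, compute the same code for an incoming name and select on the index:
CALL MTAPHON(KEY, CALLER.NAME)
CMD = 'SELECT CUSTFILE WITH MTAPHON = "' : KEY : '"'
EXECUTE CMD

Because both sides go through the same subroutine, a misspelled surname from a phone call can still land on the same index key as the rented-list entry.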
RE: looking for faster Ideas...
I might not have made myself clear. If you have 10,000 names that you want removed, you put them into a hash file, and then process through the CSV file and attempt to read the item from the hash file based on the criteria (i.e. name). That is a few reads per line, if ordering does not matter. Otherwise you could potentially have to do 10,000 case statements (multiples more if order matters) for each name.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George Gallen
Sent: Tuesday, January 27, 2004 1:00 PM
To: 'U2 Users Discussion List'
Subject: RE: looking for faster Ideas...

Mike, doing what you propose would require a massive file to start with, and would require a crap load of disk reads, which would be far slower than a bunch of cases, and the project isn't worth that kind of investment anyway. But thanks.

The source line would look something like:
"","jon c smith","1234 anywhere st","","","somecity","SS","12345-1254",""
I'm looking for "smith" & "12345" and sometimes "anywhere".

We may get a call from john smith (john, not jon, because they didn't spell their first name), who didn't leave their middle init and didn't give us their 9 digit zip, only the 5 digit zip. So I can't build any indexes. Searching for multiple pieces on the same line pretty much gives a fairly good matchup considering the source and match data aren't EXACTLY the same. And of course, I'm not going to go hog wild in doing this. Creating a temp file, parsing into dynamic arrays, loops and lookups... way too much; I'd rather just use PERL to pre-process.

-Original Message-
From: Mike Rajkowski [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 27, 2004 2:41 PM
To: U2 Users Discussion List
Subject: RE: looking for faster Ideas...

Create a temp file, and populate it with variations of the name in question (upcase and remove spaces), storing address information in each record. Then loop through your list, taking the name and parsing the various combinations of the words (John David Doe - JOHNDOE, DOEJOHN, JOHNDAVIDDOE, JOHNDOEDAVID), and attempt to read the item from the temp file. If it can read an item, then verify the address information; otherwise check the next item.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George Gallen
Sent: Tuesday, January 27, 2004 12:13 PM
To: 'U2 Users Discussion List'
Subject: RE: looking for faster Ideas...

In rethinking my take on that: that would still be difficult, since the arrays would only contain "parts" of the whole fields, making the searching of the arrays very difficult. We can't store the exact entry, since sometimes people will call and say stop sending me things and not give us the name the same way it's in the database we rent. Basically it takes the renting company a couple of months to remove the name, but we like to filter it immediately to stop anything from going out before the renting company removes it, and it also will catch it if the renting company replaces it a couple of months later....

George

-Original Message-
From: George Gallen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 27, 2004 2:06 PM
To: 'U2 Users Discussion List'
Subject: RE: looking for faster Ideas...

I can't just check for names; it has to be a name with a specific zip code, and if the name is fairly common, we also add in part of the address to make sure no one else is weeded out that shouldn't be.
I suppose I could keep two or three arrays, do a specific lookup in each saving the position, and if all three positions are identical (assuming all three arrays have the name, address, zip in the same order) then that would be a match. Thanks

George

>-Original Message-
>From: Jeff Schasny [mailto:[EMAIL PROTECTED]]
>Sent: Tuesday, January 27, 2004 1:51 PM
>To: U2 Users Discussion List
>Subject: RE: looking for faster Ideas...
>
>How about keeping a list of excluded names as a record in a
>file (or as a flat file in a directory with each name/item/whatever
>on a line), reading it into the program as a dynamic array, then
>doing a locate on the string in question. Something like this:
>
>READ ALIST FROM AFILE,SOME-ID ELSE STOP
>X = 0
>LOOP
>   X += 1
>   ASTRING = INLIST<X>
>UNTIL ASTRING = ''
>   LOCATE ASTRING IN ALIST SETTING POS THEN
>      DO
>      OTHER
>      STUFF
>
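Mike's hash-file variant of this, reduced to its core, is only a few lines of BASIC. This is a sketch rather than anything posted in the thread: the file name EXCLUDE.NAMES, the key layout (upcased surname plus 5-digit zip) and the variables are assumptions for illustration.

OPEN 'EXCLUDE.NAMES' TO EXC.FILE ELSE STOP 201, 'EXCLUDE.NAMES'
* ... inside the per-line loop, after pulling SURNAME and ZIP5 out of LIN ...
KEY = UPCASE(TRIM(SURNAME)) : '*' : ZIP5
READ XREC FROM EXC.FILE, KEY THEN
   KICK = 1      ;* name + zip is on the opt-out list
END ELSE
   KICK = 0      ;* not found - one failed read, no CASE chain
END

Adding a new opt-out then means writing one record to the file, not editing and recompiling the program.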
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... How about using *nix sort and comm based on a like-structured csv reference file to produce a sub-file of possible hits, then trawl this output using D3/UD to refine the list of unwanted rows (building back into a flat file) and then again using comm to produce your cleaned output file. Cuts down the file size you'll need to process in mv basic Cheers Steve -Original Message-From: George Gallen [mailto:[EMAIL PROTECTED]Sent: 27 January 2004 20:04To: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... keep in mind, it's not the renting company that is giveing us the remove infomation, it's the consumer, and of course they never have the mailing piece in their hand. Although usually, if they call, we can get the specific info we are looking for which can change the case to one check. But when the info is mailed in or emailed in or left on a voice mail, that's when we run into not having the best data to go with. Calling/emailing/mailing them back usually just increases the annoyance level on their end, since we are contacting them Again.. George -Original Message-From: George Gallen [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 2:51 PMTo: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... sometimes there is a number, but rarely, are we given the number when requested to remove, usually just remove me from your $^&#^$*&$ mailing :) some add please. I considered PERL as a pre-processor to remove the names then pass that file to my program which does other stuff too George >-Original Message- >From: Ian McGowan [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 2:22 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >if speed is the issue, sounds like a job for a compiled lanuage. or >semi-compiled like perl or python. > >is there a unique number sent over by the other system? it might be >quicker to parse the whole thing and keep an exclude file keyed off the >unique number. if it weren't for embedded comma's you could >CONVERT "," >TO @AM, extract the key and write the record out as-is. that would be >quicker than 852 INDEX's :-) > >On Tue, 2004-01-27 at 11:05, George Gallen wrote: >> I can't just check for names, it has to a name with a >specific zip code >> and if the name is fairly common, we also add in part of the >address to >> make sure no one else is weeded out that shouldn't be. >> >> I suppose I could keep two or three arrays, do a specific >lookup in each >> >> saving the position, and if all three positions are >identicle (asuming >> all >> three arrays have the name, address, zip in the same order) then that >> would >> be a matchThanks >> >> George >> >> >-Original Message- >> >From: Jeff Schasny [ mailto:[EMAIL PROTECTED] >> <mailto:[EMAIL PROTECTED]> ] >> >Sent: Tuesday, January 27, 2004 1:51 PM >> >To: U2 Users Discussion List >> >Subject: RE: looking for faster Ideas... >> > >> > >> >how about keeping a list of excluded names as a record in a >> >file (or as a >> >flat file in a directory with each name/item/whatever on a >> >line) and reading >> >it into the program as a dynamic array then doing a locate on >> >the string in >> >question. 
Something like this: >> > >> > >> >READ ALIST FROM AFILE,SOME-ID ELSE STOP >> >X = 0 >> >LOOP >> > X += 1 >> > ASTRING = INLIST >> >UNTIL ASTRING = '' >> > LOCATE ASTRING IN ALIST SETTING POS THEN >> > DO >> > OTHER >> > STUFF >> > END ELSE >> > DONT >> > END >> >REPEAT >> > >> >Of course of you really want speed then sort the list and use >> >a "BY clause >> >in the locate >> > >> >-Original Message- >> >From: George Gallen [ mailto:[EMAIL PROTECTED] >> <mailto:[EMAIL PROTECTED]> ] >> >Sent: Tuesday, January 27, 2004 11:33 AM >> >To: 'Ardent List' >> >Subject: looking for faster Ideas... >> > >> > >
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... I like this idea. Thanks George -Original Message-From: Anthony Youngman [mailto:[EMAIL PROTECTED]Sent: Wednesday, January 28, 2004 3:09 AMTo: U2 Users Discussion ListSubject: RE: looking for faster Ideas... You may find contacting them again isn't the annoyance you expect. People tend to get annoyed if they think they're dealing with a computer. Get a genuine person get back to them and say "yes, we're trying to fix this for you", and you've just turned someone from being anti into being a prospective customer. Anyways, my take (to save on all this CASEing ...) - I'd use MATREAD rather than Matt's choice of READ, and ... can you preprocess on the basis of, say, zip code? Have an MV file containing all the records you want excluded or matchcodes thereof. Let's say, John Smith of AB12345 contacts you and says "take me off your list". You check, and his record has the correct zip code in the CSV. So you edit your MV file, and discover that Will Carling also told you to take him off some while back. ED EXCLUDEFILE AB12345 -: P 1: *WILL*CARLING* -: I *JOHN*SMITH* 2: *JOHN*SMITH* -: FI So now, when you're processing your CSV, from each record you can do extract zip code read zip-code-record from EXCLUDEFILE else record is okay if record matches LOWER(zip-code-record) else record is okay get next record Gets rid of reams of case statements, saves you having to rewrite the program every time, and is fast because most records will be validated on a single (failed) MV read. Cheers, Wol From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George GallenSent: 27 January 2004 20:04To: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... But when the info is mailed in or emailed in or left on a voice mail, that's when we run into not having the best data to go with. Calling/emailing/mailing them back usually just increases the annoyance level on their end, since we are contacting them Again.. George***This transmission is intended for the named recipient only. It may contain private and confidential information. If this has come to you in error you must not act on anything disclosed in it, nor must you copy it, modify it, disseminate it in any way, or show it to anyone. Please e-mail the sender to inform us of the transmission error or telephone ECA International immediately and delete the e-mail from your information system.Telephone numbers for ECA International offices are: Sydney +61 (0)2 9911 7799, Hong Kong + 852 2121 2388, London +44 (0)20 7351 5000 and New York +1 212 582 2333.*** ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
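A minimal sketch of Wol's idea in BASIC follows. Assumptions not in his message: EXCLUDEFILE has already been OPENed and is keyed on zip code; each attribute of a record holds one UniVerse MATCHES pattern (for example 0X'JOHN'0X'SMITH'0X, rather than the *JOHN*SMITH* wildcards shown in the ED session); LIN is the upcased CSV line and ZIP has already been extracted from it.

READ XREC FROM EXCLUDEFILE, ZIP THEN
   KICK = 0
   FOR P = 1 TO DCOUNT(XREC, @AM) UNTIL KICK
      IF LIN MATCHES XREC<P> THEN KICK = 1   ;* someone at this zip asked to be removed
   NEXT P
END ELSE
   KICK = 0   ;* no opt-outs at this zip - the common case costs one failed read
END

As Wol says, the fast path is the failed read; the pattern loop only runs for the handful of zips that actually have opt-outs recorded.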
RE: looking for faster Ideas...
Interesting.

George

-Original Message-
From: Stuart Boydell [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 28, 2004 2:43 AM
To: U2 Users Discussion List
Subject: RE: looking for faster Ideas...

Maybe something like this fuzzy text string searcher might work for you: http://www.pmsi.fr/fuzstrng.htm

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George Gallen
Sent: Wednesday, 28 January 2004 09:33
To: 'U2 Users Discussion List'
Subject: RE: looking for faster Ideas...

I thought of that, but soundex only works on the first three letters, if I remember correctly. Or it only encodes the first three letters, then the remaining are unchanged.

The main problem is I can't isolate a last name from the source; it comes in as a full name, and if I use the full name as given to us by the consumer, there is a chance it won't be in the same exact format as in the file from the rental: it might be missing the middle initial, one may have a married hyphenated name, one could be a shortened or different first name (i.e. betty instead of elizabeth, or jack instead of john, etc.).

Since my original was a list of if/thens, it looks like I'm not going to be able to gain much in speed any other way with straight programming (that is, no temp files, or files to bounce off).

George
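For what it is worth, UniVerse BASIC does have a SOUNDEX() function, and the "first three letters" limitation is really that the classic code is one letter plus three digits. A sketch of applying it per word of the name field, so the surname never has to be isolated first - the variables and the example surname are illustrative only:

TARGET = SOUNDEX('SMITH')          ;* code for the opt-out surname
NAMEFLD = UPCASE(CUSTNAME)         ;* e.g. "JON C SMITH" from the CSV line
HIT = 0
FOR W = 1 TO DCOUNT(NAMEFLD, ' ') UNTIL HIT
   IF SOUNDEX(FIELD(NAMEFLD, ' ', W)) = TARGET THEN HIT = 1
NEXT W

This catches Smith/Smyth style misspellings; the betty/elizabeth problem still needs a nickname table or the matchcode/metaphone approaches discussed elsewhere in the thread.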
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... I'm going to have to read this one over a few times. My brain hurt thinking about it :) Thanks George >-Original Message- >From: Craig Bennett [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 6:21 PM >To: U2 Users Discussion List >Subject: Re: looking for faster Ideas... > > >George, > >I don't know if this will help you, but part of the problem with a CASE >statement is that every statement is tested until you have a >match and EVERY >statement is tested if there is no match. If you don't have a >large number >to remove, this can get very wasteful. > >When I need to parse some data and I need to do it fast (and I >don't care >that I may write a very long tedious program (sometime I even write a >program to build the final program)) I find that a state >machine model with >computed gosubs based on ASCII character numbers can be quicker. > >I started writing the code below, but then remembered I had to work :( > >The basic Idea is to only test the characters you need and to >test them one >by one where each letter in a match is another internal subroutine eg: > >LOOP WHILE POS LE (DATALEN - MATCHLEN) DO > * Just match A-Z and we are only looking for names >starting with A and T > CHARCODE = SEQ(MYDATA[POS, 1]) ;* Under UV >BYTEVAL(MYDATA, POS) >is MUCH quicker. > ON CHARCODE + 64 GOSUB NOMATCH, > FIRSTCHARA, > >NOMATCH, ;* B > >NOMATCH, ;* C > >NOMATCH, ;* D > >NOMATCH, ;* E > >NOMATCH, ;* F > >NOMATCH, ;* G > >NOMATCH, ;* H > >NOMATCH, ;* I > >NOMATCH, ;* J > >NOMATCH, ;* K > >NOMATCH, ;* L > >NOMATCH, ;* M > >NOMATCH, ;* N > >NOMATCH, ;* O > >NOMATCH, ;* P > >NOMATCH, ;* Q > >NOMATCH, ;* R > >NOMATCH, ;* S > FIRSTCHART, > >NOMATCH, ;* U > >NOMATCH, ;* V > >NOMATCH, ;* W > >NOMATCH, ;* X > >NOMATCH, ;* Y > >NOMATCH, ;* Z > NOMATCH > >REPEAT > >NOMATCH: > * Set a flag to false > MATCH.NAME = 0 >RETURN > >FIRSTCHARA: > POS += 1 > CHARCODE = SEQ(MYDATA[POS, 1]) > ON CHARCODE + 64 GOSUB NOMATCH, > >NOMATCH, ;* A > SECONDCHARB, > >NOMATCH, ;* C > >NOMATCH, ;* D > >NOMATCH, ;* E > >NOMATCH, ;* F > >NOMATCH, ;* G > >NOMATCH, ;* H > >NOMATCH, ;* I > >NOMATCH, ;* J > >NOMATCH, ;* K > >NOMATCH, ;* L > >NOMA
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... True putting the first check in the case, then checking the 2nd and 3rd... in the body of the case at first sounds good...but if it fails on the 2nd or 3rd check in the body of the case, it will no longer check any other cases, since it had a positive case found, so I have to have all all checks on the case line (does that make any sense?) As for the 2k blocks. If all this program did was weed out names, you are right, that would be a better way to go. However, it also does other things to each line (like put in our own unique mailing code for nixie-returns) for all those that aren't supposed to get kicked out. George >-Original Message- >From: Tony Wood [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 5:28 PM >To: U2 Users Discussion List >Subject: Re: looking for faster Ideas... > > >Hi George, > >We some processing through files of 5-4Mb in D3 and UniVerse. >We found one >of the quicker to process these files was to read about 2k of data at a >time. You would need to identify the last complete line work >with everything >before that keeping the last bit for then next processing chunk. > >As far as finding matches you have one, two or three pieces of >data to match >on.So start with one if you score a match then look further. This will >reduce your processing to quickly find anything that might >match rather than >having to match on everything for every line. Processing in 2k >chunks also >means you can index for "SMITH" and find none quickly rather >than processing >each line looking for "SMITH" + "" and "SMITH" + "MENERE ST". > >I would avoid using index on a line by line basis. I would >also look at what >information you usually get and consider using a record where >the item id is >the key search string. Where you have more than one out-opter >you can then >use either multi-values or attributes to contain the other >search criteria. > >Sounds a little complicated but it breaks the job into smaller >chunks to be >resolved and will require less processing in the long run I believe. > >Good luck > >T. > >- Original Message - >From: "George Gallen" <[EMAIL PROTECTED]> >To: "'Ardent List'" <[EMAIL PROTECTED]> >Sent: Wednesday, January 28, 2004 5:33 AM >Subject: looking for faster Ideas... > > >> I can't setup any indexs to speed this up. Basically I'm >scanning a CSV >file >> for names to remove >> and set the flag of KICK=1 to remove it (creating a new >CSV file at the >> same time). >> >> Keep in mind the ".." are people's last names, or zip codes, >or part of >> their address, changed >> them to ".." to protect the unwanting... >> >> Right now, I do a series of CASE's ... >> Now, it's not a major problem as I'm only checking for 20 or >so names, but >> as more and more people >> request to be removed (and we don't have access to the >creation of the >> list). this could get quite >> slow over 50 or 60 thousand lines of checking. >> >> LIN is one line of the CSV file, the INDEX is checking for a >last name & a >> zip code and sometimes >> part of the address line. >> >> Any Ideas? 
>> >> Remember, we can't change the source of the file, it will >always be a CSV, >> being read line by line >> >> KICK=0 >> BEGIN CASE >> CASE -1 >> KICK=1 >> BEGIN CASE >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >> INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >> INDEX(LIN,"..",1)#0 >> CASE IND
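One way to keep the "all checks on the one CASE line" semantics without hard-coding a CASE per person is to drive the same INDEX() tests from a stored table of criteria. This is only a sketch: the file name KICK.LIST, its layout (one record per opt-out, with the search strings multivalued in attribute 1) and the variable names are assumptions, not anything posted in the thread.

* Load the criteria once, before the per-line loop
OPEN 'KICK.LIST' TO KICK.FILE ELSE STOP 201, 'KICK.LIST'
RLIST = ''
EXECUTE 'SSELECT KICK.LIST'
LOOP
   READNEXT ID ELSE EXIT
   READ CRIT FROM KICK.FILE, ID THEN
      IF CRIT<1> # '' THEN RLIST<-1> = CRIT<1>   ;* e.g. "SMITH":@VM:"12345":@VM:"ANYWHERE"
   END
REPEAT

* For each CSV line: kick it only if EVERY piece for some opt-out is present
KICK = 0
FOR R = 1 TO DCOUNT(RLIST, @AM) UNTIL KICK
   HIT = 1
   FOR V = 1 TO DCOUNT(RLIST<R>, @VM) WHILE HIT
      IF INDEX(LIN, RLIST<R,V>, 1) = 0 THEN HIT = 0
   NEXT V
   IF HIT THEN KICK = 1
NEXT R

The AND logic of the original CASE lines is preserved (every stored piece must be in the line), but a new opt-out becomes a new record rather than another edit to the program.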
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... You may find contacting them again isn't the annoyance you expect. People tend to get annoyed if they think they're dealing with a computer. Get a genuine person get back to them and say "yes, we're trying to fix this for you", and you've just turned someone from being anti into being a prospective customer. Anyways, my take (to save on all this CASEing ...) - I'd use MATREAD rather than Matt's choice of READ, and ... can you preprocess on the basis of, say, zip code? Have an MV file containing all the records you want excluded or matchcodes thereof. Let's say, John Smith of AB12345 contacts you and says "take me off your list". You check, and his record has the correct zip code in the CSV. So you edit your MV file, and discover that Will Carling also told you to take him off some while back. ED EXCLUDEFILE AB12345 -: P 1: *WILL*CARLING* -: I *JOHN*SMITH* 2: *JOHN*SMITH* -: FI So now, when you're processing your CSV, from each record you can do extract zip code read zip-code-record from EXCLUDEFILE else record is okay if record matches LOWER(zip-code-record) else record is okay get next record Gets rid of reams of case statements, saves you having to rewrite the program every time, and is fast because most records will be validated on a single (failed) MV read. Cheers, Wol From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George GallenSent: 27 January 2004 20:04To: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... But when the info is mailed in or emailed in or left on a voice mail, that's when we run into not having the best data to go with. Calling/emailing/mailing them back usually just increases the annoyance level on their end, since we are contacting them Again.. George *** This transmission is intended for the named recipient only. It may contain private and confidential information. If this has come to you in error you must not act on anything disclosed in it, nor must you copy it, modify it, disseminate it in any way, or show it to anyone. Please e-mail the sender to inform us of the transmission error or telephone ECA International immediately and delete the e-mail from your information system. Telephone numbers for ECA International offices are: Sydney +61 (0)2 9911 7799, Hong Kong + 852 2121 2388, London +44 (0)20 7351 5000 and New York +1 212 582 2333. *** ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
RE: looking for faster Ideas...
Maybe something like this fuzzy text string searcher might work for you: http://www.pmsi.fr/fuzstrng.htm

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George Gallen
Sent: Wednesday, 28 January 2004 09:33
To: 'U2 Users Discussion List'
Subject: RE: looking for faster Ideas...

I thought of that, but soundex only works on the first three letters, if I remember correctly. Or it only encodes the first three letters, then the remaining are unchanged.

The main problem is I can't isolate a last name from the source; it comes in as a full name, and if I use the full name as given to us by the consumer, there is a chance it won't be in the same exact format as in the file from the rental: it might be missing the middle initial, one may have a married hyphenated name, one could be a shortened or different first name (i.e. betty instead of elizabeth, or jack instead of john, etc.).

Since my original was a list of if/thens, it looks like I'm not going to be able to gain much in speed any other way with straight programming (that is, no temp files, or files to bounce off).

George
Re: looking for faster Ideas...
George, my guess is that you use sequential IO on the CSV file and that is what eats time. If you have memory enough to read the entire file into memory and REMOVE lines instead of READSEQing them, you'll see a _dramatic_ performance increase.

Else split your case into, say, four cases of roughly the same length, with all last names sorting before, say, 'F' in the first one, those between F and M in the second, etc. This approach will reduce the time in the CASE constructs by a factor of four. Which of course may turn out to be only some percent of total time :-(

/Mats

George Gallen wrote:

I can't setup any indexes to speed this up. Basically I'm scanning a CSV file for names to remove and set the flag of KICK=1 to remove it (creating a new CSV file at the same time).

Keep in mind the ".." are people's last names, or zip codes, or part of their address; I changed them to ".." to protect the unwanting...

Right now, I do a series of CASE's ... Now, it's not a major problem as I'm only checking for 20 or so names, but as more and more people request to be removed (and we don't have access to the creation of the list), this could get quite slow over 50 or 60 thousand lines of checking.

LIN is one line of the CSV file; the INDEX is checking for a last name & a zip code and sometimes part of the address line.

Any Ideas? Remember, we can't change the source of the file; it will always be a CSV, being read line by line.

KICK=0
BEGIN CASE
  CASE -1
    KICK=1
    BEGIN CASE
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
      CASE -1
        KICK=0
    END CASE
END CASE

George Gallen
Senior Programmer/Analyst
Accounting/Data Division
[EMAIL PROTECTED]
ph: 856.848.1000 Ext 220

SLACK Incorporated - An innovative information, education and management company
http://www.slackinc.com
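A sketch of what Mats is describing, assuming the CSV sits in a type 19 (directory) file so the whole thing can be read as one dynamic array with the newlines becoming field marks; the file and record names are made up for the example. REMOVE keeps an internal pointer into the string, so it does not rescan from the start the way LIN = BUF<N> would on each pass.

OPEN 'CSV.DIR' TO DIRFILE ELSE STOP 201, 'CSV.DIR'
READ WHOLE FROM DIRFILE, 'maillist.csv' ELSE STOP 202, 'maillist.csv'

OUT = ''
MORE = 1
LOOP WHILE MORE DO
   REMOVE LIN FROM WHOLE SETTING MORE   ;* next line, no rescan of the big string
   GOSUB CHECK.LINE                     ;* sets KICK from LIN, same tests as before
   IF NOT(KICK) THEN OUT := LIN : @AM
REPEAT
IF OUT # '' THEN OUT = OUT[1, LEN(OUT) - 1]   ;* drop the trailing field mark
WRITE OUT TO DIRFILE, 'maillist.out'

The same per-line checking logic stays in CHECK.LINE; only the IO pattern changes.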
Re: looking for faster Ideas...
George, I don't know if this will help you, but part of the problem with a CASE statement is that every statement is tested until you have a match and EVERY statement is tested if there is no match. If you don't have a large number to remove, this can get very wasteful. When I need to parse some data and I need to do it fast (and I don't care that I may write a very long tedious program (sometime I even write a program to build the final program)) I find that a state machine model with computed gosubs based on ASCII character numbers can be quicker. I started writing the code below, but then remembered I had to work :( The basic Idea is to only test the characters you need and to test them one by one where each letter in a match is another internal subroutine eg: LOOP WHILE POS LE (DATALEN - MATCHLEN) DO * Just match A-Z and we are only looking for names starting with A and T CHARCODE = SEQ(MYDATA[POS, 1]);* Under UV BYTEVAL(MYDATA, POS) is MUCH quicker. ON CHARCODE + 64 GOSUB NOMATCH, FIRSTCHARA, NOMATCH,;* B NOMATCH,;* C NOMATCH,;* D NOMATCH,;* E NOMATCH,;* F NOMATCH,;* G NOMATCH,;* H NOMATCH,;* I NOMATCH,;* J NOMATCH,;* K NOMATCH,;* L NOMATCH,;* M NOMATCH,;* N NOMATCH,;* O NOMATCH,;* P NOMATCH,;* Q NOMATCH,;* R NOMATCH,;* S FIRSTCHART, NOMATCH,;* U NOMATCH,;* V NOMATCH,;* W NOMATCH,;* X NOMATCH,;* Y NOMATCH,;* Z NOMATCH REPEAT NOMATCH: * Set a flag to false MATCH.NAME = 0 RETURN FIRSTCHARA: POS += 1 CHARCODE = SEQ(MYDATA[POS, 1]) ON CHARCODE + 64 GOSUB NOMATCH, NOMATCH,;* A SECONDCHARB, NOMATCH,;* C NOMATCH,;* D NOMATCH,;* E NOMATCH,;* F NOMATCH,;* G NOMATCH,;* H NOMATCH,;* I NOMATCH,;* J NOMATCH,;* K NOMATCH,;* L NOMATCH,;* M NOMATCH,;* N NOMATCH,;* O NOMATCH,;* P NOMATCH,;* Q NOMATCH,;* R NOMATCH,;* S SECONDCHART, NOMATCH,;* U NOMATCH,;* V NOMATCH,;* W NOMATCH,;* X NOMATCH,;* Y
Re: looking for faster Ideas...
Hi George,

We do some processing through files of 4-5Mb in D3 and UniVerse. We found one of the quicker ways to process these files was to read about 2k of data at a time. You would need to identify the last complete line, work with everything before that, and keep the last bit for the next processing chunk.

As far as finding matches, you have one, two or three pieces of data to match on. So start with one; if you score a match then look further. This will reduce your processing to quickly find anything that might match, rather than having to match on everything for every line. Processing in 2k chunks also means you can index for "SMITH" and find none quickly, rather than processing each line looking for "SMITH" + "" and "SMITH" + "MENERE ST".

I would avoid using index on a line by line basis. I would also look at what information you usually get and consider using a record where the item id is the key search string. Where you have more than one opt-outer you can then use either multi-values or attributes to contain the other search criteria.

Sounds a little complicated, but it breaks the job into smaller chunks to be resolved and will require less processing in the long run, I believe.

Good luck

T.

- Original Message -
From: "George Gallen" <[EMAIL PROTECTED]>
To: "'Ardent List'" <[EMAIL PROTECTED]>
Sent: Wednesday, January 28, 2004 5:33 AM
Subject: looking for faster Ideas...

> I can't setup any indexes to speed this up. Basically I'm scanning a CSV file
> for names to remove and set the flag of KICK=1 to remove it (creating a new
> CSV file at the same time).
>
> Keep in mind the ".." are people's last names, or zip codes, or part of
> their address, changed them to ".." to protect the unwanting...
>
> Right now, I do a series of CASE's ...
> Now, it's not a major problem as I'm only checking for 20 or so names, but
> as more and more people request to be removed (and we don't have access to
> the creation of the list), this could get quite slow over 50 or 60 thousand
> lines of checking.
>
> LIN is one line of the CSV file, the INDEX is checking for a last name & a
> zip code and sometimes part of the address line.
>
> Any Ideas?
> > Remember, we can't change the source of the file, it will always be a CSV, > being read line by line > >KICK=0 >BEGIN CASE > CASE -1 > KICK=1 > BEGIN CASE > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE -1 >KICK=0 > END CASE >END CASE > > George Gallen > Senior Programmer/Analyst > Accounting/Data Division > [EMAIL PROTECTED] > ph:856.848.1000 Ext 220 > > SLACK Incorporated - An innovative information, education and management > company > http://www.slackinc.com > > ___ > u2-users mailing list > [EMAIL PROTECTED] > http://www.oliver.com/mailman/listinfo/u2-users > > ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
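A sketch of the chunked read Tony describes, for UniVerse: READBLK pulls a fixed number of bytes from a sequential file, and the partial last line is carried over to the next chunk. The file names, the 2048-byte chunk size and the CHECK.BUFFER routine are illustrative assumptions, not code from the thread.

OPENSEQ 'CSVDIR', 'maillist.csv' TO SEQF ELSE STOP 'cannot open input'
CARRY = ''
EOF = 0
LOOP UNTIL EOF DO
   READBLK CHUNK FROM SEQF, 2048 THEN
      BUF = CARRY : CHUNK
      CUT = 0
      FOR I = LEN(BUF) TO 1 STEP -1 UNTIL CUT   ;* find the last complete line
         IF BUF[I,1] = CHAR(10) THEN CUT = I
      NEXT I
      IF CUT THEN
         WORK = BUF[1, CUT]                     ;* whole lines only
         CARRY = BUF[CUT + 1, LEN(BUF)]
         GOSUB CHECK.BUFFER   ;* cheap INDEX(WORK,'SMITH',1) first, detail checks only on a hit
      END ELSE
         CARRY = BUF                            ;* no newline yet, keep accumulating
      END
   END ELSE
      EOF = 1
      IF CARRY # '' THEN WORK = CARRY ; GOSUB CHECK.BUFFER
   END
REPEAT
CLOSESEQ SEQF

The point of the coarse INDEX on the whole 2k buffer is the one Tony makes: most chunks contain no candidate surname at all, so they never reach the per-line tests.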
RE: looking for faster Ideas...
http://aspell.sourceforge.net/metaphone/metaphone.basic soundex is pathetic - nowadays, metaphone is much better. if you're feeling perl'ish http://www.foo.be/docs/tpj/issues/vol5_3/tpj0503-0009.html has an interesting discussion of using several approximate methods for identifying records by name. it even discusses the betty/elizabeth, jack/john problem... looks slow so you would probably have to cache the results. c'mon there must be *something* unique in the file they send! :-) On Tue, 2004-01-27 at 14:32, George Gallen wrote: > I thought of that, but soundex only works on the first three letters, if > I remember correctly. > or it only encodes the first three letters, then remaining are > unchanged. > > The main problem is I can't isolate a last name from the source, it > comes in as a full name, > and if I use the full name as given to us by the consumer, there is a > chance it won't be in > the same exact format as in the file from the rental, might be missing > the middle initial > one may have a married hyphenated name, one could be a shortened or > different first name > (ie. betty instead of elizabeth, or jack instead john..etc). > > Since my original was a list of if/thens, looks like the I'm not going > to be able to gain much > in speed any other way with straight programming (that is no temp files, > or files to bounce off). > > George > > -Original Message- > From: Jeff Schasny [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 27, 2004 5:12 PM > To: U2 Users Discussion List > Subject: RE: looking for faster Ideas... > > > I suppose you could soundex the whole thing > > -Original Message- > From: Geoffrey Mitchell [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 27, 2004 2:59 PM > To: U2 Users Discussion List > Subject: RE: looking for faster Ideas... > > > We do something like this, using a "match code" composed of fragments of > data concatenated together. I think we use a delimiter, but you > wouldn't need to. > > So, if you want to match Johnson in zipcode 12345 on Maple street, you > might have a matchcode of "JOHNSON*12345*MAPLE", so you would extract > the relevant fields, build the matchcode and check it against a list or > file. Actually, we use an I-type dictionary to generate the matchcode, > and have an index built on it. For small datasets this may be *slower* > than your case statement, but I would think that it would be easier to > maintain, and for large datasets it should be quicker since the time to > construct the matchcode and do a read, selectindex, or whatever would be > constant. Of course, if you have a Jonsson that gets spelled Johnson, > you're going to have problems no matter how you approach it. > > On Tue, 2004-01-27 at 13:05, George Gallen wrote: > > I can't just check for names, it has to a name with a specific zip code > and if the name is fairly common, we also add in part of the address to > make sure no one else is weeded out that shouldn't be. > > I suppose I could keep two or three arrays, do a specific lookup in each > saving the position, and if all three positions are identicle (asuming > all > three arrays have the name, address, zip in the same order) then that > would > be a matchThanks > > George > > >-Original Message- > >From: Jeff Schasny [ <mailto:[EMAIL PROTECTED]> > mailto:[EMAIL PROTECTED] > >Sent: Tuesday, January 27, 2004 1:51 PM > >To: U2 Users Discussion List > >Subject: RE: looking for faster Ideas... 
> > > > > >how about keeping a list of excluded names as a record in a > >file (or as a > >flat file in a directory with each name/item/whatever on a > >line) and reading > >it into the program as a dynamic array then doing a locate on > >the string in > >question. Something like this: > > > > > >READ ALIST FROM AFILE,SOME-ID ELSE STOP > >X = 0 > >LOOP > > X += 1 > > ASTRING = INLIST > >UNTIL ASTRING = '' > > LOCATE ASTRING IN ALIST SETTING POS THEN > > DO > > OTHER > > STUFF > > END ELSE > > DONT > > END > >REPEAT > > > >Of course of you really want speed then sort the list and use > >a "BY clause > >in the locate > > > >-Original Message- > >From: George Gallen [ <mailto:[EMAIL PROTECTED]> > mailto:[EMAIL PROTECTED] > >Sent: Tuesday, January 27, 2004 11:33 AM > >To: 'Ardent List' > >Subject: looking for faster Ideas... > > > > > >I can't setup any indexs to s
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... More than you could have ever possibly wanted to know about soundex: http://www.avotaynu.com/soundex.html -Original Message-From: George Gallen [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 3:33 PMTo: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... I thought of that, but soundex only works on the first three letters, if I remember correctly. or it only encodes the first three letters, then remaining are unchanged. The main problem is I can't isolate a last name from the source, it comes in as a full name, and if I use the full name as given to us by the consumer, there is a chance it won't be in the same exact format as in the file from the rental, might be missing the middle initial one may have a married hyphenated name, one could be a shortened or different first name (ie. betty instead of elizabeth, or jack instead john..etc). Since my original was a list of if/thens, looks like the I'm not going to be able to gain much in speed any other way with straight programming (that is no temp files, or files to bounce off). George -Original Message-From: Jeff Schasny [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 5:12 PMTo: U2 Users Discussion ListSubject: RE: looking for faster Ideas... I suppose you could soundex the whole thing -Original Message-From: Geoffrey Mitchell [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 2:59 PMTo: U2 Users Discussion ListSubject: RE: looking for faster Ideas...We do something like this, using a "match code" composed of fragments of data concatenated together. I think we use a delimiter, but you wouldn't need to.So, if you want to match Johnson in zipcode 12345 on Maple street, you might have a matchcode of "JOHNSON*12345*MAPLE", so you would extract the relevant fields, build the matchcode and check it against a list or file. Actually, we use an I-type dictionary to generate the matchcode, and have an index built on it. For small datasets this may be *slower* than your case statement, but I would think that it would be easier to maintain, and for large datasets it should be quicker since the time to construct the matchcode and do a read, selectindex, or whatever would be constant. Of course, if you have a Jonsson that gets spelled Johnson, you're going to have problems no matter how you approach it.On Tue, 2004-01-27 at 13:05, George Gallen wrote: I can't just check for names, it has to a name with a specific zip codeand if the name is fairly common, we also add in part of the address tomake sure no one else is weeded out that shouldn't be.I suppose I could keep two or three arrays, do a specific lookup in eachsaving the position, and if all three positions are identicle (asuming allthree arrays have the name, address, zip in the same order) then that wouldbe a matchThanksGeorge>-Original Message->From: Jeff Schasny [mailto:[EMAIL PROTECTED]]>Sent: Tuesday, January 27, 2004 1:51 PM>To: U2 Users Discussion List>Subject: RE: looking for faster Ideas...>>>how about keeping a list of excluded names as a record in a >file (or as a>flat file in a directory with each name/item/whatever on a >line) and reading>it into the program as a dynamic array then doing a locate on >the string in>question. 
Something like this:>>>READ ALIST FROM AFILE,SOME-ID ELSE STOP>X = 0>LOOP> X += 1> ASTRING = INLIST>UNTIL ASTRING = ''> LOCATE ASTRING IN ALIST SETTING POS THEN> DO> OTHER> STUFF> END ELSE> DONT> END>REPEAT>>Of course of you really want speed then sort the list and use >a "BY clause>in the locate>>-Original Message->From: George Gallen [mailto:[EMAIL PROTECTED]]>Sent: Tuesday, January 27, 2004 11:33 AM>To: 'Ardent List'>Subject: looking for faster Ideas...>>>I can't setup any indexs to speed this up. Basically I'm >scanning a CSV file>for names to remove> and set the flag of KICK=1 to remove it (creating a new CSV >file at the>same time).>>Keep in mind the ".." are people's last names, or zip codes, or part of>their address, changed>them to ".." to protect the unwanting...>>Right n
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... I thought of that, but soundex only works on the first three letters, if I remember correctly. or it only encodes the first three letters, then remaining are unchanged. The main problem is I can't isolate a last name from the source, it comes in as a full name, and if I use the full name as given to us by the consumer, there is a chance it won't be in the same exact format as in the file from the rental, might be missing the middle initial one may have a married hyphenated name, one could be a shortened or different first name (ie. betty instead of elizabeth, or jack instead john..etc). Since my original was a list of if/thens, looks like the I'm not going to be able to gain much in speed any other way with straight programming (that is no temp files, or files to bounce off). George -Original Message-From: Jeff Schasny [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 5:12 PMTo: U2 Users Discussion ListSubject: RE: looking for faster Ideas... I suppose you could soundex the whole thing -Original Message-From: Geoffrey Mitchell [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 2:59 PMTo: U2 Users Discussion ListSubject: RE: looking for faster Ideas...We do something like this, using a "match code" composed of fragments of data concatenated together. I think we use a delimiter, but you wouldn't need to.So, if you want to match Johnson in zipcode 12345 on Maple street, you might have a matchcode of "JOHNSON*12345*MAPLE", so you would extract the relevant fields, build the matchcode and check it against a list or file. Actually, we use an I-type dictionary to generate the matchcode, and have an index built on it. For small datasets this may be *slower* than your case statement, but I would think that it would be easier to maintain, and for large datasets it should be quicker since the time to construct the matchcode and do a read, selectindex, or whatever would be constant. Of course, if you have a Jonsson that gets spelled Johnson, you're going to have problems no matter how you approach it.On Tue, 2004-01-27 at 13:05, George Gallen wrote: I can't just check for names, it has to a name with a specific zip codeand if the name is fairly common, we also add in part of the address tomake sure no one else is weeded out that shouldn't be.I suppose I could keep two or three arrays, do a specific lookup in eachsaving the position, and if all three positions are identicle (asuming allthree arrays have the name, address, zip in the same order) then that wouldbe a matchThanksGeorge>-Original Message->From: Jeff Schasny [mailto:[EMAIL PROTECTED]]>Sent: Tuesday, January 27, 2004 1:51 PM>To: U2 Users Discussion List>Subject: RE: looking for faster Ideas...>>>how about keeping a list of excluded names as a record in a >file (or as a>flat file in a directory with each name/item/whatever on a >line) and reading>it into the program as a dynamic array then doing a locate on >the string in>question. Something like this:>>>READ ALIST FROM AFILE,SOME-ID ELSE STOP>X = 0>LOOP> X += 1> ASTRING = INLIST>UNTIL ASTRING = ''> LOCATE ASTRING IN ALIST SETTING POS THEN> DO> OTHER> STUFF> END ELSE> DONT> END>REPEAT>>Of course of you really want speed then sort the list and use >a "BY clause>in the locate>>-Original Message->From: George Gallen [mailto:[EMAIL PROTECTED]]>Sent: Tuesday, January 27, 2004 11:33 AM>To: 'Ardent List'>Subject: looking for faster Ideas...>>>I can't setup any indexs to speed this up. 
Basically I'm >scanning a CSV file>for names to remove> and set the flag of KICK=1 to remove it (creating a new CSV >file at the>same time).>>Keep in mind the ".." are people's last names, or zip codes, or part of>their address, changed>them to ".." to protect the unwanting...>>Right now, I do a series of CASE's ...>Now, it's not a major problem as I'm only checking for 20 or >so names, but>as more and more people> request to be removed (and we don't have access to the >creation of the>list). this could get quite> slow over 50 or 60 thousand lines of checking.>>LIN is one line of the CSV file, the INDEX is checking for a >last name & a>zip code and sometimes> par
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... I suppose you could soundex the whole thing -Original Message-From: Geoffrey Mitchell [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 2:59 PMTo: U2 Users Discussion ListSubject: RE: looking for faster Ideas...We do something like this, using a "match code" composed of fragments of data concatenated together. I think we use a delimiter, but you wouldn't need to.So, if you want to match Johnson in zipcode 12345 on Maple street, you might have a matchcode of "JOHNSON*12345*MAPLE", so you would extract the relevant fields, build the matchcode and check it against a list or file. Actually, we use an I-type dictionary to generate the matchcode, and have an index built on it. For small datasets this may be *slower* than your case statement, but I would think that it would be easier to maintain, and for large datasets it should be quicker since the time to construct the matchcode and do a read, selectindex, or whatever would be constant. Of course, if you have a Jonsson that gets spelled Johnson, you're going to have problems no matter how you approach it.On Tue, 2004-01-27 at 13:05, George Gallen wrote: I can't just check for names, it has to a name with a specific zip codeand if the name is fairly common, we also add in part of the address tomake sure no one else is weeded out that shouldn't be.I suppose I could keep two or three arrays, do a specific lookup in eachsaving the position, and if all three positions are identicle (asuming allthree arrays have the name, address, zip in the same order) then that wouldbe a matchThanksGeorge>-Original Message->From: Jeff Schasny [mailto:[EMAIL PROTECTED]]>Sent: Tuesday, January 27, 2004 1:51 PM>To: U2 Users Discussion List>Subject: RE: looking for faster Ideas...>>>how about keeping a list of excluded names as a record in a >file (or as a>flat file in a directory with each name/item/whatever on a >line) and reading>it into the program as a dynamic array then doing a locate on >the string in>question. Something like this:>>>READ ALIST FROM AFILE,SOME-ID ELSE STOP>X = 0>LOOP> X += 1> ASTRING = INLIST>UNTIL ASTRING = ''> LOCATE ASTRING IN ALIST SETTING POS THEN> DO> OTHER> STUFF> END ELSE> DONT> END>REPEAT>>Of course of you really want speed then sort the list and use >a "BY clause>in the locate>>-Original Message->From: George Gallen [mailto:[EMAIL PROTECTED]]>Sent: Tuesday, January 27, 2004 11:33 AM>To: 'Ardent List'>Subject: looking for faster Ideas...>>>I can't setup any indexs to speed this up. Basically I'm >scanning a CSV file>for names to remove> and set the flag of KICK=1 to remove it (creating a new CSV >file at the>same time).>>Keep in mind the ".." are people's last names, or zip codes, or part of>their address, changed>them to ".." to protect the unwanting...>>Right now, I do a series of CASE's ...>Now, it's not a major problem as I'm only checking for 20 or >so names, but>as more and more people> request to be removed (and we don't have access to the >creation of the>list). 
this could get quite> slow over 50 or 60 thousand lines of checking.>>LIN is one line of the CSV file, the INDEX is checking for a >last name & a>zip code and sometimes> part of the address line.>>Any Ideas?>>Remember, we can't change the source of the file, it will >always be a CSV,>being read line by line>> KICK=0> BEGIN CASE> CASE -1> KICK=1> BEGIN CASE> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND>INDEX(LIN,"..",1)#0> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... We do something like this, using a "match code" composed of fragments of data concatenated together. I think we use a delimiter, but you wouldn't need to. So, if you want to match Johnson in zipcode 12345 on Maple street, you might have a matchcode of "JOHNSON*12345*MAPLE", so you would extract the relevant fields, build the matchcode and check it against a list or file. Actually, we use an I-type dictionary to generate the matchcode, and have an index built on it. For small datasets this may be *slower* than your case statement, but I would think that it would be easier to maintain, and for large datasets it should be quicker since the time to construct the matchcode and do a read, selectindex, or whatever would be constant. Of course, if you have a Jonsson that gets spelled Johnson, you're going to have problems no matter how you approach it. On Tue, 2004-01-27 at 13:05, George Gallen wrote: I can't just check for names, it has to a name with a specific zip code and if the name is fairly common, we also add in part of the address to make sure no one else is weeded out that shouldn't be. I suppose I could keep two or three arrays, do a specific lookup in each saving the position, and if all three positions are identicle (asuming all three arrays have the name, address, zip in the same order) then that would be a matchThanks George >-Original Message- >From: Jeff Schasny [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 1:51 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >how about keeping a list of excluded names as a record in a >file (or as a >flat file in a directory with each name/item/whatever on a >line) and reading >it into the program as a dynamic array then doing a locate on >the string in >question. Something like this: > > >READ ALIST FROM AFILE,SOME-ID ELSE STOP >X = 0 >LOOP > X += 1 > ASTRING = INLIST >UNTIL ASTRING = '' > LOCATE ASTRING IN ALIST SETTING POS THEN > DO > OTHER > STUFF > END ELSE > DONT > END >REPEAT > >Of course of you really want speed then sort the list and use >a "BY clause >in the locate > >-Original Message- >From: George Gallen [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 11:33 AM >To: 'Ardent List' >Subject: looking for faster Ideas... > > >I can't setup any indexs to speed this up. Basically I'm >scanning a CSV file >for names to remove > and set the flag of KICK=1 to remove it (creating a new CSV >file at the >same time). > >Keep in mind the ".." are people's last names, or zip codes, or part of >their address, changed >them to ".." to protect the unwanting... > >Right now, I do a series of CASE's ... >Now, it's not a major problem as I'm only checking for 20 or >so names, but >as more and more people > request to be removed (and we don't have access to the >creation of the >list). this could get quite > slow over 50 or 60 thousand lines of checking. > >LIN is one line of the CSV file, the INDEX is checking for a >last name & a >zip code and sometimes > part of the address line. > >Any Ideas? 
> >Remember, we can't change the source of the file, it will >always be a CSV, >being read line by line > > KICK=0 > BEGIN CASE > CASE -1 > KICK=1 > BEGIN CASE > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#
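A sketch of the matchcode idea applied directly to George's CSV line, rather than Geoffrey's I-descriptor (which belongs on a real file). Assumptions: every field in the line is quoted, so the three-character sequence '","' is a safe separator; field 2 is the name, field 3 the street, field 8 the zip; MATCHCODES is a hashed exclude file keyed on the code.

REC = CHANGE(LIN, '","', @AM)            ;* split the quoted CSV fields
REC = CHANGE(REC, '"', '')               ;* drop the outer quotes
CUSTNAME = UPCASE(TRIM(REC<2>))
STREET   = UPCASE(TRIM(REC<3>))
ZIP5     = REC<8>[1,5]
SURNAME  = FIELD(CUSTNAME, ' ', DCOUNT(CUSTNAME, ' '))   ;* last word of the name field
MCODE    = SURNAME : '*' : ZIP5 : '*' : FIELD(STREET, ' ', 2)
READ DUMMY FROM MATCHCODES, MCODE THEN KICK = 1 ELSE KICK = 0

Splitting on '","' instead of plain commas also sidesteps the embedded-comma problem Ian mentioned, provided the vendor really does quote every field.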
RE: looking for faster Ideas...
What is it considered if you run the perl program through perl2exe? Is it compiled then, or still interpreted with a big library?

George

>-Original Message-
>From: Jeff Schasny [mailto:[EMAIL PROTECTED]]
>Sent: Tuesday, January 27, 2004 3:46 PM
>To: U2 Users Discussion List
>Subject: RE: looking for faster Ideas...
>
>What? As opposed to Uni/UV/Pick Basic? Surprise! It compiles to pseudocode
>just like java. Now if you were to have proposed "C", Fortran, Assembler,
>etc. I could see your point.
>
>-Original Message-
>From: Ian McGowan [mailto:[EMAIL PROTECTED]]
>Sent: Tuesday, January 27, 2004 12:22 PM
>To: U2 Users Discussion List
>Subject: RE: looking for faster Ideas...
>
>If speed is the issue, sounds like a job for a compiled language, or
>semi-compiled like perl or python.
>
>[snip]
RE: looking for faster Ideas...
What? As opposed to Uni/UV/Pick Basic? Surprise! It compiles to pseudocode just like java. Now if you were to have proposed "C", Fortran, Assembler, etc., I could see your point.

-Original Message-
From: Ian McGowan [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 27, 2004 12:22 PM
To: U2 Users Discussion List
Subject: RE: looking for faster Ideas...

If speed is the issue, sounds like a job for a compiled language, or semi-compiled like perl or python.

[snip]
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas...

Keep in mind, it's not the renting company that is giving us the removal information, it's the consumer, and of course they never have the mailing piece in their hand. Although usually, if they call, we can get the specific info we are looking for, which can reduce the match to a single check. But when the info is mailed in, emailed in, or left on a voice mail, that's when we run into not having the best data to go on. Calling/emailing/mailing them back usually just increases the annoyance level on their end, since we are contacting them again...

George

-Original Message-
From: George Gallen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 27, 2004 2:51 PM
To: 'U2 Users Discussion List'
Subject: RE: looking for faster Ideas...

sometimes there is a number, but rarely are we given the number when asked to remove someone; usually it's just "remove me from your $^&#^$*&$ mailing" :) (some add "please"). I considered PERL as a pre-processor to remove the names, then pass that file to my program, which does other stuff too.

George

>-Original Message- >From: Ian McGowan [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 2:22 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >if speed is the issue, sounds like a job for a compiled lanuage. or >semi-compiled like perl or python. > >is there a unique number sent over by the other system? it might be >quicker to parse the whole thing and keep an exclude file keyed off the >unique number. if it weren't for embedded comma's you could >CONVERT "," >TO @AM, extract the key and write the record out as-is. that would be >quicker than 852 INDEX's :-) > >On Tue, 2004-01-27 at 11:05, George Gallen wrote: >> I can't just check for names, it has to a name with a >specific zip code >> and if the name is fairly common, we also add in part of the >address to >> make sure no one else is weeded out that shouldn't be. >> >> I suppose I could keep two or three arrays, do a specific >lookup in each >> >> saving the position, and if all three positions are >identicle (asuming >> all >> three arrays have the name, address, zip in the same order) then that >> would >> be a matchThanks >> >> George >> >> >-----Original Message- >> >From: Jeff Schasny [ mailto:[EMAIL PROTECTED] >> <mailto:[EMAIL PROTECTED]> ] >> >Sent: Tuesday, January 27, 2004 1:51 PM >> >To: U2 Users Discussion List >> >Subject: RE: looking for faster Ideas... >> > >> > >> >how about keeping a list of excluded names as a record in a >> >file (or as a >> >flat file in a directory with each name/item/whatever on a >> >line) and reading >> >it into the program as a dynamic array then doing a locate on >> >the string in >> >question. Something like this: >> > >> > >> >READ ALIST FROM AFILE,SOME-ID ELSE STOP >> >X = 0 >> >LOOP >> > X += 1 >> > ASTRING = INLIST >> >UNTIL ASTRING = '' >> > LOCATE ASTRING IN ALIST SETTING POS THEN >> > DO >> > OTHER >> > STUFF >> > END ELSE >> > DONT >> > END >> >REPEAT >> > >> >Of course of you really want speed then sort the list and use >> >a "BY clause >> >in the locate >> > >> >-Original Message- >> >From: George Gallen [ mailto:[EMAIL PROTECTED] >> <mailto:[EMAIL PROTECTED]> ] >> >Sent: Tuesday, January 27, 2004 11:33 AM >> >To: 'Ardent List' >> >Subject: looking for faster Ideas... >> > >> > >> >I can't setup any indexs to speed this up. Basically I'm >> >scanning a CSV file >> >for names to remove >> > and set the flag of KICK=1 to remove it (creating a new CSV >> >file at the >> >same time). >> > >> >Keep in mind the ".." are people's last names, or zip >codes, or part of >> >> >their address, changed >them to ".." to protect the unwanting... >> > >> >Right now, I do a series of CASE's ... >Now, it's not a major problem as I'm only checking for 20 or >> >so names, but >> >as more and more people
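Since the hard-coded series of CASE statements is the part that won't scale, the same INDEX() matching can be driven from data instead: load the name/zip/address fragments once and loop over them for every CSV line. A minimal sketch in BASIC, assuming a hypothetical EXCLUDES file (attribute 1 = last name, attribute 2 = zip, attribute 3 = optional address fragment, all stored uppercase); the file and variable names are illustrative, not George's actual program:

   * Minimal sketch: drive the INDEX() tests from an exclude file instead of
   * hard-coded CASE statements.  EXCLUDES is a made-up file whose records hold
   * last name in attribute 1, zip in attribute 2 and an optional address
   * fragment in attribute 3, all already upcased.
   OPEN 'EXCLUDES' TO EXC.FILE ELSE STOP 'no EXCLUDES file'
   EXC.LIST = ''
   SELECT EXC.FILE
   LOOP
      READNEXT EXC.ID ELSE EXIT
      READ EXC.REC FROM EXC.FILE, EXC.ID THEN
         EXC.LIST<-1> = EXC.REC<1>:@VM:EXC.REC<2>:@VM:EXC.REC<3>
      END
   REPEAT
   * Then, inside the existing read loop, LIN is one CSV line as before.
   ULIN = UPCASE(LIN)
   KICK = 0
   NCRIT = DCOUNT(EXC.LIST,@AM)
   FOR I = 1 TO NCRIT UNTIL KICK
      HIT = 1
      FOR P = 1 TO 3
         PIECE = EXC.LIST<I,P>
         IF PIECE # '' AND INDEX(ULIN,PIECE,1) = 0 THEN HIT = 0
      NEXT P
      IF HIT THEN KICK = 1
   NEXT I

With something along these lines, adding a new opt-out request means writing one more EXCLUDES record rather than editing and recompiling the program.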
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... Mike, doing what you propose would require a massive file to start with, and would require a crap load of disk reads, which would be far slower then a bunch of cases, and the project isn't worth that kind of investment anyway. But thanks. the source line would look something like "","jon c smith","1234 anywhere st","","","somecity","SS","12345-1254","" I'm looking for "smith" & "12345" and sometimes "anywhere" We may get a call from john smith (john not jon because they didn't spell their first name), didn't leave their middle init and didn't give us their 9 digit zip, only 5 digit zip. So I can't build any indexes. Searching for multiple pieces on the same line pretty much gives a fairly good matchup considing the source and match data aren't EXACTLY the same. Any of course, I'm not going to go hog wild in doing this. Creating a temp file, parsing into dynamic arrays loops and lookups...way too much, rather just use PERL to pre-process. -Original Message-From: Mike Rajkowski [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 2:41 PMTo: U2 Users Discussion ListSubject: RE: looking for faster Ideas... Create a temp file, and populate it with variations of the name in question (upcase and remove spaces). (Storing address information in each record) Then loop through your list, taking the name, and parsing the various combinations of the words. ( John David Doe - JOHNDOE, DOEJOHN JOHNDAVIDDOE, JOHNDOEDAVID) And attempt to read the item from the temp file, if it can read an item then verify the address information. Otherwise check the next item. -Original Message-From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George GallenSent: Tuesday, January 27, 2004 12:13 PMTo: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... in rethinking my take on that. That would still be difficult since the arrays would only contain "parts" of the whole fields. making the searching of the arrays very difficult. We can't store the exact entry, since sometimes people will call and say stop sending me things and not give us the name the same way it's in the database we rent. Basically it takes the renting company a couple months to remove the name, but we like to filter it immediately to stop anything from going out before the renting company removes it, and it also will catch it if the renting company replaces it in a couple months later.... George -Original Message-From: George Gallen [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 2:06 PMTo: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... I can't just check for names, it has to a name with a specific zip code and if the name is fairly common, we also add in part of the address to make sure no one else is weeded out that shouldn't be. I suppose I could keep two or three arrays, do a specific lookup in each saving the position, and if all three positions are identicle (asuming all three arrays have the name, address, zip in the same order) then that would be a match....Thanks George >-Original Message- >From: Jeff Schasny [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 1:51 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >how about keeping a list of excluded names as a record in a >file (or as a >flat file in a directory with each name/item/whatever on a >line) and reading >it into the program as a dynamic array then doing a locate on >the string in >question. 
Something like this: > > >READ ALIST FROM AFILE,SOME-ID ELSE STOP >X = 0 >LOOP > X += 1 > ASTRING = INLIST >UNTIL ASTRING = '' > LOCATE ASTRING IN ALIST SETTING POS THEN > DO > OTHER > STUFF > END ELSE > DONT > END >REPEAT > >Of course of you really want speed then sort the list and use >a "BY clause >in the locate > >-Original Message- ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... sometimes there is a number, but rarely, are we given the number when requested to remove, usually just remove me from your $^&#^$*&$ mailing :) some add please. I considered PERL as a pre-processor to remove the names then pass that file to my program which does other stuff too George >-Original Message- >From: Ian McGowan [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 2:22 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >if speed is the issue, sounds like a job for a compiled lanuage. or >semi-compiled like perl or python. > >is there a unique number sent over by the other system? it might be >quicker to parse the whole thing and keep an exclude file keyed off the >unique number. if it weren't for embedded comma's you could >CONVERT "," >TO @AM, extract the key and write the record out as-is. that would be >quicker than 852 INDEX's :-) > >On Tue, 2004-01-27 at 11:05, George Gallen wrote: >> I can't just check for names, it has to a name with a >specific zip code >> and if the name is fairly common, we also add in part of the >address to >> make sure no one else is weeded out that shouldn't be. >> >> I suppose I could keep two or three arrays, do a specific >lookup in each >> >> saving the position, and if all three positions are >identicle (asuming >> all >> three arrays have the name, address, zip in the same order) then that >> would >> be a matchThanks >> >> George >> >> >-----Original Message----- >> >From: Jeff Schasny [ mailto:[EMAIL PROTECTED] >> <mailto:[EMAIL PROTECTED]> ] >> >Sent: Tuesday, January 27, 2004 1:51 PM >> >To: U2 Users Discussion List >> >Subject: RE: looking for faster Ideas... >> > >> > >> >how about keeping a list of excluded names as a record in a >> >file (or as a >> >flat file in a directory with each name/item/whatever on a >> >line) and reading >> >it into the program as a dynamic array then doing a locate on >> >the string in >> >question. Something like this: >> > >> > >> >READ ALIST FROM AFILE,SOME-ID ELSE STOP >> >X = 0 >> >LOOP >> > X += 1 >> > ASTRING = INLIST >> >UNTIL ASTRING = '' >> > LOCATE ASTRING IN ALIST SETTING POS THEN >> > DO >> > OTHER >> > STUFF >> > END ELSE >> > DONT >> > END >> >REPEAT >> > >> >Of course of you really want speed then sort the list and use >> >a "BY clause >> >in the locate >> > >> >-Original Message- >> >From: George Gallen [ mailto:[EMAIL PROTECTED] >> <mailto:[EMAIL PROTECTED]> ] >> >Sent: Tuesday, January 27, 2004 11:33 AM >> >To: 'Ardent List' >> >Subject: looking for faster Ideas... >> > >> > >> >I can't setup any indexs to speed this up. Basically I'm >> >scanning a CSV file >> >for names to remove >> > and set the flag of KICK=1 to remove it (creating a new CSV >> >file at the >> >same time). >> > >> >Keep in mind the ".." are people's last names, or zip >codes, or part of >> >> >their address, changed >> >them to ".." to protect the unwanting... >> > >> >Right now, I do a series of CASE's ... >> >Now, it's not a major problem as I'm only checking for 20 or >> >so names, but >> >as more and more people >> > request to be removed (and we don't have access to the >> >creation of the >> >list). this could get quite >> > slow over 50 or 60 thousand lines of checking. >> > >> >LIN is one line of the CSV file, the INDEX is checking for a >> >last name & a >> >zip code and sometimes >> > part of the address line. >> > >> >Any Ideas? 
>> > >> >Remember, we can't change the source of the file, it will >> >always be a CSV, >> >being read line by line >> > >> > KICK=0 >> > BEGIN CASE >> > CASE -1 >> > KICK=1 >> > BEGIN CASE >> > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >> >INDEX(LIN,"..",1)#0
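For reference, Ian's CONVERT "," TO @AM idea quoted above looks roughly like this; it only works when no field contains an embedded comma, and the field positions are guesses taken from the sample line George posted earlier, not a known layout:

   * Rough sketch of the CONVERT approach (assumes no embedded commas).
   LIN = '"","jon c smith","1234 anywhere st","","","somecity","SS","12345-1254",""'
   REC = LIN
   CONVERT ',' TO @AM IN REC      ;* each CSV field becomes an attribute
   CONVERT '"' TO '' IN REC       ;* strip the surrounding quotes
   NAME = REC<2>                  ;* jon c smith
   ZIP = REC<8>[1,5]              ;* 12345 - first five characters of the zip field

George's objection still applies, though: the caller's version of the name may not match what is in the field, so even with a clean parse the comparison has to stay fuzzy.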
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... Create a temp file, and populate it with variations of the name in question (upcase and remove spaces). (Storing address information in each record) Then loop through your list, taking the name, and parsing the various combinations of the words. ( John David Doe - JOHNDOE, DOEJOHN JOHNDAVIDDOE, JOHNDOEDAVID) And attempt to read the item from the temp file, if it can read an item then verify the address information. Otherwise check the next item. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of George Gallen Sent: Tuesday, January 27, 2004 12:13 PM To: 'U2 Users Discussion List' Subject: RE: looking for faster Ideas... in rethinking my take on that. That would still be difficult since the arrays would only contain "parts" of the whole fields. making the searching of the arrays very difficult. We can't store the exact entry, since sometimes people will call and say stop sending me things and not give us the name the same way it's in the database we rent. Basically it takes the renting company a couple months to remove the name, but we like to filter it immediately to stop anything from going out before the renting company removes it, and it also will catch it if the renting company replaces it in a couple months later George -Original Message- From: George Gallen [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 27, 2004 2:06 PM To: 'U2 Users Discussion List' Subject: RE: looking for faster Ideas... I can't just check for names, it has to a name with a specific zip code and if the name is fairly common, we also add in part of the address to make sure no one else is weeded out that shouldn't be. I suppose I could keep two or three arrays, do a specific lookup in each saving the position, and if all three positions are identicle (asuming all three arrays have the name, address, zip in the same order) then that would be a matchThanks George >-Original Message- >From: Jeff Schasny [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 1:51 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >how about keeping a list of excluded names as a record in a >file (or as a >flat file in a directory with each name/item/whatever on a >line) and reading >it into the program as a dynamic array then doing a locate on >the string in >question. Something like this: > > >READ ALIST FROM AFILE,SOME-ID ELSE STOP >X = 0 >LOOP > X += 1 > ASTRING = INLIST >UNTIL ASTRING = '' > LOCATE ASTRING IN ALIST SETTING POS THEN > DO > OTHER > STUFF > END ELSE > DONT > END >REPEAT > >Of course of you really want speed then sort the list and use >a "BY clause >in the locate > >-Original Message- ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
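George passed on this approach, but for completeness, here is a rough sketch of what the temp-file lookup could look like in BASIC. OPTOUT.NAMES, the attribute layout and the particular key permutations are all assumptions for illustration, not Mike's actual code:

   * Sketch of the hash-file idea: key = upcased, space-stripped name
   * permutation; the record carries the zip (attribute 1 here, by assumption)
   * used to verify the match.  NAME is the caller-supplied name.
   OPEN 'OPTOUT.NAMES' TO OPT.FILE ELSE STOP 'no OPTOUT.NAMES'
   FULL = UPCASE(TRIM(NAME))              ;* e.g. JOHN DAVID DOE
   FIRST = FIELD(FULL,' ',1)
   LAST = FIELD(FULL,' ',DCOUNT(FULL,' '))
   KEYS = ''
   KEYS<-1> = FIRST:LAST                  ;* JOHNDOE
   KEYS<-1> = LAST:FIRST                  ;* DOEJOHN
   NOSPACE = FULL
   CONVERT ' ' TO '' IN NOSPACE
   KEYS<-1> = NOSPACE                     ;* JOHNDAVIDDOE
   KICK = 0
   NK = DCOUNT(KEYS,@AM)
   FOR K = 1 TO NK UNTIL KICK
      READ OPT.REC FROM OPT.FILE, KEYS<K> THEN
         IF INDEX(LIN,OPT.REC<1>,1) # 0 THEN KICK = 1   ;* verify zip before kicking
      END
   NEXT K

The weakness, as the rest of the thread notes, is that the caller-supplied name has to reproduce one of the stored key permutations exactly.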
RE: looking for faster Ideas...
On Tue, 2004-01-27 at 11:12, George Gallen wrote:
> in rethinking my take on that. That would still be difficult
> since the arrays would only contain "parts" of the whole fields.
> making the searching of the arrays very difficult.

ah, then you can't use grep or INDEX on the unparsed line - you have to parse the line into records first. some kind of unique key (phone number?) would be helpful, but you could always have an exclude file keyed on last name, with an MV list of zip codes:

MCGOWAN  94111]94598]40210

and exclude them in your program:

LOOP
   GOSUB READ.NEXT.LINE
   IF DONE THEN EXIT
   GOSUB PARSE.LINE
   NAME = REC<2>
   ZIP = REC<23>
   READ EXCLUDE.ZIPS FROM EXCLUDE.FILE, NAME THEN
      LOCATE ZIP IN EXCLUDE.ZIPS<1> SETTING POS THEN CONTINUE
   END
   ... MORE PROCESSING ...
REPEAT

--
Ian McGowan <[EMAIL PROTECTED]>

___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
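The other half of that sketch is how the exclude file gets built. Keeping Ian's layout (record id = upcased last name, attribute 1 = multivalued list of zips), adding one opt-out request could look something like the following; the file name and variables are placeholders, not part of Ian's post:

   * Append one opt-out request to the exclude file (layout as in Ian's
   * example: record id = last name, attribute 1 = MV list of zips).
   * LAST.NAME and NEW.ZIP would come from the opt-out request.
   OPEN 'EXCLUDES' TO EXCLUDE.FILE ELSE STOP 'no exclude file'
   LAST.NAME = UPCASE(LAST.NAME)
   READU EXCLUDE.ZIPS FROM EXCLUDE.FILE, LAST.NAME ELSE EXCLUDE.ZIPS = ''
   LOCATE NEW.ZIP IN EXCLUDE.ZIPS<1> SETTING POS ELSE
      EXCLUDE.ZIPS<1,-1> = NEW.ZIP       ;* add the zip only if not already there
   END
   WRITE EXCLUDE.ZIPS ON EXCLUDE.FILE, LAST.NAME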
RE: looking for faster Ideas...
if speed is the issue, sounds like a job for a compiled lanuage. or semi-compiled like perl or python. is there a unique number sent over by the other system? it might be quicker to parse the whole thing and keep an exclude file keyed off the unique number. if it weren't for embedded comma's you could CONVERT "," TO @AM, extract the key and write the record out as-is. that would be quicker than 852 INDEX's :-) On Tue, 2004-01-27 at 11:05, George Gallen wrote: > I can't just check for names, it has to a name with a specific zip code > and if the name is fairly common, we also add in part of the address to > make sure no one else is weeded out that shouldn't be. > > I suppose I could keep two or three arrays, do a specific lookup in each > > saving the position, and if all three positions are identicle (asuming > all > three arrays have the name, address, zip in the same order) then that > would > be a matchThanks > > George > > >-Original Message- > >From: Jeff Schasny [ mailto:[EMAIL PROTECTED] > <mailto:[EMAIL PROTECTED]> ] > >Sent: Tuesday, January 27, 2004 1:51 PM > >To: U2 Users Discussion List > >Subject: RE: looking for faster Ideas... > > > > > >how about keeping a list of excluded names as a record in a > >file (or as a > >flat file in a directory with each name/item/whatever on a > >line) and reading > >it into the program as a dynamic array then doing a locate on > >the string in > >question. Something like this: > > > > > >READ ALIST FROM AFILE,SOME-ID ELSE STOP > >X = 0 > >LOOP > > X += 1 > > ASTRING = INLIST > >UNTIL ASTRING = '' > > LOCATE ASTRING IN ALIST SETTING POS THEN > > DO > > OTHER > > STUFF > > END ELSE > > DONT > > END > >REPEAT > > > >Of course of you really want speed then sort the list and use > >a "BY clause > >in the locate > > > >-Original Message- > >From: George Gallen [ mailto:[EMAIL PROTECTED] > <mailto:[EMAIL PROTECTED]> ] > >Sent: Tuesday, January 27, 2004 11:33 AM > >To: 'Ardent List' > >Subject: looking for faster Ideas... > > > > > >I can't setup any indexs to speed this up. Basically I'm > >scanning a CSV file > >for names to remove > > and set the flag of KICK=1 to remove it (creating a new CSV > >file at the > >same time). > > > >Keep in mind the ".." are people's last names, or zip codes, or part of > > >their address, changed > >them to ".." to protect the unwanting... > > > >Right now, I do a series of CASE's ... > >Now, it's not a major problem as I'm only checking for 20 or > >so names, but > >as more and more people > > request to be removed (and we don't have access to the > >creation of the > >list). this could get quite > > slow over 50 or 60 thousand lines of checking. > > > >LIN is one line of the CSV file, the INDEX is checking for a > >last name & a > >zip code and sometimes > > part of the address line. > > > >Any Ideas? 
> > > >Remember, we can't change the source of the file, it will > >always be a CSV, > >being read line by line > > > > KICK=0 > > BEGIN CASE > > CASE -1 > > KICK=1 > >BEGIN CASE > >CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > >INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > >INDEX(LIN,"..&qu
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... in rethinking my take on that. That would still be difficult since the arrays would only contain "parts" of the whole fields. making the searching of the arrays very difficult. We can't store the exact entry, since sometimes people will call and say stop sending me things and not give us the name the same way it's in the database we rent. Basically it takes the renting company a couple months to remove the name, but we like to filter it immediately to stop anything from going out before the renting company removes it, and it also will catch it if the renting company replaces it in a couple months later George -Original Message-From: George Gallen [mailto:[EMAIL PROTECTED]Sent: Tuesday, January 27, 2004 2:06 PMTo: 'U2 Users Discussion List'Subject: RE: looking for faster Ideas... I can't just check for names, it has to a name with a specific zip code and if the name is fairly common, we also add in part of the address to make sure no one else is weeded out that shouldn't be. I suppose I could keep two or three arrays, do a specific lookup in each saving the position, and if all three positions are identicle (asuming all three arrays have the name, address, zip in the same order) then that would be a matchThanks George >-Original Message- >From: Jeff Schasny [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 1:51 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >how about keeping a list of excluded names as a record in a >file (or as a >flat file in a directory with each name/item/whatever on a >line) and reading >it into the program as a dynamic array then doing a locate on >the string in >question. Something like this: > > >READ ALIST FROM AFILE,SOME-ID ELSE STOP >X = 0 >LOOP > X += 1 > ASTRING = INLIST >UNTIL ASTRING = '' > LOCATE ASTRING IN ALIST SETTING POS THEN > DO > OTHER > STUFF > END ELSE > DONT > END >REPEAT > >Of course of you really want speed then sort the list and use >a "BY clause >in the locate > >-Original Message- ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... thats why the zip code and sometimes the part of the address is used also, the chances of the matching part of the name the zip code, and part of the address and NOT being unique is extremely low. Which is also what complicates this. George >-Original Message- >From: Ian McGowan [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 1:56 PM >To: U2 Users Discussion List >Subject: Re: looking for faster Ideas... > > >do it outside basic using > >$grep -F -f pattern-file csv-file > remove-file > >the pattern file would have the pieces in there. what if you're >excluding something that's not unique? "smith" would exclude >"smithers", "smithy". "psmith (one for the wodehouse fans :-)" etc. > >i do this with some huge syslog files, and fairly big pattern files and >it's pretty darn quick. > >ian > >On Tue, 2004-01-27 at 10:33, George Gallen wrote: >> I can't setup any indexs to speed this up. Basically I'm >scanning a CSV >> file >> for names to remove >> and set the flag of KICK=1 to remove it (creating a new >CSV file at >> the >> same time). >> >> Keep in mind the ".." are people's last names, or zip codes, >or part of >> their address, changed >> them to ".." to protect the unwanting... >> >> Right now, I do a series of CASE's ... >> Now, it's not a major problem as I'm only checking for 20 or >so names, >> but >> as more and more people >> request to be removed (and we don't have access to the >creation of the >> list). this could get quite >> slow over 50 or 60 thousand lines of checking. >> >> LIN is one line of the CSV file, the INDEX is checking for a >last name & >> a >> zip code and sometimes >> part of the address line. >> >> Any Ideas? >> >> Remember, we can't change the source of the file, it will always be a >> CSV, >> being read line by line >> >> KICK=0 >> BEGIN CASE >> CASE -1 >> KICK=1 >> BEGIN CASE >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >> INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >> INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 >> CASE -1 >> KICK=0 >> END CASE >> END CASE >> >> George Gallen >> Senior Programmer/Analyst >> Accounting/Data Division >> [EMAIL PROTECTED] >> ph:856.848.1000 Ext 220 >> >> SLACK Incorporated - An innovative information, education >and management >> company >> http://www.slackinc.com >> >> ___ >> u2-users mailing list >> [EMAIL PROTECTED] >> http://www.oliver.com/mailman/listinfo/u2-users >-- >Ian McGowan <[EMAIL PROTECTED]> > >___ >u2-users mailing list >[EMAIL PROTECTED] >http://www.oliver.com/mailman/listinfo/u2-users > ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
Re: looking for faster Ideas...
or actually, since it seems you want to find everyone *except* the opt-out-er's: $grep -v -F -f pattern-file csv-file > process-file and then work thru the process file. still have a problem with partial matches, though... why not get rid of the csv file and keep a record for each user? then you could simply add an atb, OPTOUT, and SELECT DIRECTMAILLIST WITH OPTOUT = ""? ah, the csv file must be coming from some outside system. On Tue, 2004-01-27 at 10:56, Ian McGowan wrote: > do it outside basic using > > $grep -F -f pattern-file csv-file > remove-file > > the pattern file would have the pieces in there. what if you're > excluding something that's not unique? "smith" would exclude > "smithers", "smithy". "psmith (one for the wodehouse fans :-)" etc. > > i do this with some huge syslog files, and fairly big pattern files and > it's pretty darn quick. > > ian > > On Tue, 2004-01-27 at 10:33, George Gallen wrote: > > I can't setup any indexs to speed this up. Basically I'm scanning a > CSV > > file > > for names to remove > >and set the flag of KICK=1 to remove it (creating a new CSV file at > > the > > same time). > > > > Keep in mind the ".." are people's last names, or zip codes, or part > of > > their address, changed > > them to ".." to protect the unwanting... > > > > Right now, I do a series of CASE's ... > > Now, it's not a major problem as I'm only checking for 20 or so names, > > but > > as more and more people > > request to be removed (and we don't have access to the creation of > the > > list). this could get quite > > slow over 50 or 60 thousand lines of checking. > > > > LIN is one line of the CSV file, the INDEX is checking for a last name > & > > a > > zip code and sometimes > >part of the address line. > > > > Any Ideas? > > > > Remember, we can't change the source of the file, it will always be a > > CSV, > > being read line by line > > > >KICK=0 > >BEGIN CASE > > CASE -1 > > KICK=1 > > BEGIN CASE > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > > INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > > INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > > CASE -1 > >KICK=0 > > END CASE > >END CASE > > > > George Gallen > > Senior Programmer/Analyst > > Accounting/Data Division > > [EMAIL PROTECTED] > > ph:856.848.1000 Ext 220 > > > > SLACK Incorporated - An innovative information, education and > management > > company > > http://www.slackinc.com > > > > ___ > > u2-users mailing list > > [EMAIL PROTECTED] > > http://www.oliver.com/mailman/listinfo/u2-users -- Ian McGowan <[EMAIL PROTECTED]> ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
RE: looking for faster Ideas...
Title: RE: looking for faster Ideas... I can't just check for names, it has to a name with a specific zip code and if the name is fairly common, we also add in part of the address to make sure no one else is weeded out that shouldn't be. I suppose I could keep two or three arrays, do a specific lookup in each saving the position, and if all three positions are identicle (asuming all three arrays have the name, address, zip in the same order) then that would be a matchThanks George >-Original Message- >From: Jeff Schasny [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 1:51 PM >To: U2 Users Discussion List >Subject: RE: looking for faster Ideas... > > >how about keeping a list of excluded names as a record in a >file (or as a >flat file in a directory with each name/item/whatever on a >line) and reading >it into the program as a dynamic array then doing a locate on >the string in >question. Something like this: > > >READ ALIST FROM AFILE,SOME-ID ELSE STOP >X = 0 >LOOP > X += 1 > ASTRING = INLIST >UNTIL ASTRING = '' > LOCATE ASTRING IN ALIST SETTING POS THEN > DO > OTHER > STUFF > END ELSE > DONT > END >REPEAT > >Of course of you really want speed then sort the list and use >a "BY clause >in the locate > >-Original Message- >From: George Gallen [mailto:[EMAIL PROTECTED]] >Sent: Tuesday, January 27, 2004 11:33 AM >To: 'Ardent List' >Subject: looking for faster Ideas... > > >I can't setup any indexs to speed this up. Basically I'm >scanning a CSV file >for names to remove > and set the flag of KICK=1 to remove it (creating a new CSV >file at the >same time). > >Keep in mind the ".." are people's last names, or zip codes, or part of >their address, changed >them to ".." to protect the unwanting... > >Right now, I do a series of CASE's ... >Now, it's not a major problem as I'm only checking for 20 or >so names, but >as more and more people > request to be removed (and we don't have access to the >creation of the >list). this could get quite > slow over 50 or 60 thousand lines of checking. > >LIN is one line of the CSV file, the INDEX is checking for a >last name & a >zip code and sometimes > part of the address line. > >Any Ideas? 
> >Remember, we can't change the source of the file, it will >always be a CSV, >being read line by line > > KICK=0 > BEGIN CASE > CASE -1 > KICK=1 > BEGIN CASE > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND >INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE -1 > KICK=0 > END CASE > END CASE > >George Gallen >Senior Programmer/Analyst >Accounting/Data Division >[EMAIL PROTECTED] >ph:856.848.1000 Ext 220 > >SLACK Incorporated - An innovative information, education and >management >company >http://www.slackinc.com > >___ >u2-users mailing list >[EMAIL PROTECTED] >http://www.oliver.com/mailman/listinfo/u2-users >___ >u2-users mailing list >[EMAIL PROTECTED] >http://www.oliver.com/mailman/listinfo/u2-users > ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
Re: looking for faster Ideas...
do it outside basic using $grep -F -f pattern-file csv-file > remove-file the pattern file would have the pieces in there. what if you're excluding something that's not unique? "smith" would exclude "smithers", "smithy". "psmith (one for the wodehouse fans :-)" etc. i do this with some huge syslog files, and fairly big pattern files and it's pretty darn quick. ian On Tue, 2004-01-27 at 10:33, George Gallen wrote: > I can't setup any indexs to speed this up. Basically I'm scanning a CSV > file > for names to remove >and set the flag of KICK=1 to remove it (creating a new CSV file at > the > same time). > > Keep in mind the ".." are people's last names, or zip codes, or part of > their address, changed > them to ".." to protect the unwanting... > > Right now, I do a series of CASE's ... > Now, it's not a major problem as I'm only checking for 20 or so names, > but > as more and more people > request to be removed (and we don't have access to the creation of the > list). this could get quite > slow over 50 or 60 thousand lines of checking. > > LIN is one line of the CSV file, the INDEX is checking for a last name & > a > zip code and sometimes >part of the address line. > > Any Ideas? > > Remember, we can't change the source of the file, it will always be a > CSV, > being read line by line > >KICK=0 >BEGIN CASE > CASE -1 > KICK=1 >BEGIN CASE > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND > INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 > CASE -1 > KICK=0 >END CASE >END CASE > > George Gallen > Senior Programmer/Analyst > Accounting/Data Division > [EMAIL PROTECTED] > ph:856.848.1000 Ext 220 > > SLACK Incorporated - An innovative information, education and management > company > http://www.slackinc.com > > ___ > u2-users mailing list > [EMAIL PROTECTED] > http://www.oliver.com/mailman/listinfo/u2-users -- Ian McGowan <[EMAIL PROTECTED]> ___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
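If the grep pre-pass is attractive but the rest of the processing has to stay in the BASIC program, one option (my own suggestion, not something Ian or George proposed) is to shell out before the read loop. A rough sketch, assuming UniVerse's SH verb is available and using made-up file names; the partial-match caveat Ian raises above still applies:

   * Hypothetical pre-pass: let grep drop the opt-out lines before the
   * BASIC read loop ever sees them.  PATTERNS, INPUT.CSV and FILTERED.CSV
   * are placeholders; -v -F -f behave as in Ian's examples.
   EXECUTE 'SH -c "grep -v -F -f PATTERNS INPUT.CSV > FILTERED.CSV"'
   OPENSEQ 'FILTERED.CSV' TO CSV.IN ELSE STOP 'cannot open FILTERED.CSV'
   LOOP
      READSEQ LIN FROM CSV.IN ELSE EXIT
      * ... normal processing, no KICK tests needed ...
   REPEAT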
RE: looking for faster Ideas...
how about keeping a list of excluded names as a record in a file (or as a flat file in a directory with each name/item/whatever on a line) and reading it into the program as a dynamic array, then doing a locate on the string in question. Something like this:

READ ALIST FROM AFILE,SOME-ID ELSE STOP
X = 0
LOOP
   X += 1
   ASTRING = INLIST<X>
UNTIL ASTRING = ''
   LOCATE ASTRING IN ALIST SETTING POS THEN
      DO
      OTHER
      STUFF
   END ELSE
      DONT
   END
REPEAT

Of course if you really want speed then sort the list and use a "BY" clause in the locate

-Original Message-
From: George Gallen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 27, 2004 11:33 AM
To: 'Ardent List'
Subject: looking for faster Ideas...

I can't set up any indexes to speed this up. Basically I'm scanning a CSV file for names to remove and setting the flag KICK=1 to remove it (creating a new CSV file at the same time).

Keep in mind the ".." are people's last names, or zip codes, or part of their address; I changed them to ".." to protect the unwanting...

Right now, I do a series of CASE's ... Now, it's not a major problem as I'm only checking for 20 or so names, but as more and more people request to be removed (and we don't have access to the creation of the list), this could get quite slow over 50 or 60 thousand lines of checking.

LIN is one line of the CSV file, the INDEX is checking for a last name & a zip code and sometimes part of the address line.

Any Ideas?

Remember, we can't change the source of the file, it will always be a CSV, being read line by line

   KICK=0
   BEGIN CASE
      CASE -1
         KICK=1
         BEGIN CASE
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE INDEX(LIN,"..",1)#0 AND INDEX(LIN,"..",1)#0
            CASE -1
               KICK=0
         END CASE
   END CASE

George Gallen
Senior Programmer/Analyst
Accounting/Data Division
[EMAIL PROTECTED]
ph:856.848.1000 Ext 220

SLACK Incorporated - An innovative information, education and management company
http://www.slackinc.com

___ u2-users mailing list [EMAIL PROTECTED] http://www.oliver.com/mailman/listinfo/u2-users
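To make Jeff's closing point concrete: if the list is kept in sorted order, LOCATE can be told the ordering with BY 'AL' (ascending, left-justified) and can stop early instead of scanning every entry; on the ELSE side it even reports where a new entry belongs. A small sketch with made-up data, reusing Jeff's variable names:

   * LOCATE on a sorted list: BY 'AL' lets it stop early and, when the value
   * is missing, POS is the position where it should be inserted.
   ALIST = 'ADAMS':@AM:'GALLEN':@AM:'MCGOWAN':@AM:'SCHASNY'
   ASTRING = 'GALLEN'
   LOCATE ASTRING IN ALIST BY 'AL' SETTING POS THEN
      PRINT ASTRING:' found at attribute ':POS
   END ELSE
      PRINT ASTRING:' not found; it would belong at attribute ':POS
   END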