Re: Is Python Suitable for Large Find & Replace Operations?
rbt wrote:
> On Fri, 2005-06-17 at 09:18 -0400, Peter Hansen wrote:
>> rbt wrote:
>>> The script is too long to post in its entirety. In short, I open the
>>> files, do a binary read (in 1MB chunks for ease of memory usage) on them

*ONE* MB? What are the higher-ups constraining you to run this on? My
reference points: (1) An off-the-shelf office desktop PC from
HP/Dell/whoever with 512MB of memory is standard fare. Motherboards with
4 memory slots (supporting up to 4GB) are common. (2) Reading files in
1MB chunks seemed appropriate when I was running an 8MB 486 using Win
3.11.

Are you able to run your script on the network server itself? And what
is the server setup? Windows/Novell/Samba/??? What version of Python are
you running? Have you considered mmap?

>>> before placing that read into a variable and that in turn into a list
>>> that I then apply the following re to

In the following code, which is the "variable" (chunk?) and which is the
"list" to which you apply the re? The only list I see is the confusingly
named "search", which is the *result* of applying re.findall.

>>> ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

Users never use spaces or other punctuation (or no punctuation) instead
of dashes? False positives? E.g.

"""
Dear X, Pls rush us qty 20 heart pacemaker alarm trigger circuits, part
number 123-456-78-9012, as discussed.
"""

>>> like this:
>>>
>>> for chunk in whole_file:
>>>     search = ss.findall(chunk)

You will need the offset to update, so better get used to using
something like finditer now. You may well find that "validate" may want
to check some context on either side, even if it's just one character --
\b may turn out to be inappropriate.

>>>     if search:
>>>         validate(search)
>>
>> This seems so obvious that I hesitate to ask, but is the above really a
>> simplification of the real code, which actually handles the case of SSNs
>> that lie over the boundary between chunks? In other words, what happens
>> if the first chunk has only the first four digits of the SSN, and the
>> rest lies in the second chunk?
>>
>> -Peter
>
> No, that's a good question. As of now, there is nothing to handle the
> scenario that you bring up. I have considered this possibility (rare but
> possible). I have not written a solution for it. It's a very good point
> though.

How rare? What is the average size of file? What is the average
frequency of SSNs per file? What is the maximum file size (you did
mention mostly MS Word and Excel, so they can't be very big)?

N.B. Using mmap just blows this problem away. Using bigger memory chunks
diminishes it.

And what percentage of files contain no SSNs at all? Are you updating
in-situ, or must your script copy the files, changing SSNs as it goes?
Do you have a choice between the two methods? Are you allowed to
constrain the list of input files so that you don't update binaries
(e.g. .exe files)?

> Is that not why proper software is engineered? Anyone can build a go
> cart (write a program), but it takes a team of engineers and much
> testing to build a car, no? Which would you rather be riding in during a
> crash? I wish upper mgt had a better understanding of this ;)

Don't we all so wish? Meanwhile, back in the real world, ...
--
http://mail.python.org/mailman/listinfo/python-list
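The mmap suggestion can be sketched roughly as follows. This is a minimal illustration, not the OP's script: `find_ssns` is a made-up name. Mapping the file gives finditer one contiguous buffer, so nothing can straddle a chunk boundary, and each match carries the offset an in-place update would need. Note that, exactly as warned above, the pattern alone still matches inside longer dash-separated numbers such as part numbers.

```python
import mmap
import re

# Same pattern used elsewhere in this thread, as a bytes regex since we
# scan the raw file contents.
SSN_RE = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

def find_ssns(path):
    """Yield (offset, matched_bytes) for every SSN-shaped run in the file."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; read-only access is enough for
        # the search phase.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            for m in SSN_RE.finditer(buf):
                # No context check here: a substring of a longer
                # dash-separated number (the part-number case) will
                # also be reported, so validation still matters.
                yield m.start(), m.group()
```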
Re: Is Python Suitable for Large Find & Replace Operations?
On Fri, 2005-06-17 at 09:18 -0400, Peter Hansen wrote:
> rbt wrote:
> > The script is too long to post in its entirety. In short, I open the
> > files, do a binary read (in 1MB chunks for ease of memory usage) on them
> > before placing that read into a variable and that in turn into a list
> > that I then apply the following re to
> >
> > ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
> >
> > like this:
> >
> > for chunk in whole_file:
> >     search = ss.findall(chunk)
> >     if search:
> >         validate(search)
>
> This seems so obvious that I hesitate to ask, but is the above really a
> simplification of the real code, which actually handles the case of SSNs
> that lie over the boundary between chunks? In other words, what happens
> if the first chunk has only the first four digits of the SSN, and the
> rest lies in the second chunk?
>
> -Peter

No, that's a good question. As of now, there is nothing to handle the
scenario that you bring up. I have considered this possibility (rare but
possible). I have not written a solution for it. It's a very good point
though.

Is that not why proper software is engineered? Anyone can build a go
cart (write a program), but it takes a team of engineers and much
testing to build a car, no? Which would you rather be riding in during a
crash? I wish upper mgt had a better understanding of this ;)
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
rbt wrote:
> The script is too long to post in its entirety. In short, I open the
> files, do a binary read (in 1MB chunks for ease of memory usage) on them
> before placing that read into a variable and that in turn into a list
> that I then apply the following re to
>
> ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
>
> like this:
>
> for chunk in whole_file:
>     search = ss.findall(chunk)
>     if search:
>         validate(search)

This seems so obvious that I hesitate to ask, but is the above really a
simplification of the real code, which actually handles the case of SSNs
that lie over the boundary between chunks? In other words, what happens
if the first chunk has only the first four digits of the SSN, and the
rest lies in the second chunk?

-Peter
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
On Fri, 2005-06-17 at 12:33 +1000, John Machin wrote:
> OK then, let's ignore the fact that the data is in a collection of Word
> & Excel files, and let's ignore the scale for the moment. Let's assume
> there are only 100 very plain text files to process, and only 1000 SSNs
> in your map, so it doesn't have to be very efficient.
>
> Can you please write a few lines of Python that would define your task
> -- assume you have a line of text from an input file, show how you would
> determine that it needed to be changed, and how you would change it.

The script is too long to post in its entirety. In short, I open the
files, do a binary read (in 1MB chunks for ease of memory usage) on them
before placing that read into a variable and that in turn into a list
that I then apply the following re to

ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

like this:

for chunk in whole_file:
    search = ss.findall(chunk)
    if search:
        validate(search)

The validate function makes sure the string found is indeed in the range
of a legitimate SSN. You may read about this range here:
http://www.ssa.gov/history/ssn/geocard.html

That is as far as I have gotten. And I hope you can tell that I have
placed some small amount of thought into the matter. I've tested the
find a lot and it is rather accurate in finding SSNs in files. I have
not yet started replacing anything. I've only posted here for advice
before beginning.

> >> (4) Under what circumstances will it not be possible to replace *ALL*
> >> the SSNs?
> >
> > I do not understand this question.
>
> Can you determine from the data, without reference to the map, that a
> particular string of characters is an SSN?

See above.

> If so, and it is not in the map, why can it not be *added* to the map
> with a new generated ID?

It is not my responsibility to do this. I do not have that authority
within the organization. Have you never worked for a real-world business
and dealt with office politics and territory ;)

> >> And what is the source of the SSNs in this file??? Have they been
> >> extracted from the data? How?
> >
> > That is irrelevant.
>
> Quite the contrary. If they had been extracted from the data,

They have not. They are generated by a higher authority and then used by
lower authorities such as me. Again though, I think this is irrelevant
to the task at hand... I have a map, I have access to the data and that
is all I need to have, no?

I do appreciate your input though. If you would like to have a further
exchange of ideas, perhaps we should do so off list?
--
http://mail.python.org/mailman/listinfo/python-list
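The chunked loop described above can be made boundary-safe by carrying a short tail of the previous chunk into the next scan, so an SSN split across two reads is still seen exactly once. This is a hedged sketch, not the OP's code: `scan_chunks` and `TAIL` are invented names, and `TAIL` merely needs to exceed the 11-character match length (and, ideally, any digit run you care about).

```python
import re

SSN_RE = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')
TAIL = 16  # comfortably longer than one dashed SSN (11 bytes)

def scan_chunks(stream, chunk_size=1 << 20):
    """Yield each SSN-shaped match exactly once, even when one straddles
    a chunk boundary, by re-scanning a short tail of the previous chunk."""
    tail = b''
    while True:
        chunk = stream.read(chunk_size)
        window = tail + chunk
        at_eof = not chunk
        for m in SSN_RE.finditer(window):
            if m.end() < len(tail):
                continue  # wholly inside the tail: reported last round
            if m.end() == len(window) and not at_eof:
                continue  # touches the seam: next round re-checks it with
                          # the following context (avoids a bogus \b hit)
            yield m.group()
        if at_eof:
            return
        tail = window[-TAIL:]
```

Deferring matches that end exactly at the window edge also avoids the subtler bug where `\b` matches at end-of-chunk even though the next chunk starts with another digit.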
Re: Is Python Suitable for Large Find & Replace Operations?
rbt wrote: > On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote: > >>rbt wrote: >> >>>Here's the scenario: >>> >>>You have many hundred gigabytes of data... possible even a terabyte or >>>two. Within this data, you have private, sensitive information (US >>>social security numbers) about your company's clients. Your company has >>>generated its own unique ID numbers to replace the social security numbers. >>> >>>Now, management would like the IT guys to go thru the old data and >>>replace as many SSNs with the new ID numbers as possible. >> >>This question is grossly OT; it's nothing at all to do with Python. >>However >> >>(0) Is this homework? > > > No, it is not. > > >>(1) What to do with an SSN that's not in the map? > > > Leave it be. > > >>(2) How will a user of the system tell the difference between "new ID >>numbers" and SSNs? > > > We have documentation. The two sets of numbers (SSNs and new_ids) are > exclusive of each other. > > >>(3) Has the company really been using the SSN as a customer ID instead >>of an account number, or have they been merely recording the SSN as a >>data item? Will the "new ID numbers" be used in communication with the >>clients? Will they be advised of the new numbers? How will you handle >>the inevitable cases where the advice doesn't get through? > > > My task is purely technical. OK then, let's ignore the fact that the data is in a collection of Word & Excel files, and let's ignore the scale for the moment. Let's assume there are only 100 very plain text files to process, and only 1000 SSNs in your map, so it doesn't have to be very efficient. Can you please write a few lines of Python that would define your task -- assume you have a line of text from an input file, show how you would determine that it needed to be changed, and how you would change it. > > >>(4) Under what circumstances will it not be possible to replace *ALL* >>the SSNs? > > > I do not understand this question. 
Can you determine from the data, without reference to the map, that a particular string of characters is an SSN? If so, and it is not in the map, why can it not be *added* to the map with a new generated ID? > > >>(5) For how long can the data be off-line while it's being transformed? > > > The data is on file servers that are unused on weekends and nights. > > >> >>>You have a tab >>>delimited txt file that maps the SSNs to the new ID numbers. There are >>>500,000 of these number pairs. >> >>And what is the source of the SSNs in this file??? Have they been >>extracted from the data? How? > > > That is irrelevant. Quite the contrary. If they had been extracted from the data, it would mean that someone actually might have a vague clue about how those SSNs were represented in the data and might be able to tell you and you might be able to tell us so that we could help you. > > >>>What is the most efficient way to >>>approach this? I have done small-scale find and replace programs before, >>>but the scale of this is larger than what I'm accustomed to. >>> >>>Any suggestions on how to approach this are much appreciated. >> >>A sensible answer will depend on how the data is structured: >> >>1. If it's in a database with tables some of which have a column for >>SSN, then there's a solution involving SQL. >> >>2. If it's in semi-free-text files where the SSNs are marked somehow: >> >>---client header--- >>surname: Doe first: John initial: Q SSN:123456789 blah blah >>or >>123456789 >> >>then there's another solution which involves finding the markers ... >> >>3. If it's really free text, like >>""" >>File note: Today John Q. Doe telephoned to advise that his Social >>Security # is 123456789 not 987654321 (which is his wife's) and the soc >>sec numbers of his kids Bob & Carol are >>""" >>then you might be in some difficulty ... 
>> google("TREC")
>>
>> AND however you do it, you need to be very aware of the possibility
>> (especially with really free text) of changing some string of digits
>> that's NOT an SSN.
>
> That's possible, but I think not probably.

You "think not probably" -- based on what? Pious hope? What the customer
has told you? Inspection of the data?
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
On Tue, 2005-06-14 at 19:51 +0200, Gilles Lenfant wrote: > rbt a écrit : > > Here's the scenario: > > > > You have many hundred gigabytes of data... possible even a terabyte or > > two. Within this data, you have private, sensitive information (US > > social security numbers) about your company's clients. Your company has > > generated its own unique ID numbers to replace the social security numbers. > > > > Now, management would like the IT guys to go thru the old data and > > replace as many SSNs with the new ID numbers as possible. You have a tab > > delimited txt file that maps the SSNs to the new ID numbers. There are > > 500,000 of these number pairs. What is the most efficient way to > > approach this? I have done small-scale find and replace programs before, > > but the scale of this is larger than what I'm accustomed to. > > > > Any suggestions on how to approach this are much appreciated. > > Are this huge amount of data to rearch/replace stored in an RDBMS or in > flat file(s) with markup (XML, CSV, ...) ? > > -- > Gilles The data is in files. Mostly Word documents and excel spreadsheets. The SSN map I have is a plain text file that has a format like this: ssn-xx- new-id- ssn-xx- new-id- etc. There are a bit more than 500K of these pairs. Thank you, rbt -- http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
On Tue, 2005-06-14 at 19:51 +0200, Gilles Lenfant wrote: > rbt a écrit : > > Here's the scenario: > > > > You have many hundred gigabytes of data... possible even a terabyte or > > two. Within this data, you have private, sensitive information (US > > social security numbers) about your company's clients. Your company has > > generated its own unique ID numbers to replace the social security numbers. > > > > Now, management would like the IT guys to go thru the old data and > > replace as many SSNs with the new ID numbers as possible. You have a tab > > delimited txt file that maps the SSNs to the new ID numbers. There are > > 500,000 of these number pairs. What is the most efficient way to > > approach this? I have done small-scale find and replace programs before, > > but the scale of this is larger than what I'm accustomed to. > > > > Any suggestions on how to approach this are much appreciated. > > Are this huge amount of data to rearch/replace stored in an RDBMS or in > flat file(s) with markup (XML, CSV, ...) ? > > -- > Gilles I apologize that it has taken me so long to respond. I had a hdd crash which I am in the process of recovering from. If emails to [EMAIL PROTECTED] bounced, that is the reason why. The data is in files. Mostly Word documents and excel spreadsheets. The SSN map I have is a plain text file that has a format like this: ssn-xx- new-id- ssn-xx- new-id- etc. There are a bit more than 500K of these pairs. Thank you, rbt -- http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote: > rbt wrote: > > Here's the scenario: > > > > You have many hundred gigabytes of data... possible even a terabyte or > > two. Within this data, you have private, sensitive information (US > > social security numbers) about your company's clients. Your company has > > generated its own unique ID numbers to replace the social security numbers. > > > > Now, management would like the IT guys to go thru the old data and > > replace as many SSNs with the new ID numbers as possible. > > This question is grossly OT; it's nothing at all to do with Python. > However > > (0) Is this homework? No, it is not. > > (1) What to do with an SSN that's not in the map? Leave it be. > > (2) How will a user of the system tell the difference between "new ID > numbers" and SSNs? We have documentation. The two sets of numbers (SSNs and new_ids) are exclusive of each other. > > (3) Has the company really been using the SSN as a customer ID instead > of an account number, or have they been merely recording the SSN as a > data item? Will the "new ID numbers" be used in communication with the > clients? Will they be advised of the new numbers? How will you handle > the inevitable cases where the advice doesn't get through? My task is purely technical. > > (4) Under what circumstances will it not be possible to replace *ALL* > the SSNs? I do not understand this question. > > (5) For how long can the data be off-line while it's being transformed? The data is on file servers that are unused on weekends and nights. > > > > You have a tab > > delimited txt file that maps the SSNs to the new ID numbers. There are > > 500,000 of these number pairs. > > And what is the source of the SSNs in this file??? Have they been > extracted from the data? How? That is irrelevant. > > > What is the most efficient way to > > approach this? I have done small-scale find and replace programs before, > > but the scale of this is larger than what I'm accustomed to. 
> > > > Any suggestions on how to approach this are much appreciated. > > A sensible answer will depend on how the data is structured: > > 1. If it's in a database with tables some of which have a column for > SSN, then there's a solution involving SQL. > > 2. If it's in semi-free-text files where the SSNs are marked somehow: > > ---client header--- > surname: Doe first: John initial: Q SSN:123456789 blah blah > or > 123456789 > > then there's another solution which involves finding the markers ... > > 3. If it's really free text, like > """ > File note: Today John Q. Doe telephoned to advise that his Social > Security # is 123456789 not 987654321 (which is his wife's) and the soc > sec numbers of his kids Bob & Carol are > """ > then you might be in some difficulty ... google("TREC") > > > AND however you do it, you need to be very aware of the possibility > (especially with really free text) of changing some string of digits > that's NOT an SSN. That's possible, but I think not probably. -- http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
rbt wrote:
> Here's the scenario:
>
> You have many hundred gigabytes of data... possibly even a terabyte or
> two. Within this data, you have private, sensitive information (US
> social security numbers) about your company's clients. Your company has
> generated its own unique ID numbers to replace the social security
> numbers.
>
> Now, management would like the IT guys to go thru the old data and
> replace as many SSNs with the new ID numbers as possible. You have a tab
> delimited txt file that maps the SSNs to the new ID numbers. There are
> 500,000 of these number pairs. What is the most efficient way to
> approach this? I have done small-scale find and replace programs before,
> but the scale of this is larger than what I'm accustomed to.
>
> Any suggestions on how to approach this are much appreciated.

Is this huge amount of data to search/replace stored in an RDBMS or in
flat file(s) with markup (XML, CSV, ...)?

--
Gilles
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
Brian wrote:
> Hi Rbt,
>
> To give an example of processing a lot of data, I used Python to read
> and process every word in a single text file that contained the entire
> King James Bible version. It processed it within about one second --
> split the words, etc. Worked quite well.
>
> Hope this helps,

Hardly "big" by the OP's standards at only 4.5 MB (and roughly 790,000
words, if anyone's interested). The problem with processing terabytes is
frequently the need to accommodate the data with techniques that allow
the virtual memory to avoid becoming swamped. The last thing you need to
do under such circumstances is create an in-memory copy of the entire
data set, since on most real computers this will inflict swapping
behavior.

regards
 Steve
--
Steve Holden       +1 703 861 4237  +1 800 494 3119
Holden Web LLC             http://www.holdenweb.com/
Python Web Programming     http://pydish.holdenweb.com/
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
Hi Rbt,

To give an example of processing a lot of data, I used Python to read
and process every word in a single text file that contained the entire
King James Bible version. It processed it within about one second --
split the words, etc. Worked quite well.

Hope this helps,
Brian

--- rbt wrote:
> Here's the scenario:
>
> You have many hundred gigabytes of data... possibly even a terabyte or
> two. Within this data, you have private, sensitive information (US
> social security numbers) about your company's clients. Your company has
> generated its own unique ID numbers to replace the social security
> numbers.
>
> Now, management would like the IT guys to go thru the old data and
> replace as many SSNs with the new ID numbers as possible. You have a tab
> delimited txt file that maps the SSNs to the new ID numbers. There are
> 500,000 of these number pairs. What is the most efficient way to
> approach this? I have done small-scale find and replace programs before,
> but the scale of this is larger than what I'm accustomed to.
>
> Any suggestions on how to approach this are much appreciated.
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
rbt wrote:
> Here's the scenario:
>
> You have many hundred gigabytes of data... possibly even a terabyte or
> two. Within this data, you have private, sensitive information (US
> social security numbers) about your company's clients. Your company has
> generated its own unique ID numbers to replace the social security
> numbers.
>
> Now, management would like the IT guys to go thru the old data and
> replace as many SSNs with the new ID numbers as possible.

This question is grossly OT; it's nothing at all to do with Python.
However:

(0) Is this homework?

(1) What to do with an SSN that's not in the map?

(2) How will a user of the system tell the difference between "new ID
numbers" and SSNs?

(3) Has the company really been using the SSN as a customer ID instead
of an account number, or have they been merely recording the SSN as a
data item? Will the "new ID numbers" be used in communication with the
clients? Will they be advised of the new numbers? How will you handle
the inevitable cases where the advice doesn't get through?

(4) Under what circumstances will it not be possible to replace *ALL*
the SSNs?

(5) For how long can the data be off-line while it's being transformed?

> You have a tab
> delimited txt file that maps the SSNs to the new ID numbers. There are
> 500,000 of these number pairs.

And what is the source of the SSNs in this file??? Have they been
extracted from the data? How?

> What is the most efficient way to
> approach this? I have done small-scale find and replace programs before,
> but the scale of this is larger than what I'm accustomed to.
>
> Any suggestions on how to approach this are much appreciated.

A sensible answer will depend on how the data is structured:

1. If it's in a database with tables some of which have a column for
SSN, then there's a solution involving SQL.

2. If it's in semi-free-text files where the SSNs are marked somehow:

---client header---
surname: Doe first: John initial: Q SSN:123456789 blah blah
or
123456789

then there's another solution which involves finding the markers ...

3. If it's really free text, like

"""
File note: Today John Q. Doe telephoned to advise that his Social
Security # is 123456789 not 987654321 (which is his wife's) and the soc
sec numbers of his kids Bob & Carol are
"""

then you might be in some difficulty ... google("TREC")

AND however you do it, you need to be very aware of the possibility
(especially with really free text) of changing some string of digits
that's NOT an SSN.

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
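Case 1 above is worth spelling out, since it is by far the easiest: with the data in a database and the SSN-to-ID map loaded into its own table, the whole job is one correlated UPDATE. A small sqlite3 sketch, with invented table and column names, purely for illustration:

```python
import sqlite3

# Hypothetical schema: a clients table holding SSNs, plus the map file
# loaded into its own table. Names here are made up for the example.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE clients (name TEXT, ssn TEXT);
    CREATE TABLE ssn_map (ssn TEXT PRIMARY KEY, new_id TEXT);
""")
con.executemany('INSERT INTO clients VALUES (?, ?)',
                [('Doe', '123-45-6789'), ('Roe', '999-99-9999')])
con.executemany('INSERT INTO ssn_map VALUES (?, ?)',
                [('123-45-6789', 'ID-0001')])

# Replace only SSNs that appear in the map; unmapped ones are left
# alone, matching the policy stated elsewhere in the thread.
con.execute("""
    UPDATE clients
    SET ssn = (SELECT new_id FROM ssn_map
               WHERE ssn_map.ssn = clients.ssn)
    WHERE ssn IN (SELECT ssn FROM ssn_map)
""")
rows = dict(con.execute('SELECT name, ssn FROM clients'))
```

In a real RDBMS the 500,000-pair map would be bulk-loaded and indexed, and the UPDATE would run per table that has an SSN column.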
Re: Is Python Suitable for Large Find & Replace Operations?
rbt wrote:
> Here's the scenario:
>
> You have many hundred gigabytes of data... possibly even a terabyte or
> two. Within this data, you have private, sensitive information (US
> social security numbers) about your company's clients. Your company has
> generated its own unique ID numbers to replace the social security
> numbers.
>
> Now, management would like the IT guys to go thru the old data and
> replace as many SSNs with the new ID numbers as possible. You have a tab
> delimited txt file that maps the SSNs to the new ID numbers. There are
> 500,000 of these number pairs. What is the most efficient way to
> approach this? I have done small-scale find and replace programs before,
> but the scale of this is larger than what I'm accustomed to.
>
> Any suggestions on how to approach this are much appreciated.

I may say something stupid (that would not be the first time), but why
don't you just put it all in a good RDBMS and go with a few SQL queries?
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
Robert Kern wrote:
> I'm just tossing this out. I don't really have much experience with the
> algorithm, but when searching for multiple targets, an Aho-Corasick
> automaton might be appropriate.
>
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

More puttering around the web suggests that pytst may be a faster/more
scalable implementation.

http://www.lehuen.com/nicolas/index.php/Pytst

Try them both. If you meet success with one of them, come back and tell
us about it.

--
Robert Kern
[EMAIL PROTECTED]

"In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die."
  -- Richard Harter
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
rbt <[EMAIL PROTECTED]> writes: > Now, management would like the IT guys to go thru the old data and > replace as many SSNs with the new ID numbers as possible. You have a > tab delimited txt file that maps the SSNs to the new ID numbers. There > are 500,000 of these number pairs. What is the most efficient way to > approach this? I have done small-scale find and replace programs > before, but the scale of this is larger than what I'm accustomed to. Just use an ordinary python dict for the map, on a system with enough ram (maybe 100 bytes or so for each pair, so the map would be 50 MB). Then it's simply a matter of scanning through the input files to find SSN's and look them up in the dict. -- http://mail.python.org/mailman/listinfo/python-list
Re: Is Python Suitable for Large Find & Replace Operations?
rbt <[EMAIL PROTECTED]> wrote:
> You have many hundred gigabytes of data... possibly even a terabyte or
> two. Within this data, you have private, sensitive information (US
> social security numbers) about your company's clients. Your company has
> generated its own unique ID numbers to replace the social security
> numbers.
>
> Now, management would like the IT guys to go thru the old data and
> replace as many SSNs with the new ID numbers as possible.

The first thing I would do is a quick sanity check. Write a simple
Python script which copies its input to its output with minimal
processing. Something along the lines of:

for line in sys.stdin:
    words = line.split()
    print words

Now, run this on a gig of your data on your system and time it. Multiply
the timing by 1000 (assuming a terabyte of real data), and now you've
got a rough lower bound on how long the real job will take. If you find
this will take weeks of processing time, you might want to re-think the
plan. Better to discover that now than a few weeks into the project :-)

> You have a tab
> delimited txt file that maps the SSNs to the new ID numbers. There are
> 500,000 of these number pairs. What is the most efficient way to
> approach this? I have done small-scale find and replace programs before,
> but the scale of this is larger than what I'm accustomed to.

The obvious first thing is to read the SSN/ID map file and build a
dictionary where the keys are the SSNs and the values are the IDs. The
next step depends on your data. Is it in text files? An SQL database?
Some other format? One way or another you'll have to read it in, do
something along the lines of "newSSN = ssnMap[oldSSN]", and write it
back out.
--
http://mail.python.org/mailman/listinfo/python-list
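The two steps described above -- load the tab-delimited map into a dict, then substitute on the fly -- can be sketched like this. The helper names are hypothetical; the map-file layout assumed is the "SSN, tab, new ID, one pair per line" format the OP described:

```python
import re

# Same SSN pattern used elsewhere in this thread.
SSN_RE = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def load_map(path):
    """Read a tab-delimited 'SSN<TAB>new-ID' file into a dict."""
    ssn_map = {}
    with open(path) as f:
        for line in f:
            ssn, new_id = line.rstrip('\n').split('\t')
            ssn_map[ssn] = new_id
    return ssn_map

def replace_ssns(text, ssn_map):
    """Replace mapped SSNs; unmapped ones are left alone, per the thread."""
    return SSN_RE.sub(lambda m: ssn_map.get(m.group(), m.group()), text)
```

At 500,000 pairs the dict costs on the order of tens of megabytes, in line with the estimate above, so keeping the whole map in memory is entirely reasonable.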
Re: Is Python Suitable for Large Find & Replace Operations?
rbt wrote:
> Here's the scenario:
>
> You have many hundred gigabytes of data... possibly even a terabyte or
> two. Within this data, you have private, sensitive information (US
> social security numbers) about your company's clients. Your company has
> generated its own unique ID numbers to replace the social security
> numbers.
>
> Now, management would like the IT guys to go thru the old data and
> replace as many SSNs with the new ID numbers as possible. You have a tab
> delimited txt file that maps the SSNs to the new ID numbers. There are
> 500,000 of these number pairs. What is the most efficient way to
> approach this? I have done small-scale find and replace programs before,
> but the scale of this is larger than what I'm accustomed to.
>
> Any suggestions on how to approach this are much appreciated.

I'm just tossing this out. I don't really have much experience with the
algorithm, but when searching for multiple targets, an Aho-Corasick
automaton might be appropriate.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Good luck!

--
Robert Kern
[EMAIL PROTECTED]

"In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die."
  -- Richard Harter
--
http://mail.python.org/mailman/listinfo/python-list
Is Python Suitable for Large Find & Replace Operations?
Here's the scenario:

You have many hundred gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security
numbers.

Now, management would like the IT guys to go thru the old data and
replace as many SSNs with the new ID numbers as possible. You have a tab
delimited txt file that maps the SSNs to the new ID numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find and replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.
--
http://mail.python.org/mailman/listinfo/python-list