Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-17 Thread John Machin
rbt wrote:
> On Fri, 2005-06-17 at 09:18 -0400, Peter Hansen wrote:
> 
>>rbt wrote:
>>
>>>The script is too long to post in its entirety. In short, I open the
>>>files, do a binary read (in 1MB chunks for ease of memory usage) on them

*ONE* MB? What are the higher-ups constraining you to run this on? My 
reference points: (1) An off-the-shelf office desktop PC from 
HP/Dell/whoever with 512MB of memory is standard fare. Motherboards with 
4 memory slots (supporting up to 4GB) are common. (2) Reading files in 
1MB chunks seemed appropriate when I was running an 8MB 486 using Win 3.11.

Are you able to run your script on the network server itself? And what 
is the server setup? Windows/Novell/Samba/??? What version of Python are 
you running? Have you considered mmap?
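
[For concreteness, a minimal sketch of the mmap idea in modern Python; the name `scan` and the bytes pattern are illustrative, not the OP's code. With mmap the regex sees the whole file, so chunking and its boundary problem disappear.]

```python
import mmap
import re

# Bytes pattern, since the files are opened in binary mode.
ssn_re = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

def scan(path):
    # The OS pages the file in as the regex walks it; nothing close
    # to the whole file needs to be resident in memory at once.
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [(m.start(), m.group()) for m in ssn_re.finditer(mm)]
```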

>>>before placing that read into a variable and that in turn into a list
>>>that I then apply the following re to

In the following code, which is "variable" (chunk?) and which is the 
"list" to which you apply the re? The only list I see is the confusingly 
named "search" which is the *result* of applying re.findall.

>>>
>>>ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

Users never use spaces or other punctuation (or no punctuation) instead 
of dashes?

False positives? E.g. """
Dear X, Pls rush us qty 20 heart pacemaker alarm trigger circuits, part 
number 123-456-78-9012, as discussed.
"""

>>>
>>>like this:
>>>
>>>for chunk in whole_file:
>>>search = ss.findall(chunk)

You will need the offset to update, so better get used to using 
something like finditer now. You may well find that "validate" may want 
to check some context on either side, even if it's just one character -- 
\b may turn out to be inappropriate.
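
[A sketch of that suggestion; the one-character context test shown here, rejecting an adjacent letter, digit or dash, is just one plausible choice.]

```python
import re

ssn_re = re.compile(r'\d{3}-\d{2}-\d{4}')   # no \b: context checked by hand

def candidates(chunk):
    # finditer keeps the offsets that findall throws away; the offsets
    # are exactly what an in-place update will need later.
    for m in ssn_re.finditer(chunk):
        before = chunk[m.start() - 1] if m.start() > 0 else ' '
        after = chunk[m.end()] if m.end() < len(chunk) else ' '
        # Stricter than \b: a neighbouring dash (e.g. inside a part
        # number like 123-456-78-9012) disqualifies the match.
        if not (before.isalnum() or before == '-') and \
           not (after.isalnum() or after == '-'):
            yield m.start(), m.group()
```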

>>>if search:
>>>validate(search)
>>
>>This seems so obvious that I hesitate to ask, but is the above really a 
>>simplification of the real code, which actually handles the case of SSNs 
>>that lie over the boundary between chunks? In other words, what happens 
>>if the first chunk has only the first four digits of the SSN, and the 
>>rest lies in the second chunk?
>>
>>-Peter
> 
> 
> No, that's a good question. As of now, there is nothing to handle the
> scenario that you bring up. I have considered this possibility (rare but
> possible). I have not written a solution for it. It's a very good point
> though.

How rare? What is the average size of file? What is the average 
frequency of SSNs per file? What is the maximum file size (you did 
mention mostly MS Word and Excel, so can't be very big)?

N.B. Using mmap just blows this problem away. Using bigger memory chunks 
diminishes it.

And what percentage of files contain no SSNs at all?

Are you updating in-situ, or must your script copy the files, changing 
SSNs as it goes? Do you have a choice between the two methods?

Are you allowed to constrain the list of input files so that you don't 
update binaries (e.g.) exe files?

> 
> Is that not why proper software is engineered? Anyone can build a
> go-kart (write a program), but it takes a team of engineers and much
> testing to build a car, no? Which would you rather be riding in during a
> crash? I wish upper mgt had a better understanding of this ;)
> 

Don't we all so wish? Meanwhile, back in the real world, ...
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-17 Thread rbt
On Fri, 2005-06-17 at 09:18 -0400, Peter Hansen wrote:
> rbt wrote:
> > The script is too long to post in its entirety. In short, I open the
> > files, do a binary read (in 1MB chunks for ease of memory usage) on them
> > before placing that read into a variable and that in turn into a list
> > that I then apply the following re to
> > 
> > ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
> > 
> > like this:
> > 
> > for chunk in whole_file:
> > search = ss.findall(chunk)
> > if search:
> > validate(search)
> 
> This seems so obvious that I hesitate to ask, but is the above really a 
> simplification of the real code, which actually handles the case of SSNs 
> that lie over the boundary between chunks?  In other words, what happens 
> if the first chunk has only the first four digits of the SSN, and the 
> rest lies in the second chunk?
> 
> -Peter

No, that's a good question. As of now, there is nothing to handle the
scenario that you bring up. I have considered this possibility (rare but
possible). I have not written a solution for it. It's a very good point
though.

Is that not why proper software is engineered? Anyone can build a
go-kart (write a program), but it takes a team of engineers and much
testing to build a car, no? Which would you rather be riding in during a
crash? I wish upper mgt had a better understanding of this ;)

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-17 Thread Peter Hansen
rbt wrote:
> The script is too long to post in its entirety. In short, I open the
> files, do a binary read (in 1MB chunks for ease of memory usage) on them
> before placing that read into a variable and that in turn into a list
> that I then apply the following re to
> 
> ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
> 
> like this:
> 
> for chunk in whole_file:
> search = ss.findall(chunk)
> if search:
> validate(search)

This seems so obvious that I hesitate to ask, but is the above really a 
simplification of the real code, which actually handles the case of SSNs 
that lie over the boundary between chunks?  In other words, what happens 
if the first chunk has only the first four digits of the SSN, and the 
rest lies in the second chunk?
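
[If the chunked reads are kept, one conventional fix is to carry a short tail of each chunk into the next read, so a number straddling a boundary is still seen whole. A sketch with illustrative names; the 1 MB chunk size mirrors the script described above.]

```python
import re

ssn_re = re.compile(rb'\d{3}-\d{2}-\d{4}')
CHUNK = 1 << 20      # 1 MB reads, as in the script described above
OVERLAP = 10         # a match is 11 bytes, so a 10-byte tail suffices

def find_ssns(path):
    # Prepend the tail of the previous chunk to each read.  Because the
    # overlap (10) is one byte shorter than a match (11), no match can
    # sit entirely inside the carried tail, so nothing is reported twice.
    hits = []
    offset = 0       # bytes consumed before the current chunk
    tail = b''
    with open(path, 'rb') as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            buf = tail + data
            for m in ssn_re.finditer(buf):
                hits.append((offset - len(tail) + m.start(), m.group()))
            tail = buf[-OVERLAP:]
            offset += len(data)
    return hits
```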

-Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-17 Thread rbt
On Fri, 2005-06-17 at 12:33 +1000, John Machin wrote:

> OK then, let's ignore the fact that the data is in a collection of Word 
> & Excel files, and let's ignore the scale for the moment. Let's assume 
> there are only 100 very plain text files to process, and only 1000 SSNs 
> in your map, so it doesn't have to be very efficient.
> 
> Can you please write a few lines of Python that would define your task 
> -- assume you have a line of text from an input file, show how you would 
> determine that it needed to be changed, and how you would change it.

The script is too long to post in its entirety. In short, I open the
files, do a binary read (in 1MB chunks for ease of memory usage) on them
before placing that read into a variable and that in turn into a list
that I then apply the following re to

ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

like this:

for chunk in whole_file:
search = ss.findall(chunk)
if search:
validate(search)

The validate function makes sure the string found is indeed in the range
of a legitimate SSN. You may read about this range here:

http://www.ssa.gov/history/ssn/geocard.html
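
[For what it's worth, a few of the structural rules behind that page can be checked directly. `plausible_ssn` is a hypothetical name and this is far from a complete validation; the actual issued area-number ranges are on the page above.]

```python
def plausible_ssn(s):
    # Structural rules from SSA's numbering scheme: area 000, 666 and
    # 900-999 are never issued; group 00 and serial 0000 are never
    # issued.  String comparison is safe on fixed-width digit fields.
    area, group, serial = s.split('-')
    if area in ('000', '666') or area >= '900':
        return False
    if group == '00' or serial == '0000':
        return False
    return True
```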

That is as far as I have gotten. And I hope you can tell that I have
placed some small amount of thought into the matter. I've tested the
find a lot and it is rather accurate in finding SSNs in files. I have
not yet started replacing anything. I've only posted here for advice
before beginning.

> 
> > 
> > 
> >>(4) Under what circumstances will it not be possible to replace *ALL* 
> >>the SSNs?
> > 
> > 
> > I do not understand this question.
> 
> Can you determine from the data, without reference to the map, that a 
> particular string of characters is an SSN?

See above.

> 
> If so, and it is not in the map, why can it not be *added* to the map 
> with a new generated ID?

It is not my responsibility to do this. I do not have that authority
within the organization. Have you never worked for a real-world business
and dealt with office politics and territory? ;)

> >>And what is the source of the SSNs in this file??? Have they been 
> >>extracted from the data? How?
> > 
> > 
> > That is irrelevant.
> 
> Quite the contrary. If they had been extracted from the data,

They have not. They are generated by a higher authority and then used by
lower authorities such as me. 

Again though, I think this is irrelevant to the task at hand... I have a
map, I have access to the data and that is all I need to have, no? I do
appreciate your input though. If you would like to have a further
exchange of ideas, perhaps we should do so off list?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-16 Thread John Machin
rbt wrote:
> On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote:
> 
>>rbt wrote:
>>
>>>Here's the scenario:
>>>
>>>You have many hundred gigabytes of data... possible even a terabyte or 
>>>two. Within this data, you have private, sensitive information (US 
>>>social security numbers) about your company's clients. Your company has 
>>>generated its own unique ID numbers to replace the social security numbers.
>>>
>>>Now, management would like the IT guys to go thru the old data and 
>>>replace as many SSNs with the new ID numbers as possible.
>>
>>This question is grossly OT; it's nothing at all to do with Python. 
>>However 
>>
>>(0) Is this homework?
> 
> 
> No, it is not.
> 
> 
>>(1) What to do with an SSN that's not in the map?
> 
> 
> Leave it be.
> 
> 
>>(2) How will a user of the system tell the difference between "new ID 
>>numbers" and SSNs?
> 
> 
> We have documentation. The two sets of numbers (SSNs and new_ids) are
> exclusive of each other.
> 
> 
>>(3) Has the company really been using the SSN as a customer ID instead 
>>of an account number, or have they been merely recording the SSN as a 
>>data item? Will the "new ID numbers" be used in communication with the 
>>clients? Will they be advised of the new numbers? How will you handle 
>>the inevitable cases where the advice doesn't get through?
> 
> 
> My task is purely technical.

OK then, let's ignore the fact that the data is in a collection of Word 
& Excel files, and let's ignore the scale for the moment. Let's assume 
there are only 100 very plain text files to process, and only 1000 SSNs 
in your map, so it doesn't have to be very efficient.

Can you please write a few lines of Python that would define your task 
-- assume you have a line of text from an input file, show how you would 
determine that it needed to be changed, and how you would change it.

> 
> 
>>(4) Under what circumstances will it not be possible to replace *ALL* 
>>the SSNs?
> 
> 
> I do not understand this question.

Can you determine from the data, without reference to the map, that a 
particular string of characters is an SSN?

If so, and it is not in the map, why can it not be *added* to the map 
with a new generated ID?

> 
> 
>>(5) For how long can the data be off-line while it's being transformed?
> 
> 
> The data is on file servers that are unused on weekends and nights.
> 
> 
>>
>>>You have a tab 
>>>delimited txt file that maps the SSNs to the new ID numbers. There are 
>>>500,000 of these number pairs.
>>
>>And what is the source of the SSNs in this file??? Have they been 
>>extracted from the data? How?
> 
> 
> That is irrelevant.

Quite the contrary. If they had been extracted from the data, it would 
mean that someone actually might have a vague clue about how those SSNs 
were represented in the data and might be able to tell you and you might 
be able to tell us so that we could help you.

> 
> 
>>>What is the most efficient way  to 
>>>approach this? I have done small-scale find and replace programs before, 
>>>but the scale of this is larger than what I'm accustomed to.
>>>
>>>Any suggestions on how to approach this are much appreciated.
>>
>>A sensible answer will depend on how the data is structured:
>>
>>1. If it's in a database with tables some of which have a column for 
>>SSN, then there's a solution involving SQL.
>>
>>2. If it's in semi-free-text files where the SSNs are marked somehow:
>>
>>---client header---
>>surname: Doe first: John initial: Q SSN:123456789 blah blah
>>or
>>123456789
>>
>>then there's another solution which involves finding the markers ...
>>
>>3. If it's really free text, like
>>"""
>>File note: Today John Q. Doe telephoned to advise that his Social 
>>Security # is 123456789  not 987654321 (which is his wife's) and the soc 
>>sec numbers of his kids Bob & Carol are 
>>"""
>>then you might be in some difficulty ... google("TREC")
>>
>>
>>AND however you do it, you need to be very aware of the possibility 
>>(especially with really free text) of changing some string of digits 
>>that's NOT an SSN.
> 
> 
> That's possible, but I think not probable.

You "think not probable" -- based on what? pious hope? what the customer 
has told you? inspection of the data?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-16 Thread rbt
On Tue, 2005-06-14 at 19:51 +0200, Gilles Lenfant wrote:
> rbt wrote:
> > Here's the scenario:
> > 
> > You have many hundred gigabytes of data... possible even a terabyte or 
> > two. Within this data, you have private, sensitive information (US 
> > social security numbers) about your company's clients. Your company has 
> > generated its own unique ID numbers to replace the social security numbers.
> > 
> > Now, management would like the IT guys to go thru the old data and 
> > replace as many SSNs with the new ID numbers as possible. You have a tab 
> > delimited txt file that maps the SSNs to the new ID numbers. There are 
> > 500,000 of these number pairs. What is the most efficient way  to 
> > approach this? I have done small-scale find and replace programs before, 
> > but the scale of this is larger than what I'm accustomed to.
> > 
> > Any suggestions on how to approach this are much appreciated.
> 
> Is this huge amount of data to search/replace stored in an RDBMS or in 
> flat file(s) with markup (XML, CSV, ...)?
> 
> --
> Gilles

The data is in files. Mostly Word documents and excel spreadsheets. The
SSN map I have is a plain text file that has a format like this:

ssn-xx- new-id-
ssn-xx- new-id-
etc.

There are a bit more than 500K of these pairs.

Thank you,
rbt

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-16 Thread rbt
On Tue, 2005-06-14 at 19:51 +0200, Gilles Lenfant wrote:
> rbt wrote:
> > Here's the scenario:
> > 
> > You have many hundred gigabytes of data... possible even a terabyte or 
> > two. Within this data, you have private, sensitive information (US 
> > social security numbers) about your company's clients. Your company has 
> > generated its own unique ID numbers to replace the social security numbers.
> > 
> > Now, management would like the IT guys to go thru the old data and 
> > replace as many SSNs with the new ID numbers as possible. You have a tab 
> > delimited txt file that maps the SSNs to the new ID numbers. There are 
> > 500,000 of these number pairs. What is the most efficient way  to 
> > approach this? I have done small-scale find and replace programs before, 
> > but the scale of this is larger than what I'm accustomed to.
> > 
> > Any suggestions on how to approach this are much appreciated.
> 
> Is this huge amount of data to search/replace stored in an RDBMS or in 
> flat file(s) with markup (XML, CSV, ...)?
> 
> --
> Gilles

I apologize that it has taken me so long to respond. I had an HDD crash
which I am in the process of recovering from. If emails to
[EMAIL PROTECTED] bounced, that is the reason why.


The data is in files. Mostly Word documents and excel spreadsheets. The
SSN map I have is a plain text file that has a format like this:

ssn-xx- new-id-
ssn-xx- new-id-
etc.

There are a bit more than 500K of these pairs.

Thank you,
rbt


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-16 Thread rbt
On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote:
> rbt wrote:
> > Here's the scenario:
> > 
> > You have many hundred gigabytes of data... possible even a terabyte or 
> > two. Within this data, you have private, sensitive information (US 
> > social security numbers) about your company's clients. Your company has 
> > generated its own unique ID numbers to replace the social security numbers.
> > 
> > Now, management would like the IT guys to go thru the old data and 
> > replace as many SSNs with the new ID numbers as possible.
> 
> This question is grossly OT; it's nothing at all to do with Python. 
> However 
> 
> (0) Is this homework?

No, it is not.

> 
> (1) What to do with an SSN that's not in the map?

Leave it be.

> 
> (2) How will a user of the system tell the difference between "new ID 
> numbers" and SSNs?

We have documentation. The two sets of numbers (SSNs and new_ids) are
exclusive of each other.

> 
> (3) Has the company really been using the SSN as a customer ID instead 
> of an account number, or have they been merely recording the SSN as a 
> data item? Will the "new ID numbers" be used in communication with the 
> clients? Will they be advised of the new numbers? How will you handle 
> the inevitable cases where the advice doesn't get through?

My task is purely technical.

> 
> (4) Under what circumstances will it not be possible to replace *ALL* 
> the SSNs?

I do not understand this question.

> 
> (5) For how long can the data be off-line while it's being transformed?

The data is on file servers that are unused on weekends and nights.

> 
> 
> > You have a tab 
> > delimited txt file that maps the SSNs to the new ID numbers. There are 
> > 500,000 of these number pairs.
> 
> And what is the source of the SSNs in this file??? Have they been 
> extracted from the data? How?

That is irrelevant.

> 
> > What is the most efficient way  to 
> > approach this? I have done small-scale find and replace programs before, 
> > but the scale of this is larger than what I'm accustomed to.
> > 
> > Any suggestions on how to approach this are much appreciated.
> 
> A sensible answer will depend on how the data is structured:
> 
> 1. If it's in a database with tables some of which have a column for 
> SSN, then there's a solution involving SQL.
> 
> 2. If it's in semi-free-text files where the SSNs are marked somehow:
> 
> ---client header---
> surname: Doe first: John initial: Q SSN:123456789 blah blah
> or
> 123456789
> 
> then there's another solution which involves finding the markers ...
> 
> 3. If it's really free text, like
> """
> File note: Today John Q. Doe telephoned to advise that his Social 
> Security # is 123456789  not 987654321 (which is his wife's) and the soc 
> sec numbers of his kids Bob & Carol are 
> """
> then you might be in some difficulty ... google("TREC")
> 
> 
> AND however you do it, you need to be very aware of the possibility 
> (especially with really free text) of changing some string of digits 
> that's NOT an SSN.

That's possible, but I think not probable.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-14 Thread Gilles Lenfant
rbt wrote:
> Here's the scenario:
> 
> You have many hundred gigabytes of data... possible even a terabyte or 
> two. Within this data, you have private, sensitive information (US 
> social security numbers) about your company's clients. Your company has 
> generated its own unique ID numbers to replace the social security numbers.
> 
> Now, management would like the IT guys to go thru the old data and 
> replace as many SSNs with the new ID numbers as possible. You have a tab 
> delimited txt file that maps the SSNs to the new ID numbers. There are 
> 500,000 of these number pairs. What is the most efficient way  to 
> approach this? I have done small-scale find and replace programs before, 
> but the scale of this is larger than what I'm accustomed to.
> 
> Any suggestions on how to approach this are much appreciated.

Is this huge amount of data to search/replace stored in an RDBMS or in 
flat file(s) with markup (XML, CSV, ...)?

--
Gilles
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-14 Thread Steve Holden
Brian wrote:
> Hi Rbt,
> 
> To give an example of processing a lot of data, I used Python to read 
> and process every word in a single text file that contained the entire 
> King James Bible version.  It processed it within about one second -- 
> split the words, etc.  Worked quite well.
> 
> Hope this helps,
> 
Hardly "big" by the OP's standards at only 4.5 MB (and roughly 790,000 
words, if anyone's interested).

The problem with processing terabytes is frequently the need to 
accommodate the data with techniques that keep the virtual memory from 
becoming swamped. The last thing you need to do under such 
circumstances is create an in-memory copy of the entire data set, since 
on most real computers this will inflict swapping behavior.

regards
  Steve
-- 
Steve Holden+1 703 861 4237  +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread Brian
Hi Rbt,

To give an example of processing a lot of data, I used Python to read 
and process every word in a single text file that contained the entire 
King James Bible version.  It processed it within about one second -- 
split the words, etc.  Worked quite well.

Hope this helps,

Brian
---


rbt wrote:
> Here's the scenario:
> 
> You have many hundred gigabytes of data... possible even a terabyte or 
> two. Within this data, you have private, sensitive information (US 
> social security numbers) about your company's clients. Your company has 
> generated its own unique ID numbers to replace the social security numbers.
> 
> Now, management would like the IT guys to go thru the old data and 
> replace as many SSNs with the new ID numbers as possible. You have a tab 
> delimited txt file that maps the SSNs to the new ID numbers. There are 
> 500,000 of these number pairs. What is the most efficient way  to 
> approach this? I have done small-scale find and replace programs before, 
> but the scale of this is larger than what I'm accustomed to.
> 
> Any suggestions on how to approach this are much appreciated.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread John Machin
rbt wrote:
> Here's the scenario:
> 
> You have many hundred gigabytes of data... possible even a terabyte or 
> two. Within this data, you have private, sensitive information (US 
> social security numbers) about your company's clients. Your company has 
> generated its own unique ID numbers to replace the social security numbers.
> 
> Now, management would like the IT guys to go thru the old data and 
> replace as many SSNs with the new ID numbers as possible.

This question is grossly OT; it's nothing at all to do with Python. 
However 

(0) Is this homework?

(1) What to do with an SSN that's not in the map?

(2) How will a user of the system tell the difference between "new ID 
numbers" and SSNs?

(3) Has the company really been using the SSN as a customer ID instead 
of an account number, or have they been merely recording the SSN as a 
data item? Will the "new ID numbers" be used in communication with the 
clients? Will they be advised of the new numbers? How will you handle 
the inevitable cases where the advice doesn't get through?

(4) Under what circumstances will it not be possible to replace *ALL* 
the SSNs?

(5) For how long can the data be off-line while it's being transformed?


> You have a tab 
> delimited txt file that maps the SSNs to the new ID numbers. There are 
> 500,000 of these number pairs.

And what is the source of the SSNs in this file??? Have they been 
extracted from the data? How?

> What is the most efficient way  to 
> approach this? I have done small-scale find and replace programs before, 
> but the scale of this is larger than what I'm accustomed to.
> 
> Any suggestions on how to approach this are much appreciated.

A sensible answer will depend on how the data is structured:

1. If it's in a database with tables some of which have a column for 
SSN, then there's a solution involving SQL.

2. If it's in semi-free-text files where the SSNs are marked somehow:

---client header---
surname: Doe first: John initial: Q SSN:123456789 blah blah
or
123456789

then there's another solution which involves finding the markers ...

3. If it's really free text, like
"""
File note: Today John Q. Doe telephoned to advise that his Social 
Security # is 123456789  not 987654321 (which is his wife's) and the soc 
sec numbers of his kids Bob & Carol are 
"""
then you might be in some difficulty ... google("TREC")


AND however you do it, you need to be very aware of the possibility 
(especially with really free text) of changing some string of digits 
that's NOT an SSN.

HTH,
John
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread Bruno Desthuilliers
rbt wrote:
> Here's the scenario:
> 
> You have many hundred gigabytes of data... possible even a terabyte or 
> two. Within this data, you have private, sensitive information (US 
> social security numbers) about your company's clients. Your company has 
> generated its own unique ID numbers to replace the social security numbers.
> 
> Now, management would like the IT guys to go thru the old data and 
> replace as many SSNs with the new ID numbers as possible. You have a tab 
> delimited txt file that maps the SSNs to the new ID numbers. There are 
> 500,000 of these number pairs. What is the most efficient way  to 
> approach this? I have done small-scale find and replace programs before, 
> but the scale of this is larger than what I'm accustomed to.
> 
> Any suggestions on how to approach this are much appreciated.

I may say something stupid (that would not be the first time), but why 
don't you just put it all in a good RDBMS and go with a few SQL queries?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread Robert Kern
Robert Kern wrote:

> I'm just tossing this out. I don't really have much experience with the 
> algorithm, but when searching for multiple targets, an Aho-Corasick 
> automaton might be appropriate.
> 
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

More puttering around the web suggests that pytst may be a faster/more 
scalable implementation.

http://www.lehuen.com/nicolas/index.php/Pytst

Try them both. If you meet success with one of them, come back and tell 
us about it.

-- 
Robert Kern
[EMAIL PROTECTED]

"In the fields of hell where the grass grows high
  Are the graves of dreams allowed to die."
   -- Richard Harter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread Paul Rubin
rbt <[EMAIL PROTECTED]> writes:
> Now, management would like the IT guys to go thru the old data and
> replace as many SSNs with the new ID numbers as possible. You have a
> tab delimited txt file that maps the SSNs to the new ID numbers. There
> are 500,000 of these number pairs. What is the most efficient way  to
> approach this? I have done small-scale find and replace programs
> before, but the scale of this is larger than what I'm accustomed to.

Just use an ordinary python dict for the map, on a system with enough
ram (maybe 100 bytes or so for each pair, so the map would be 50 MB).
Then it's simply a matter of scanning through the input files to find
SSN's and look them up in the dict.
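
Something along those lines, as an illustrative sketch (names are made
up; it assumes one tab-separated old/new pair per line and, per the
thread, leaves any SSN not in the map untouched):

```python
import re

ssn_re = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def load_map(path):
    # ~500K tab-delimited "old<TAB>new" lines: a dict of that size is
    # on the order of tens of MB, comfortably in RAM.
    mapping = {}
    with open(path) as f:
        for line in f:
            old, new = line.rstrip('\n').split('\t')
            mapping[old] = new
    return mapping

def replace_ssns(text, mapping):
    # Unmapped SSNs are left as-is ("leave it be", per the thread).
    return ssn_re.sub(lambda m: mapping.get(m.group(), m.group()), text)
```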
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread Roy Smith
rbt  <[EMAIL PROTECTED]> wrote:
>You have many hundred gigabytes of data... possible even a terabyte or 
>two. Within this data, you have private, sensitive information (US 
>social security numbers) about your company's clients. Your company has 
>generated its own unique ID numbers to replace the social security numbers.
>
>Now, management would like the IT guys to go thru the old data and 
>replace as many SSNs with the new ID numbers as possible.

The first thing I would do is a quick sanity check.  Write a simple
Python script which copies its input to its output with minimal
processing.  Something along the lines of:

import sys

for line in sys.stdin:       # iterate lines, not characters
    words = line.split()
    print(words)

Now, run this on a gig of your data on your system and time it.
Multiply the timing by 1000 (assuming a terabyte of real data), and
now you've got a rough lower bound on how long the real job will take.
If you find this will take weeks of processing time, you might want to
re-think the plan.  Better to discover that now than a few weeks into
the project :-)

>You have a tab 
>delimited txt file that maps the SSNs to the new ID numbers. There are 
>500,000 of these number pairs. What is the most efficient way  to 
>approach this? I have done small-scale find and replace programs before, 
>but the scale of this is larger than what I'm accustomed to.

The obvious first thing is to read the SSN/ID map file and build a
dictionary where the keys are the SSN's and the values are the ID's.

The next step depends on your data.  Is it in text files?  An SQL
database?  Some other format?  One way or another you'll have to read
it in, do something along the lines of "newSSN = ssnMap[oldSSN]", and
write it back out.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread Robert Kern
rbt wrote:
> Here's the scenario:
> 
> You have many hundred gigabytes of data... possible even a terabyte or 
> two. Within this data, you have private, sensitive information (US 
> social security numbers) about your company's clients. Your company has 
> generated its own unique ID numbers to replace the social security numbers.
> 
> Now, management would like the IT guys to go thru the old data and 
> replace as many SSNs with the new ID numbers as possible. You have a tab 
> delimited txt file that maps the SSNs to the new ID numbers. There are 
> 500,000 of these number pairs. What is the most efficient way  to 
> approach this? I have done small-scale find and replace programs before, 
> but the scale of this is larger than what I'm accustomed to.
> 
> Any suggestions on how to approach this are much appreciated.

I'm just tossing this out. I don't really have much experience with the 
algorithm, but when searching for multiple targets, an Aho-Corasick 
automaton might be appropriate.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
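
[For illustration, the automaton idea fits in a few dozen lines of plain
Python; a toy sketch of the algorithm, not a replacement for the
optimized library above.]

```python
from collections import deque

def build_automaton(patterns):
    # Node 0 is the root.  Each node: outgoing edges, failure link,
    # and the patterns that end at this node.
    trie = [{'next': {}, 'fail': 0, 'out': []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]['next']:
                trie.append({'next': {}, 'fail': 0, 'out': []})
                trie[node]['next'][ch] = len(trie) - 1
            node = trie[node]['next'][ch]
        trie[node]['out'].append(pat)
    # Breadth-first pass to set failure links (root's children fail to root).
    queue = deque(trie[0]['next'].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]['next'].items():
            queue.append(child)
            f = trie[node]['fail']
            while f and ch not in trie[f]['next']:
                f = trie[f]['fail']
            trie[child]['fail'] = trie[f]['next'].get(ch, 0)
            trie[child]['out'] += trie[trie[child]['fail']]['out']
    return trie

def search(trie, text):
    # One pass over the text, regardless of the number of patterns.
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in trie[node]['next']:
            node = trie[node]['fail']
        node = trie[node]['next'].get(ch, 0)
        for pat in trie[node]['out']:
            yield i - len(pat) + 1, pat
```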

Good luck!

-- 
Robert Kern
[EMAIL PROTECTED]

"In the fields of hell where the grass grows high
  Are the graves of dreams allowed to die."
   -- Richard Harter

-- 
http://mail.python.org/mailman/listinfo/python-list


Is Python Suitable for Large Find & Replace Operations?

2005-06-13 Thread rbt
Here's the scenario:

You have many hundred gigabytes of data... possibly even a terabyte or 
two. Within this data, you have private, sensitive information (US 
social security numbers) about your company's clients. Your company has 
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go thru the old data and 
replace as many SSNs with the new ID numbers as possible. You have a tab 
delimited txt file that maps the SSNs to the new ID numbers. There are 
500,000 of these number pairs. What is the most efficient way to 
approach this? I have done small-scale find and replace programs before, 
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.
-- 
http://mail.python.org/mailman/listinfo/python-list