On 1/3/2010 4:58 PM, Shashwat Anand wrote:
I need to extract some meaningful data from garbage.
Here are four examples. I need to get date, company name and address
from these.
For the date I used a regex, but I'm unable to find any definite pattern for
the address and company name.
The format is more or less:
garbage
id - date
garbage
company name
garbage
company address
garbage
How should I parse the info if I'm not certain of any definite rules? This
is my first time dealing with real-life data.
Other than the "id - date" line, it seems quite difficult to extract the
company names and addresses reliably. Extracting them will have to be done
on a best-effort basis.
Tip: look for clue keywords. Company names often end with
ltd/sdn/bhd/berhad; lines that start with "address" are often followed
by the actual address; etc.
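A rough sketch of the keyword idea, in case it helps. I have not seen your
data, so the regexes are assumptions: I assume the id is all digits, the date
looks like d/m/y, and the address follows an "address" label either on the
same line or on the next non-empty line. The function name extract() is just
something I made up.

```python
import re

# Assumed suffixes, taken from the tip above; extend as you see more data.
COMPANY_SUFFIX = re.compile(r"\b(ltd|sdn|bhd|berhad)\b\.?\s*$", re.IGNORECASE)
# Assumed layout of the "id - date" line: digits, a dash, then a d/m/y date.
ID_DATE = re.compile(r"^\s*(\d+)\s*-\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\s*$")
# A line starting with the word "address", optionally followed by a colon.
ADDRESS_LABEL = re.compile(r"^\s*address\b\s*:?\s*(.*)$", re.IGNORECASE)


def extract(text):
    """Best-effort extraction; any field may come back as None."""
    result = {"id": None, "date": None, "company": None, "address": None}
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        m = ID_DATE.match(line)
        if m and result["date"] is None:
            result["id"], result["date"] = m.group(1), m.group(2)
            continue
        if result["company"] is None and COMPANY_SUFFIX.search(line):
            result["company"] = line
            continue
        m = ADDRESS_LABEL.match(line)
        if m and result["address"] is None:
            # The address may share the line with the label, or sit on the
            # next non-empty line after it.
            rest = m.group(1).strip()
            if not rest:
                rest = next((l for l in lines[i + 1:] if l), None)
            result["address"] = rest
    return result
```

Only the first match for each field is kept, so a wrong early match will hide
a correct later one; that is one of the things your test cases should catch.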
Tip: this is a good showcase for TDD. Pick twenty or so cases, manually
extract the information, and write your program to match as many of these
test cases as possible (while manually extracting you should notice
additional patterns that you can use later when writing your program).
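The test-case approach could look something like the sketch below. The
records in CASES are invented for illustration (replace them with the ones
you extracted by hand), and extract_company() and run_cases() are
hypothetical names, with a deliberately naive extractor so that one case
fails.

```python
import re

# Hand-labelled cases: (raw text, expected company name). These records are
# made up; use your own manually extracted ones instead.
CASES = [
    ("###\nAcme Trading Sdn Bhd\n@@@", "Acme Trading Sdn Bhd"),
    ("zz\nExample Holdings Berhad\nqq", "Example Holdings Berhad"),
    ("junk\nWidget Works Ltd\njunk", "Widget Works Ltd"),
    ("junk\nSolo Enterprise\njunk", "Solo Enterprise"),  # no suffix clue
]

SUFFIX = re.compile(r"\b(ltd|sdn|bhd|berhad)\b\.?\s*$", re.IGNORECASE)


def extract_company(text):
    """Return the first line ending in a known company suffix, or None."""
    for line in text.splitlines():
        if SUFFIX.search(line.strip()):
            return line.strip()
    return None


def run_cases(extractor, cases):
    """Return (number passed, list of (text, expected, got) for failures)."""
    failures = []
    for text, expected in cases:
        got = extractor(text)
        if got != expected:
            failures.append((text, expected, got))
    return len(cases) - len(failures), failures


passed, failures = run_cases(extract_company, CASES)
print(f"{passed}/{len(CASES)} cases pass")
for text, expected, got in failures:
    print(f"expected {expected!r}, got {got!r}")
```

Each failure points at a pattern the extractor does not handle yet, so you
can add a rule, rerun, and watch the pass count go up without breaking the
cases that already worked.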