@Alan, @Lie: thanks. The approach I am taking right now is to take some test cases and create rules for them. Later, as I expanded the set of cases, some arose that didn't follow the earlier pattern, so I tweaked some rules to match all of them. The task is time-consuming, but with every new test set the exceptions are becoming fewer and fewer. (There are 0.2 million such pages.)
PS. The task is to create a trademark database which stores the ID, company name, date, address, and trademarks from the original set, and later matches given trademarks against it to disqualify similar ones.

On Sun, Jan 3, 2010 at 2:53 PM, Lie Ryan <[email protected]> wrote:

> On 1/3/2010 4:58 PM, Shashwat Anand wrote:
>
>> I need to extract some meaningful data from garbage.
>> Here are four examples. I need to get date, company name and address
>> from these.
>> For date I used regex but I'm unable to find any definite pattern for
>> address and company name
>> the format is more or less :
>> garbage
>> id - date
>> garbage
>> company name
>> garbage
>> company address
>> garbage
>>
>> How should I parse info if I'm not certain of any definite rules. This
>> is my first time dealing with real-life data.
>>
>
> Other than the "id - date"; it seems quite difficult to reliably extract
> the company names and addresses. Extracting the company names and addresses
> appears to be based on a best-effort basis.
>
> Tips: look for clue keywords; company names often end with
> ltd/sdn/bhd/berhad; lines that start with "address" are often followed by
> the actual addresses; etc.
>
> Tips: this is a good showcase for TDD; pick twenty-or-so cases and manually
> extract the information and write your program to match as much of these
> test cases as possible (while manually extracting you should be able to
> notice additional patterns that you can use later on while writing your
> program).
>
> _______________________________________________
> Tutor maillist - [email protected]
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>
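For reference, the clue-keyword tips and the test-case workflow discussed in this thread could be sketched roughly as below. This is only a minimal sketch under assumptions: the DD/MM/YYYY date format, the sample pages, and the names `extract`, `CASES` and `failures` are all made up for illustration, since the real pages are not shown here. The real rules would have to grow from the hand-labelled cases.

```python
import re

# ASSUMPTION: a numeric ID, a dash, and a DD/MM/YYYY date on one line.
# The real "id - date" format may differ.
ID_DATE_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d{1,2}/\d{1,2}/\d{4})\s*$")
# Clue keywords from the tips: company names often end with these suffixes.
COMPANY_RE = re.compile(r"\b(ltd|sdn|bhd|berhad)\.?\s*$", re.IGNORECASE)
# Lines starting with "address" are often followed by the actual address.
ADDRESS_RE = re.compile(r"^\s*address\b\s*:?\s*(.*)$", re.IGNORECASE)


def extract(page):
    """Best-effort extraction; fields that are not found stay None."""
    record = {"id": None, "date": None, "company": None, "address": None}
    lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        m = ID_DATE_RE.match(line)
        if m and record["id"] is None:
            record["id"], record["date"] = m.group(1), m.group(2)
            continue
        if record["company"] is None and COMPANY_RE.search(line):
            record["company"] = line
            continue
        m = ADDRESS_RE.match(line)
        if m and record["address"] is None:
            # The address may share the keyword's line or sit on the next one.
            rest = m.group(1).strip()
            if rest:
                record["address"] = rest
            elif i + 1 < len(lines):
                record["address"] = lines[i + 1]
    return record


# Hand-labelled test cases (invented samples, not real data).
CASES = [
    ("xx##\n12345 - 03/01/2010\n~~~\nAcme Widgets Sdn Bhd\n@@\n"
     "Address: 12 Jalan Example, Kuala Lumpur\n",
     {"id": "12345", "date": "03/01/2010",
      "company": "Acme Widgets Sdn Bhd",
      "address": "12 Jalan Example, Kuala Lumpur"}),
    ("99 - 1/2/2009\njunk\nFoo Trading Ltd.\nADDRESS\n5 Example Road\n",
     {"id": "99", "date": "1/2/2009",
      "company": "Foo Trading Ltd.",
      "address": "5 Example Road"}),
]


def failures(cases):
    """Return (page, expected, actual) for every case the rules get wrong."""
    results = [(page, expected, extract(page)) for page, expected in cases]
    return [r for r in results if r[1] != r[2]]
```

Each time a new batch of pages turns up an exception, it can be added to `CASES` first and the rules tweaked until `failures(CASES)` is empty again, so earlier cases are not silently broken by a new rule.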
