Shashwat Anand wrote:
> @Alan, @Lie thanks
> The approach I am taking right now is to take some test cases and
> create rules for them. Later on, after expanding the cases, some cases
> arose which didn't follow the earlier pattern, so I tweaked some rules
> to match all of them. The task is time-consuming, but with every new
> test set the exceptions are becoming fewer and fewer. (There are 0.2
> million such pages.)
>
> PS. The task is to create a trademark database which stores ID, company
> name, date, address, and trademarks from the original set, and later
> matches it against the given trademarks to disqualify similar trademarks.

Sometimes it is worthwhile to note patterns for the parts of the source that are not to be kept (garbage), especially if you can find patterns for the start or end of the garbage parts. E.g. if the end of the address is hard to recognise, check whether it is easier to find a pattern for the start of the garbage that follows it.

Denis
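
A minimal sketch of the idea in Python, assuming a page layout I made up for illustration: the field labels, the sample text, and the two garbage markers ("Disclaimer:" and "Page N of M") are hypothetical, not taken from the real pages. The address has no reliable end pattern of its own, so the code stops at the first line that looks like the start of garbage.

```python
import re

# Hypothetical page text: the labels and garbage lines are assumptions
# for illustration only, not the real page format.
PAGE = """ID: 12345
Company: Acme Widgets Ltd
Address: 12 Example Road,
Block B, Sample Town 400001
Disclaimer: this notice is generated automatically
Page 3 of 7
"""

# Where the wanted part begins.
ADDRESS_START = re.compile(r"^Address:\s*", re.MULTILINE)
# Patterns for the *start* of the garbage that may follow the address.
GARBAGE_START = re.compile(r"^(?:Disclaimer:|Page \d+ of \d+)", re.MULTILINE)


def extract_address(page):
    """Return the text between 'Address:' and the next garbage marker."""
    start = ADDRESS_START.search(page)
    if start is None:
        return None
    # Look for garbage only after the address label.
    end = GARBAGE_START.search(page, start.end())
    stop = end.start() if end else len(page)
    # Collapse the line breaks inside the address into single spaces.
    return " ".join(page[start.end():stop].split())


print(extract_address(PAGE))
```

When a new kind of garbage shows up in a later test set, only GARBAGE_START should need another alternative; the address rule itself stays untouched.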
