Shashwat Anand wrote:

> @Alan, @Lie thanks
> The approach I am taking right now is to take some test cases and create
> rules for them. Later, as I expanded the cases, some arose that didn't
> follow the earlier pattern, so I tweaked some rules to match all of them.
> The task is time-consuming, but with every new test set the exceptions
> become fewer and fewer. (There are 0.2 million such pages.)
> 
> PS. The task is to create a trademark database which stores ID, company
> name, date, address, and trademarks from the original set, and later
> matches them against given trademarks to disqualify similar trademarks.

Sometimes it is worthwhile to note patterns for the parts of the source that
are not to be kept (garbage), especially if you can find patterns for the
start or end of the garbage parts. E.g., if the end of the address is hard to
detect, check whether it is easier to find a pattern for the start of the
garbage that follows.
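
A rough sketch of that idea, in case it helps. The page layout here is made
up: I am assuming the address begins with "Address:" and that the garbage
after it always begins with a line like "Page 3 of 17". The names
extract_address, ADDRESS_START and GARBAGE_START are only illustrative; your
real pages will need their own patterns.

```python
import re

# Hypothetical page text: the address has no reliable end marker,
# but the garbage that follows it starts with "Page <n> of <m>".
page = """\
Address: 12 Example Street,
Some Town 560001
Page 3 of 17
Printed from the registry site
"""

# Assumed patterns (not taken from the real data).
ADDRESS_START = re.compile(r"^Address:\s*", re.MULTILINE)
GARBAGE_START = re.compile(r"^Page \d+ of \d+", re.MULTILINE)

def extract_address(text):
    """Return the text between 'Address:' and the start of the garbage."""
    start = ADDRESS_START.search(text)
    if start is None:
        return None
    # Look for the garbage marker only after the address has begun.
    end = GARBAGE_START.search(text, start.end())
    # If no garbage marker is found, keep everything up to the end of the text.
    stop = end.start() if end else len(text)
    # Collapse line breaks and runs of spaces inside the address.
    return " ".join(text[start.end():stop].split())

print(extract_address(page))
```

The point is that the address itself is never described by a pattern; only
its start and the start of what follows are, which tends to be more stable
when the addresses vary a lot.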

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/
_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
