Hi, I want to exclude some Yahoo Answers URLs from crawling.
A few examples are as follows:

1. http://answers.yahoo.com/question/?link=answer&qid=20091122033318AA3huLM
2. http://answers.yahoo.com/question/index?link=answer&qid=20091122033342AAOM4wP
3. http://answers.yahoo.com/question/index?qid=20091121162441AAOnXso&link=mailto
4. http://answers.yahoo.com/question/accuse_write?qid=20091116082658AAzR9qL&kid=Jr1wAnTjCVjn.WTrbMQi&s=comm&date=2009-11-21+09%3A20%3A28&.crumb=
5. http://answers.yahoo.com/question/report?qid=20091122033318AA3huLM&kid=DZZPMmO7I0PuHPDMoI2x&date=2009-11-22+03%3A33%3A18&.crumb=&s=q
6. http://answers.yahoo.com/answer/report?qid=20091120013612AARBqlD&kid=H4JcAW7NV2jAEVlLGO65&.crumb=
7. http://answers.yahoo.com/answer/report?qid=20091120013625AAo1Ota&kid=FsN4XGC9UzltYyeO8uaqeIMG8.92dGiXm6bhhsbceWn9b.flUOVf&.crumb=
8. http://answers.yahoo.com/rss/question?qid=20091121183113AAPHkNo

I tried setting up the following filters for some of the above in crawl-urlfilter.txt, but that didn't help:

For 1. -^http://answers.yahoo.com/question/\?link=answer\&qid=([a-zA-Z0-9]*)
For 2. -^http://answers.yahoo.com/question/index/\?link=answer\&qid=([a-zA-Z0-9]*)
For 3. -^http://answers.yahoo.com/question/index/\?qid=([a-zA-Z0-9]*)\&link=mailto
For 4. -^http://answers.yahoo.com/question/accuse_write\?qid=([a-zA-Z0-9]*)\&kid=([a-zA-Z0-9]\.*)/\?date=([a-zA-Z0-9]\-\+\%\&\.*)crumb=

Any suggestions? Thanks in advance.

If I make changes to the config files (crawl-urlfilter.txt for the crawl CLI and regex-urlfilter.txt for the generate, fetch, updatedb... CLIs), they will be loaded the next time I run the CLIs. Is my understanding correct?

Thanks and Regards,
Vidya.

--
View this message in context: http://old.nabble.com/Yahoo-Answers-subdirectory-exclusion-filter-tp26464177p26464177.html
Sent from the Nutch - User mailing list archive at Nabble.com.
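
A note on the filters above: comparing them with the sample URLs, filters 2 and 3 expect `index/?` while the URLs contain `index?` with no slash, and filter 4 expects `/?date=` shortly after `kid=` although the URL has `&s=comm&date=` at that point. The Python sketch below checks this outside Nutch using `re.search`. It assumes that the regex URL filter applies the first rule whose pattern is found anywhere in the URL, and that a URL matching no rule is rejected; that is my understanding of the behaviour, not something confirmed in this thread. The `SUGGESTED` rules and the helper names are hypothetical and would need testing against a real crawl.

```python
import re

# Sample URLs 1, 2, 3 and 8 from the post.
URLS = [
    "http://answers.yahoo.com/question/?link=answer&qid=20091122033318AA3huLM",
    "http://answers.yahoo.com/question/index?link=answer&qid=20091122033342AAOM4wP",
    "http://answers.yahoo.com/question/index?qid=20091121162441AAOnXso&link=mailto",
    "http://answers.yahoo.com/rss/question?qid=20091121183113AAPHkNo",
]

# Filters 1-3 exactly as attempted in the post.
ATTEMPTED = [
    r"-^http://answers.yahoo.com/question/\?link=answer\&qid=([a-zA-Z0-9]*)",
    r"-^http://answers.yahoo.com/question/index/\?link=answer\&qid=([a-zA-Z0-9]*)",
    r"-^http://answers.yahoo.com/question/index/\?qid=([a-zA-Z0-9]*)\&link=mailto",
]

# Hypothetical replacement rules (an assumption, not from the post).
# Exclusions come first because the first matching rule is assumed to win.
SUGGESTED = [
    r"-^http://answers\.yahoo\.com/question/(index)?\?.*link=(answer|mailto)",
    r"-^http://answers\.yahoo\.com/(question|answer)/(accuse_write|report)\?",
    r"-^http://answers\.yahoo\.com/rss/",
    r"+^http://answers\.yahoo\.com/",
]


def first_match(rules, url):
    """Return the first rule whose pattern is found in url, or None."""
    for rule in rules:
        pattern = rule[1:]  # drop the leading '+' or '-'
        if re.search(pattern, url):
            return rule
    return None


def is_accepted(rules, url):
    """Accept only if the first matching rule starts with '+'."""
    rule = first_match(rules, url)
    return rule is not None and rule.startswith("+")


if __name__ == "__main__":
    for url in URLS:
        print(url)
        print("  attempted rule hit:", first_match(ATTEMPTED, url))
        print("  accepted by suggested rules:", is_accepted(SUGGESTED, url))
```

In this sketch, attempted filter 1 does match URL 1. If that URL is still crawled, one possible cause is an earlier `+` rule in the same file accepting it before the exclusion is reached, so rule order is worth checking as well.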