Hi,

I want to exclude some Yahoo Answers URLs from crawling.

A few examples are as follows:
1. http://answers.yahoo.com/question/?link=answer&qid=20091122033318AA3huLM
2. http://answers.yahoo.com/question/index?link=answer&qid=20091122033342AAOM4wP
3. http://answers.yahoo.com/question/index?qid=20091121162441AAOnXso&link=mailto
4. http://answers.yahoo.com/question/accuse_write?qid=20091116082658AAzR9qL&kid=Jr1wAnTjCVjn.WTrbMQi&s=comm&date=2009-11-21+09%3A20%3A28&.crumb=
5. http://answers.yahoo.com/question/report?qid=20091122033318AA3huLM&kid=DZZPMmO7I0PuHPDMoI2x&date=2009-11-22+03%3A33%3A18&.crumb=&s=q
6. http://answers.yahoo.com/answer/report?qid=20091120013612AARBqlD&kid=H4JcAW7NV2jAEVlLGO65&.crumb=
7. http://answers.yahoo.com/answer/report?qid=20091120013625AAo1Ota&kid=FsN4XGC9UzltYyeO8uaqeIMG8.92dGiXm6bhhsbceWn9b.flUOVf&.crumb=
8. http://answers.yahoo.com/rss/question?qid=20091121183113AAPHkNo

I tried setting up the following filters for some of the above in
crawl-urlfilter.txt, but that didn't help:
For 1. -^http://answers.yahoo.com/question/\?link=answer\&qid=([a-zA-Z0-9]*)
For 2. -^http://answers.yahoo.com/question/index/\?link=answer\&qid=([a-zA-Z0-9]*)
For 3. -^http://answers.yahoo.com/question/index/\?qid=([a-zA-Z0-9]*)\&link=mailto
For 4. -^http://answers.yahoo.com/question/accuse_write\?qid=([a-zA-Z0-9]*)\&kid=([a-zA-Z0-9]\.*)/\?date=([a-zA-Z0-9]\-\+\%\&\.*)crumb=
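
One way to check these four patterns outside Nutch is a small script like the
sketch below. It is only an approximation: it assumes each rule is applied as an
unanchored regex search on the full URL, and that Python's re module treats
these particular patterns the same way Java's regex engine does. The names
urls, patterns and matches are made up for this illustration.

```python
import re

# Example URLs 1-4, copied from the list in this message.
urls = {
    1: "http://answers.yahoo.com/question/?link=answer&qid=20091122033318AA3huLM",
    2: "http://answers.yahoo.com/question/index?link=answer&qid=20091122033342AAOM4wP",
    3: "http://answers.yahoo.com/question/index?qid=20091121162441AAOnXso&link=mailto",
    4: "http://answers.yahoo.com/question/accuse_write?qid=20091116082658AAzR9qL"
       "&kid=Jr1wAnTjCVjn.WTrbMQi&s=comm&date=2009-11-21+09%3A20%3A28&.crumb=",
}

# The filter patterns exactly as tried, minus the leading "-" (the exclude marker).
patterns = {
    1: r"^http://answers.yahoo.com/question/\?link=answer\&qid=([a-zA-Z0-9]*)",
    2: r"^http://answers.yahoo.com/question/index/\?link=answer\&qid=([a-zA-Z0-9]*)",
    3: r"^http://answers.yahoo.com/question/index/\?qid=([a-zA-Z0-9]*)\&link=mailto",
    4: r"^http://answers.yahoo.com/question/accuse_write\?qid=([a-zA-Z0-9]*)"
       r"\&kid=([a-zA-Z0-9]\.*)/\?date=([a-zA-Z0-9]\-\+\%\&\.*)crumb=",
}

def matches(n):
    """Return True if pattern n is found in URL n.

    Assumption: a search-style match (not a full match), which is how this
    sketch models the URL filter; the real filter may differ.
    """
    return re.search(patterns[n], urls[n]) is not None

results = {n: matches(n) for n in urls}
for n, ok in results.items():
    print(f"filter {n}: {'matches' if ok else 'does NOT match'}")
```

Under those assumptions only filter 1 matches its URL. Filters 2 and 3 expect a
"/" between "index" and "?" that the URLs do not contain, and in filter 4 the
kid group consumes a single character and then expects "/?date=", which never
appears. If filter 1 still has no effect in a real crawl, it may be that an
earlier rule in the same file matches the URL first.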

Any suggestions? Thanks in advance.

Also, is my understanding correct that changes to the config files
(crawl-urlfilter.txt for the crawl command, and regex-urlfilter.txt for the
generate, fetch, updatedb... commands) are picked up the next time I run those
commands?

Thanks and Regards,
Vidya.

-- 
View this message in context: 
http://old.nabble.com/Yahoo-Answers-subdirectory-exclusion-filter-tp26464177p26464177.html
Sent from the Nutch - User mailing list archive at Nabble.com.
