Crawl and Index specific links on specific page

anish_88 Thu, 12 Dec 2013 22:12:03 -0800

Hi

I am new to Nutch, so I am just getting started. I want to crawl a
specific page and, under that page, crawl only specific links. For example:


I want to crawl only http://nutch.apache.org/downloads.html

Under this page I want to crawl only, say, *.txt links. They can be
active (clickable) links, or they could be embedded as plain text in some div,
as we often see in forums where links to file upload/download sites are
pasted into a div, e.g.
http://example.com/movie_abcd/firstpart.avi

Here I want to crawl only links ending in .avi. I am confused about
regex-urlfilter because it is the only filter I have used so far, and I am not
familiar with the other URL filters, such as the prefix and suffix URL filters.
Do they also play an important role in solving my problem? How can I achieve this?
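
To make the goal concrete, here is a minimal Python sketch of the filtering behaviour I have in mind. It is not Nutch code: the rule list only imitates the style of regex-urlfilter.txt ('+' accepts, '-' rejects, the first matching rule decides), and the names RULES, accept and find_links are hypothetical, made up for illustration.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt (an assumption, not
# a real Nutch config): '+' accepts, '-' rejects, first matching rule wins.
RULES = [
    ("+", re.compile(r"^http://nutch\.apache\.org/downloads\.html$")),  # the seed page
    ("+", re.compile(r"\.(txt|avi)$")),                                 # wanted suffixes
    ("-", re.compile(r".")),                                            # reject the rest
]


def accept(url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


def find_links(text):
    """Pick up URLs that appear as plain text (e.g. pasted into a div)."""
    return re.findall(r"https?://[^\s\"'<>]+", text)


page_text = "<div>part one: http://example.com/movie_abcd/firstpart.avi</div>"
wanted = [u for u in find_links(page_text) if accept(u)]
```

With these assumed rules, the seed page and any URL ending in .txt or .avi pass, and everything else is rejected.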

I look forward to your answers.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawl-and-Index-specific-links-on-specific-page-tp4106524.html
Sent from the Nutch - User mailing list archive at Nabble.com.
