RE: Crawl and Index specific links on specific page

Markus Jelsma Fri, 13 Dec 2013 02:01:09 -0800
Prefix and suffix URL filters only filter on protocol schemes and file 
extensions, respectively. The latter can be used to filter out .txt or .avi. 
Unfortunately, URL filters are unaware of context: you cannot allow .txt or 
.avi only on a specific page via a suffix URL filter. You would need carefully 
constructed regex filters to allow specific files in a context.
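
To illustrate what "carefully constructed" means here, below is a rough Python 
sketch of how an ordered list of +/- regex rules behaves. It is not Nutch 
code. It assumes the usual regex-urlfilter semantics (first matching rule 
decides, no match means the URL is rejected). The rule list and the accept() 
helper are made up for this example, and the URLs are the ones from the 
question.

```python
import re

# Hypothetical rule set in the spirit of regex-urlfilter.txt.
# '+' accepts, '-' rejects; rules are tried top to bottom.
RULES = [
    # the one page to start from
    ("+", r"^http://nutch\.apache\.org/downloads\.html$"),
    # .txt files, but only on that host
    ("+", r"^http://nutch\.apache\.org/.*\.txt$"),
    # .avi files, but only under one known path
    ("+", r"^http://example\.com/movie_abcd/.*\.avi$"),
    # everything else is rejected
    ("-", r"."),
]


def accept(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://nutch.apache.org/downloads.html",
        "http://nutch.apache.org/CHANGES.txt",
        "http://example.com/movie_abcd/firstpart.avi",
        "http://example.com/other/clip.avi",
    ):
        print(url, accept(url))
```

Note that accept() only ever receives the URL string, never the page the link 
was found on. That is why the "context" has to be encoded in the pattern 
itself (host and path), and why a bare suffix rule such as \.avi$ would let 
.avi links through from anywhere.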
 
-----Original message-----
> From:anish_88 <[email protected]>
> Sent: Friday 13th December 2013 7:11
> To: [email protected]
> Subject: Crawl and Index specific links on specific page
> 
> Hi
> 
> I am new to Nutch, so I am just finding my way in. I want to crawl a
> specific page and, under that page, I want to crawl specific links. For example,
> 
> I want to crawl only http://nutch.apache.org/downloads.html
> 
> Under this page I want to crawl only, say, *.txt links. They can be
> active links like in    or they could be embedded in some div, as we often
> see in a variety of forums where a link to a file upload/download site is
> pasted/embedded in some div etc., like
> htp://example.com/movie_abcd/firstpart.avi
> 
> Here I just want to crawl links ending with avi. I am confused by
> regex-urlfilter because until now I have only used it, and I am not familiar
> with other URL filters such as the prefix and suffix URL filters. Do they also
> play an important role in the solution to my problem? How can I achieve this?
> 
> I will be curiously waiting for the answers.
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Crawl-and-Index-specific-links-on-specific-page-tp4106524.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>