Re: Nutch - Restriction by content type

2023-11-16 Thread Markus Jelsma
Hello,

You can skip certain types of documents based on their file extension
using the urlfilter-suffix plugin. It only filters URLs whose suffix is in
its configured list. Filtering on content type before fetching is not
possible, because the content type is only known once a document has been
fetched and parsed.
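
To illustrate why a suffix check can run before fetching while a content
type check cannot, here is a minimal Python sketch of the idea. It is not
Nutch code: the function name accept_url and the SKIPPED_SUFFIXES list are
hypothetical, and the real plugin reads its suffixes from its own
configuration.

```python
from urllib.parse import urlparse

# Hypothetical skip list, for illustration only; the real
# urlfilter-suffix plugin takes its suffixes from its configuration.
SKIPPED_SUFFIXES = (".pdf", ".zip", ".jpg", ".png")


def accept_url(url: str) -> bool:
    """Return False when the URL path ends in a skipped suffix."""
    # Only the URL string is inspected, so no request is needed.
    path = urlparse(url).path.lower()
    return not path.endswith(SKIPPED_SUFFIXES)


print(accept_url("https://example.com/report.pdf"))     # False
print(accept_url("https://example.com/index.html"))     # True
# A URL without a telling suffix passes, even if the server would
# return a PDF: the content type is unknown before fetching.
print(accept_url("https://example.com/download?id=7"))  # True
```

The last call shows the limitation: a URL with no recognisable extension
cannot be rejected from the URL alone.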

You can skip specific content types when indexing using the Jexl indexing
filter.
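
The logic of such an index-time check can be sketched in Python as below.
This is only an illustration of the decision, under the assumption that
the Content-Type value of each fetched document is available as a string.
In Nutch the check is written as a JEXL expression in the configuration,
not in Python, and the names media_type and should_index are hypothetical.

```python
def media_type(content_type_header: str) -> str:
    """Strip parameters such as '; charset=UTF-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def should_index(content_type_header: str, allowed=("text/html",)) -> bool:
    """Keep a fetched document only if its media type is allowed."""
    return media_type(content_type_header) in allowed


print(should_index("text/html; charset=UTF-8"))  # True
print(should_index("application/pdf"))           # False
```

Note that the documents are still fetched; they are only kept out of the
index.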

Regards,
Markus

On Thu, 16 Nov 2023 at 14:56, Raj Chidara wrote:

> Hello,
>   Can we control the crawling of web pages by their content type through
> any configuration setting? For example, I want to crawl only pages whose
> content type is text/html from a website, and I do not want to crawl other
> pages/files.
>
>
>
> Thanks and Regards
>
> Raj Chidara


Nutch - Restriction by content type

2023-11-16 Thread Raj Chidara
Hello,
  Can we control the crawling of web pages by their content type through any
configuration setting? For example, I want to crawl only pages whose content
type is text/html from a website, and I do not want to crawl other pages/files.



Thanks and Regards

Raj Chidara


 
 
 