Fetch Filter
------------

                 Key: NUTCH-828
                 URL: https://issues.apache.org/jira/browse/NUTCH-828
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
         Environment: All
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.1
         Attachments: NUTCH-828-1-20100608.patch

Adds a Nutch extension point for a fetch filter.  The fetch filter allows 
filtering content and parse data/text after it is fetched but before it is 
written to segments.  The filter returns true if the content is to be written 
or false if it is not.
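
As a rough illustration of the true/false contract described above, here is a 
minimal, self-contained sketch.  The interface name PageFilter, the method 
accept, its parameters, and the term list are all hypothetical stand-ins chosen 
for this example; the actual extension point and its signature are defined in 
the attached patch and may look quite different.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the extension point; NOT the interface from the patch.
interface PageFilter {
    // Returns true if the fetched page should be written to segments.
    boolean accept(String url, String content, String parseText);
}

// Illustrative topical filter: keeps pages whose parsed text mentions a sports term.
class SportsTopicFilter implements PageFilter {
    // Assumed term list, for illustration only.
    private static final List<String> TERMS =
        Arrays.asList("football", "tennis", "cricket");

    @Override
    public boolean accept(String url, String content, String parseText) {
        if (parseText == null) {
            return false; // nothing to analyze, so reject
        }
        String text = parseText.toLowerCase(Locale.ROOT);
        for (String term : TERMS) {
            if (text.contains(term)) {
                return true; // on topic: content would be written
            }
        }
        return false; // off topic: content would be dropped
    }
}
```

A real topical filter would likely use a classifier rather than substring 
matching; the point here is only that the decision is made on fetched content 
and reduces to a single boolean.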

Some use cases for this filter are topical search engines that only want 
to fetch/index certain types of content, for example a news-only or 
sports-only search engine.  In these situations the only way to determine 
whether content belongs to a particular set is to fetch the page and then 
analyze the content.  If the content passes, meaning it belongs to the set of, 
say, sports pages, then we want to include it.  If it doesn't, then we want to 
ignore it, never fetch that same page in the future, and ignore any urls on 
that page.  If content is rejected by a fetch filter, its status is written to 
the CrawlDb as gone and its content is ignored and not written to segments.  
This effectively stops crawling along the crawl path of that page and the urls 
from that page.  An example filter, fetch-safe, is provided that allows 
fetching only content that does not contain any word from a list of bad words.
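
The rejection behavior described above can be sketched as follows.  This is 
not the fetch-safe plugin itself: the class SafeContentSketch, the Status enum, 
the Outcome type, and the whole-word matching rule are assumptions made for 
illustration, and no Nutch classes are used.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the fetch-time decision; not code from the patch.
class SafeContentSketch {
    enum Status { FETCHED, GONE }

    static final class Outcome {
        final Status status;          // status that would be recorded in the CrawlDb
        final boolean writeContent;   // whether content goes to segments
        final List<String> outlinks;  // urls that would still be followed

        Outcome(Status status, boolean writeContent, List<String> outlinks) {
            this.status = status;
            this.writeContent = writeContent;
            this.outlinks = outlinks;
        }
    }

    // True if no token of the text is in badWords (badWords assumed lowercase).
    static boolean isSafe(String parseText, Set<String> badWords) {
        for (String token : parseText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (badWords.contains(token)) {
                return false;
            }
        }
        return true;
    }

    // Rejected pages are marked gone, their content is dropped, and their
    // outlinks are discarded, which ends the crawl path through that page.
    static Outcome process(String parseText, List<String> outlinks, Set<String> badWords) {
        if (isSafe(parseText, badWords)) {
            return new Outcome(Status.FETCHED, true, outlinks);
        }
        return new Outcome(Status.GONE, false, Collections.<String>emptyList());
    }
}
```

Marking the page as gone, rather than simply skipping it, is what keeps it from 
being scheduled for fetching again.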

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
