Hi,

Fetching unwanted pages, such as in this case dynamically generated pages, is
a general problem. Currently I'm not aware of any pending improvements in
this area, but feel free to contribute if you have a solution. Probably the
best way to solve such a problem is to implement a custom URLFilter. This
filter could apply heuristics that detect dynamically generated URLs.
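
For illustration, a bare-bones version of such a filter might look like the
sketch below. The class name and the year-month regex heuristic are only
assumptions for the example, not a tested implementation; the URLFilter
contract itself (return the URL to keep it, null to drop it) is the standard
Nutch extension point.

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class CalendarURLFilter implements URLFilter {

    // Illustrative heuristic: reject URLs whose path ends in a
    // year-month segment such as /events/2015-11 or /archive/2015/11.
    private static final Pattern CALENDAR_PATTERN =
        Pattern.compile(".*/(19|20)\\d{2}[-/](0?[1-9]|1[0-2])/?$");

    private Configuration conf;

    // Returning null tells Nutch to discard the URL; returning the
    // URL unchanged lets it pass on to the next filter.
    @Override
    public String filter(String urlString) {
      if (CALENDAR_PATTERN.matcher(urlString).matches()) {
        return null;
      }
      return urlString;
    }

    @Override
    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    @Override
    public Configuration getConf() {
      return conf;
    }
  }

To activate it, you would declare the class under the
org.apache.nutch.net.URLFilter extension point in the plugin's plugin.xml
and add the plugin id to the plugin.includes property in nutch-site.xml.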

Ferdy.

On Fri, May 4, 2012 at 9:13 PM, Xiao Li <[email protected]> wrote:

> Hi Nutch people,
>
> I am using Nutch to index a website. I have noticed that Nutch has
> crawled some junk webpages, such as
> http://**************/category/events/2015-11. This webpage is about
> events occurring in November 2015, which is completely useless to me.
> I want to know whether it is possible for Nutch to intelligently skip
> such webpages. It may be argued that I can use a regex to avoid this.
> However, as the naming patterns of calendar webpages are not always
> the same, there is no way to write a perfect regex for this. I know
> Heritrix (an Internet Archive crawler) has the capability to avoid
> crawling nonsense calendar webpages. Has anyone solved this issue?
>
> Regards
> Xiao
>
