Hi,
This is a tough problem indeed. We partially mitigate it by using
several regular expressions, LinkRank scores with a domain-limiting
generator for regular crawls, and a second shallow crawl that only
follows links from the home page.
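
For illustration only (these are simplified example patterns, not
the actual configuration), regex-urlfilter.txt exclusion rules for
calendar-style URLs can look like:

  # skip paths ending in a year or year-month segment, e.g. /events/2015-11
  -.*/(calendar|events?)/\d{4}([-/]\d{1,2})?/?$
  # skip query-driven calendar views (month/year/date parameters)
  -.*[?&](month|year|date)=\d
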
A custom URLFilter, as Ferdy explains, is a good idea indeed.
However, URLFilters operate on single URLs only, which is about as
difficult as writing regular expressions. If we could process all
outlinks of a given page at the same time, it would be easier to
compare them, calculate their similarity and, if needed, discard the
ones we consider unwanted calendar links.
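
To make that idea concrete, here is a rough, hypothetical sketch
(the class and method names below are made up for illustration and
are not existing Nutch code): group a page's outlinks by the path
that remains after stripping a trailing date-like segment, and drop
any group with many URLs differing only in that segment.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Hypothetical helper, not part of Nutch: prunes calendar-like outlinks
  // by grouping them on the URL prefix before a trailing date segment.
  public class CalendarOutlinkPruner {

    // Matches a trailing segment such as /2015-11, /2015/11 or /2015-11-03
    private static final Pattern DATE_TAIL =
        Pattern.compile("/\\d{4}([-/]\\d{1,2}){0,2}/?$");

    // Keep only outlinks that do not belong to a large date-only group.
    public static List<String> prune(List<String> outlinks, int maxPerGroup) {
      Map<String, List<String>> dateGroups = new HashMap<>();
      List<String> kept = new ArrayList<>();

      for (String url : outlinks) {
        Matcher m = DATE_TAIL.matcher(url);
        if (m.find()) {
          // Group by the prefix before the date-like segment.
          String prefix = url.substring(0, m.start());
          dateGroups.computeIfAbsent(prefix, k -> new ArrayList<>()).add(url);
        } else {
          kept.add(url);
        }
      }

      // A prefix with many variants differing only by date is most likely
      // a calendar; discard the whole group.
      for (List<String> group : dateGroups.values()) {
        if (group.size() <= maxPerGroup) {
          kept.addAll(group);
        }
      }
      return kept;
    }
  }

In Nutch, logic like this would have to run somewhere that sees all
outlinks of a page at once, e.g. a parse or scoring filter, rather
than in a URLFilter.
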
Can you explain how Heritrix does it? Perhaps we can learn from it.
Cheers,
Markus
On Sat, 5 May 2012 12:44:27 +0200, Ferdy Galema <[email protected]> wrote:
Hi,
Fetching unwanted pages, such as in this case dynamically generated
pages, is a general problem. Currently I'm not aware of any pending
improvements in this area, but feel free to contribute if you have a
solution. Probably the best way to solve such a problem is by
implementing a custom URLFilter. This filter might have some
heuristics able to detect dynamically generated URLs.
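
For concreteness, a minimal filter along those lines could look
roughly like the sketch below. It assumes the Nutch 1.x URLFilter
interface, omits the plugin descriptor and build wiring, and the
date regex is only an illustrative heuristic.

  import java.util.regex.Pattern;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Sketch of a custom URLFilter; plugin wiring is omitted and the
  // heuristic below would need tuning for real sites.
  public class CalendarURLFilter implements URLFilter {

    // Reject URLs whose path ends in a year or year-month segment,
    // e.g. /category/events/2015-11
    private static final Pattern CALENDAR =
        Pattern.compile(".*/\\d{4}(-\\d{1,2})?/?$");

    private Configuration conf;

    @Override
    public String filter(String urlString) {
      // Returning null tells Nutch to drop the URL; returning it keeps it.
      if (CALENDAR.matcher(urlString).matches()) {
        return null;
      }
      return urlString;
    }

    @Override
    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    @Override
    public Configuration getConf() {
      return conf;
    }
  }
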
Ferdy.
On Fri, May 4, 2012 at 9:13 PM, Xiao Li <[email protected]> wrote:
Hi Nutch people,
I am using Nutch to index a website. I notice that Nutch has crawled
some junk webpages, such as
http://**************/category/events/2015-11. This webpage is about
events occurring in November 2015, which is of no use to me. I want
to know whether it is possible for Nutch to intelligently skip such
webpages. It may be argued that I can use a regex to avoid this.
However, since the naming pattern of calendar webpages is not always
the same, there is no way to write a perfect regex for this. I know
Heritrix (an Internet Archive crawler) has the capability to avoid
crawling nonsense calendar webpages. Has anyone solved this issue?
Regards
Xiao