Hi,

there is already an issue open:
https://issues.apache.org/jira/browse/NUTCH-710

I've just been struggling with the rel=canonical tag myself.
About 70% of the documents of the crawled site had this tag set.
The quick solution was to write a parse filter that extracts the
tag and an indexing filter that skips all documents with this tag.
As Julien mentioned in the issue, this has the drawback that
some content may get lost: docs whose canonical tag points to a
document that is gone, to a redirect, or to a doc that itself has
a canonical tag.
However, in my case that is far better than having so many duplicates.
A real solution would be somewhat difficult, esp. for Nutch 1.x,
because resolving chains of canonical tags and/or redirects
would mean iterating several times over the data / CrawlDb.
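
To illustrate the quick solution described above, here is a minimal
sketch of the two decisions involved: pulling the canonical target out
of a page, and skipping any document that carries the tag. It is plain
Java and deliberately does not use the Nutch parse/indexing filter
interfaces. The class name CanonicalLinkSketch and both method names
are hypothetical, and the regex-based matching is an assumption made
for brevity (a real filter would more likely work on the parsed DOM).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: illustrates the idea only.
class CanonicalLinkSketch {
    // Matches one complete <link ...> tag.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches rel=canonical, with or without quotes, inside a tag.
    private static final Pattern REL_CANONICAL =
        Pattern.compile("\\brel\\s*=\\s*[\"']?canonical[\"']?(?=[\\s/>])",
                        Pattern.CASE_INSENSITIVE);
    // Captures a quoted href value in group 2.
    private static final Pattern HREF =
        Pattern.compile("\\bhref\\s*=\\s*([\"'])(.*?)\\1",
                        Pattern.CASE_INSENSITIVE);

    // Returns the href of the first <link rel="canonical"> tag, if any.
    static Optional<String> extractCanonical(String html) {
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String linkTag = tag.group();
            if (!REL_CANONICAL.matcher(linkTag).find()) {
                continue; // some other <link>, e.g. a stylesheet
            }
            Matcher href = HREF.matcher(linkTag);
            if (href.find()) {
                return Optional.of(href.group(2).trim());
            }
        }
        return Optional.empty();
    }

    // Mirrors the quick solution: skip every document that has the tag.
    static boolean shouldSkip(String html) {
        return extractCanonical(html).isPresent();
    }
}
```

Note that shouldSkip drops a document even when its canonical target
is gone or is never indexed, which is exactly the drawback mentioned
above.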

As a minimum, how about writing the target of the canonical tag
to the CrawlDatum's metadata? That would make a solution that iterates
over the CrawlDb possible. And an indexing filter that skips those
URLs/documents would be trivial to implement. Any suggestions?
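
To make the chain problem concrete, here is a rough sketch of what
resolving a canonical/redirect chain could look like once the targets
are available. It assumes the targets have already been collected into
an in-memory map (url -> canonical or redirect target) and that a set
of URLs known to exist is at hand. Both are simplifications: in
Nutch 1.x this would have to be done as repeated passes over the
CrawlDb rather than with a map. The class CanonicalChainSketch, the
method resolve and the maxHops limit are all hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Nutch: illustrates the idea only.
class CanonicalChainSketch {
    /**
     * Follows canonical/redirect targets starting at url.
     * Returns the final URL of the chain, or empty if the chain ends at
     * an unknown (e.g. gone) document, contains a cycle, or needs more
     * than maxHops hops.
     */
    static Optional<String> resolve(String url,
                                    Map<String, String> targets,
                                    Set<String> knownUrls,
                                    int maxHops) {
        Set<String> seen = new HashSet<>();
        String current = url;
        for (int hop = 0; hop <= maxHops; hop++) {
            if (!seen.add(current)) {
                return Optional.empty(); // cycle detected
            }
            String next = targets.get(current);
            if (next == null || next.equals(current)) {
                // End of chain: accept it only if the document exists.
                return knownUrls.contains(current)
                    ? Optional.of(current)
                    : Optional.empty();
            }
            current = next;
        }
        return Optional.empty(); // chain longer than maxHops
    }
}
```

A document would then be skipped at indexing time only if resolve
returns a URL different from its own. If it returns empty, the document
could be kept, so that content is not lost to a broken chain.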

Sebastian

On 03/22/2012 08:40 PM, Markus Jelsma wrote:
This is not supported by Nutch and there's no issue ticket yet. Feel free to 
open one.

On Thu, 22 Mar 2012 14:32:26 -0500, <thomas.j.lut...@wellsfargo.com> wrote:
Ran across a posting for the Nutch roadmap mentioning support for the
canonical tag.

http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/nutch/Nutch2Roadmap
Is there any update as to when this support will be added to Nutch?

