[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286 ]
Julien Nioche commented on NUTCH-710:
-------------------------------------

As suggested previously, we could treat canonicals either as redirections or during deduplication. Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical is not available for indexing. We also want to follow the outlinks.

Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex because we need to follow redirections.

We probably need a third approach: prefilter by going through the crawldb and detecting URLs whose canonical target is already indexed or ready to be indexed. We need to follow up to X levels of redirection, e.g. doc A is marked as having canonical representation doc B, doc B redirects to doc C, etc. If the end of the redirection chain exists and is valid, then we mark A as a duplicate of C (the intermediate redirects will not get indexed anyway). As we don't know whether the target has been indexed yet, we would give A a special marker (e.g. status_duplicate) in the crawlDB. Then:
-> if the indexer comes across such an entry: skip it
-> make it so that *DeleteDuplicates can take a list of URLs with status_duplicate as an additional source of input, OR have a custom resource that deletes such entries in SOLR or Lucene indices

The implementation would be as follows. Go through all redirections and generate all redirection chains, e.g.
A -> B
B -> C
D -> C
where C is an indexable document (i.e. it has been fetched and parsed; it may have already been indexed). This will yield
A -> C
B -> C
D -> C
but also C -> C.

Once we have all possible redirections, go through the crawlDB in search of canonicals. If the target of a canonical is the source of a valid alias (e.g. A - B - C - D), mark it as 'status:duplicate'.

This design implies generating quite a few intermediate structures, scanning the whole crawlDB twice (once for the aliases, then for the canonicals), and rewriting the whole crawlDB to mark some of the entries as duplicates.
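The chain-resolution and duplicate-marking steps above could be sketched roughly as below. This is a minimal, self-contained illustration, not Nutch code: the names CanonicalDedup, resolveChainEnd, markDuplicates, and MAX_REDIRECTS are all hypothetical, and the in-memory maps stand in for the crawlDB scans described above.

```java
import java.util.*;

// Illustrative sketch of the proposed pre-filter pass (names are hypothetical).
public class CanonicalDedup {

    static final int MAX_REDIRECTS = 5; // the "up to X levels" limit

    // Follow redirects from 'url' until a non-redirecting URL is reached;
    // give up (return null) on a cycle or after MAX_REDIRECTS hops.
    static String resolveChainEnd(String url, Map<String, String> redirects) {
        Set<String> seen = new HashSet<>();
        String cur = url;
        for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
            if (!seen.add(cur)) return null;   // cycle detected
            String next = redirects.get(cur);
            if (next == null) return cur;      // end of the chain
            cur = next;
        }
        return null;                           // chain too long
    }

    // For every page carrying a canonical link, resolve the canonical target
    // through the redirect chains; if the chain end is indexable, record the
    // page as a duplicate of that end (the 'status:duplicate' marking).
    static Map<String, String> markDuplicates(Map<String, String> canonicals,
                                              Map<String, String> redirects,
                                              Set<String> indexable) {
        Map<String, String> duplicates = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : canonicals.entrySet()) {
            String end = resolveChainEnd(e.getValue(), redirects);
            if (end != null && indexable.contains(end) && !end.equals(e.getKey())) {
                duplicates.put(e.getKey(), end);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // A declares B canonical, B redirects to C, C is indexable:
        // A should therefore be marked as a duplicate of C.
        Map<String, String> redirects = Map.of("B", "C");
        Map<String, String> canonicals = Map.of("A", "B");
        Set<String> indexable = Set.of("C");
        System.out.println(markDuplicates(canonicals, redirects, indexable));
    }
}
```

The cycle/depth guard matters because redirect chains in a real crawl can loop or be arbitrarily long; a chain that cannot be resolved within the limit is simply left alone rather than marked.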
This would be much easier to do when we have Nutch2/HBase: we could simply follow the redirects from the initial URL having a canonical tag, instead of generating these intermediate structures. We could then modify the entries one by one instead of regenerating the whole crawlDB. WDYT?

> Support for rel="canonical" attribute
> -------------------------------------
>
> Key: NUTCH-710
> URL: https://issues.apache.org/jira/browse/NUTCH-710
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 1.1
> Reporter: Frank McCown
> Priority: Minor
>
> There is a new rel="canonical" attribute which is now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of URLs crawled and indexed and reduce duplicate page content.