[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286
 ] 

Julien Nioche commented on NUTCH-710:
-------------------------------------

As suggested previously, we could either treat canonicals as redirections or handle them during deduplication. Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical 
is not available for indexing. We also want to follow the outlinks. 
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex 
due to the fact that we need to follow redirections.

We probably need a third approach: prefilter by going through the crawlDB and 
detecting URLs whose canonical target is already indexed or ready to be 
indexed. We need to follow up to X levels of redirection, e.g. doc A is marked 
as having canonical representation doc B, doc B redirects to doc C, etc. If the 
end of the redirection chain exists and is valid, then mark A as a duplicate of 
C (the intermediate redirects will not get indexed anyway).
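To make the chain-following concrete, here is a minimal in-memory sketch in Java (not the MapReduce job we would actually run over the crawlDB); the redirects map, the indexable set and the MAX_REDIRS constant are illustrative placeholders:

import java.util.Map;
import java.util.Set;

public class CanonicalChains {

  static final int MAX_REDIRS = 5;   // the "X levels" of redirection mentioned above

  /** Follow a chain starting at url; return the final indexable target,
   *  or null if the chain dead-ends or is longer than MAX_REDIRS. */
  static String resolveChain(String url, Map<String, String> redirects,
                             Set<String> indexable) {
    String current = url;
    for (int level = 0; level <= MAX_REDIRS; level++) {
      if (indexable.contains(current)) {
        return current;              // valid end of chain, e.g. doc C
      }
      String next = redirects.get(current);
      if (next == null) {
        return null;                 // target not fetched/parsed, chain is invalid
      }
      current = next;                // follow one more redirection
    }
    return null;                     // more than X levels: give up
  }
}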

As we don't know whether the duplicate has already been indexed, we would give 
it a special marker (e.g. status_duplicate) in the crawlDB. Then:
-> if the indexer comes across such an entry: skip it
-> make it so that *DeleteDuplicates can take a list of URLs with status_duplicate 
as an additional source of input, OR have a custom tool that deletes such 
entries from the Solr or Lucene indices
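For the first of those two points, the indexer-side check could be as trivial as the sketch below; STATUS_DB_DUPLICATE is a hypothetical new constant, and where it would really live (presumably CrawlDatum) and its value are still to be decided:

import org.apache.nutch.crawl.CrawlDatum;

public class DuplicateMarker {

  // hypothetical new status value; would really be defined on CrawlDatum
  public static final byte STATUS_DB_DUPLICATE = 0x42;

  /** The indexer would skip any crawlDB entry carrying this marker. */
  public static boolean shouldSkip(CrawlDatum datum) {
    return datum.getStatus() == STATUS_DB_DUPLICATE;
  }
}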

The implementation would be as follows:

Go through all redirections and generate all redirection chains, e.g.

A -> B
B -> C
D -> C

where C is an indexable document (i.e. it has been fetched and parsed - it may 
already have been indexed).

This will yield

A -> C
B -> C
D -> C

but also

C -> C

Once we have all possible redirections: go through the crawlDB in search of 
canonicals. If the target of a canonical is the source of a valid alias (e.g. A 
- B - C - D), mark it as 'status:duplicate'.
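Ignoring the MapReduce plumbing, the two passes boil down to something like the following (an in-memory sketch again, reusing the resolveChain method from above; all names are placeholders):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CanonicalDedup {

  /** Pass 1: collapse every redirection chain onto its final indexable target,
   *  so A -> B, B -> C, D -> C becomes A -> C, B -> C, D -> C plus C -> C. */
  static Map<String, String> collapseRedirects(Map<String, String> redirects,
                                               Set<String> indexable) {
    Map<String, String> aliases = new HashMap<String, String>();
    for (String target : indexable) {
      aliases.put(target, target);                        // C -> C
    }
    for (String source : redirects.keySet()) {
      String end = CanonicalChains.resolveChain(source, redirects, indexable);
      if (end != null) {
        aliases.put(source, end);                         // e.g. A -> C
      }
    }
    return aliases;
  }

  /** Pass 2: collect every URL whose canonical target resolves to a valid
   *  alias; these are the entries that would get status:duplicate. */
  static Set<String> findDuplicates(Map<String, String> canonicals,
                                    Map<String, String> aliases) {
    Set<String> duplicates = new HashSet<String>();
    for (Map.Entry<String, String> e : canonicals.entrySet()) {
      String url = e.getKey();
      String resolved = aliases.get(e.getValue());
      if (resolved != null && !resolved.equals(url)) {
        duplicates.add(url);                              // mark as duplicate of 'resolved'
      }
    }
    return duplicates;
  }
}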

This design implies generating quite a few intermediate structures + scanning 
the whole crawlDB twice (once for the aliases, then for the canonicals) + rewriting 
the whole crawlDB to mark some of the entries as duplicates.

This would be much easier to do when we have Nutch2/HBase: we could simply follow 
the redirects from the initial URL having a canonical tag instead of generating 
these intermediate structures. We could then modify the entries one by one 
instead of regenerating the whole crawlDB.
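A per-URL lookup along those lines could look roughly like this with a raw HBase client (very much a sketch: the table layout, the 'f' family and the redir_to/status qualifiers are invented here, and the real Nutch2 code would go through its own storage layer rather than HBase directly):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCanonicalSketch {

  private static final byte[] FAMILY   = Bytes.toBytes("f");        // invented
  private static final byte[] REDIR_TO = Bytes.toBytes("redir_to"); // invented
  private static final byte[] STATUS   = Bytes.toBytes("status");   // invented
  private static final int MAX_REDIRS  = 5;

  /** Starting from a URL carrying a canonical tag, follow the target's
   *  redirections row by row; if the chain ends on a fetched doc, mark the
   *  original row as a duplicate in place - no full crawlDB rewrite needed. */
  static void markIfDuplicate(HTable table, String url, String canonicalTarget)
      throws IOException {
    String current = canonicalTarget;
    for (int level = 0; level <= MAX_REDIRS; level++) {
      Result row = table.get(new Get(Bytes.toBytes(current)));
      if (row.isEmpty()) {
        return;                     // target unknown: leave the entry alone
      }
      byte[] next = row.getValue(FAMILY, REDIR_TO);
      if (next == null) {           // end of chain (simplification: no redirect column = indexable)
        Put put = new Put(Bytes.toBytes(url));
        put.add(FAMILY, STATUS, Bytes.toBytes("duplicate"));
        table.put(put);
        return;
      }
      current = Bytes.toString(next);   // one more level of redirection
    }
  }
}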

WDYT?



> Support for rel="canonical" attribute
> -------------------------------------
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
