RE: about canonical pages to avoid duplicates pages

Markus Jelsma Wed, 26 Oct 2016 13:35:49 -0700

Hello Eyeris - there is no such thing in Nutch right now, although I do seem to 
remember having a plugin that provides support for it, as well as support for 
it via HTTP headers and og:url. It of course applies normalizing and filtering, 
and it uses robots=noindex to prevent duplicates from being indexed.
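
As a rough illustration of the idea only (this is not Nutch code and not the 
plugin mentioned above), the lookup could be sketched in Python as below. The 
names find_canonical, is_duplicate and canonical_from_link_header are 
hypothetical, the precedence of HTTP Link header, then link rel=canonical, then 
og:url is an assumption, and the URL normalizing and filtering step is left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Collects the first <link rel="canonical"> and <meta property="og:url">."""

    def __init__(self):
        super().__init__()
        self.link_canonical = None
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        a = {k: (v or "") for k, v in attrs}
        if tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.link_canonical = self.link_canonical or a.get("href")
        elif tag == "meta" and a.get("property", "").lower() == "og:url":
            self.og_url = self.og_url or a.get("content")


def canonical_from_link_header(value):
    """Return the target of a rel="canonical" entry in a Link header, or None.

    Simplified parsing: entries are split on commas, so a target URL that
    itself contains a comma is not handled.
    """
    for part in (value or "").split(","):
        m = re.match(r'\s*<([^>]*)>(.*)', part)
        if m and re.search(r'rel\s*=\s*"?[^";]*\bcanonical\b', m.group(2), re.I):
            return m.group(1)
    return None


def find_canonical(page_url, html, link_header=None):
    """Return the absolute canonical URL declared for a page, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    # Assumed precedence: HTTP header, then <link>, then og:url.
    candidate = (canonical_from_link_header(link_header)
                 or parser.link_canonical
                 or parser.og_url)
    if not candidate:
        return None
    return urljoin(page_url, candidate.strip())


def is_duplicate(page_url, html, link_header=None):
    """True when the page declares a canonical URL other than its own."""
    canonical = find_canonical(page_url, html, link_header)
    return canonical is not None and canonical != page_url
```

A page for which is_duplicate returns True is the kind of page that would be 
marked robots=noindex so that only the canonical URL ends up in the index.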


You can also try to improve on the patch attached to NUTCH-710. There are 
excellent comments for guidance. 

M.
 
-----Original message-----
> From:Eyeris Rodriguez Rueda <[email protected]>
> Sent: Wednesday 26th October 2016 22:01
> To: [email protected]
> Subject: about canonical pages to avoid duplicates pages
> 
> Hi all.
> I'm using Nutch 1.12 and Solr 4.10.3 in local mode.
> I have detected a lot of duplicate pages in the CrawlDb. Maybe by using the 
> canonical attribute I can reduce the duplicate pages in the CrawlDb.
> I have read an old issue (see below) that covers this interesting topic.
> https://issues.apache.org/jira/browse/NUTCH-710 
> 
> Is this feature supported by Nutch or not?
> 
> 
>
