Hi,

The indexing job ("bin/nutch index") will delete this document.
But it looks like "bin/nutch indexchecker -DdoIndex=true" does not
(cf. NUTCH-1758).

Please note that "bin/nutch parsechecker", or "indexchecker" without "doIndex",
will not send anything to the index.
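
To rule out the page itself, it can help to confirm outside of Nutch that the
fetched HTML really carries the robots noindex directive. Below is a minimal
sketch, assuming Python 3 and only its standard library. The names
RobotsMetaParser and has_robots_noindex are hypothetical helpers made up for
this illustration. This is not how Nutch detects the directive internally.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # content="noindex,follow" -> {"noindex", "follow"}
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_robots_noindex(html):
    """Return True if the HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex,follow"/></head></html>'
    print(has_robots_noindex(sample))
```

If this returns True for the HTML served to the crawler, the page is marked
correctly and what remains is the deletion at indexing time.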

Best,
Sebastian


On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
> Hi all.
> I need some help with this problem, sorry if it is a trivial thing.
> I have a small problem with some URLs that have a noindex meta tag and are 
> still being indexed.
> 
> For example, this URL:
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> 
> has the noindex meta tag, and for some reason it is not deleted:
> <meta name="robots" content="noindex,follow"/>
> 
> I have read that Nutch should delete this document at indexing time, but 
> that is not happening.
> 
> <property>
>   <name>indexer.delete.robots.noindex</name>
>   <value>true</value>
> </property>
> 
> If I run parsechecker, the output has empty content, but the document is 
> not deleted:
> 
> fetching: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> robots.txt whitelist not configured.
> parsing: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> contentType: text/html
> date :        Wed May 10 14:21:36 CDT 2017
> agent :       cubbot
> type :        text/html
> type :        text
> type :        html
> title :       3
> url : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> content :     
> tstamp :      Wed May 10 14:21:36 CDT 2017
> domain :      uci.cu
> digest :      25ed6b1b7be4cbb69a3405f5efe2f8a2
> host :        humanos.uci.cu
> name :        3
> id :  https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> lang :        es
> 
> Any help or suggestions would be appreciated.
> 
