Hi,
The indexing job ("bin/nutch index") will delete this document.
But it looks like "bin/nutch indexchecker -DdoIndex=true"
does not (cf. NUTCH-1758).
Please note that "bin/nutch parsechecker" or "indexchecker" without "doIndex"
will not send anything to the index.
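
To confirm independently of Nutch that a page really carries the noindex
directive, a minimal sketch along these lines could be used. The names
RobotsMetaParser and has_noindex are made up for illustration, and this is
not how Nutch itself evaluates robots meta tags; it only inspects the HTML
you pass in.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html):
    """Return True if the HTML asks robots not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex,nofollow".
    return "noindex" in parser.directives or "none" in parser.directives
```

Feeding it the tag from the page in question, e.g.
has_noindex('<meta name="robots" content="noindex,follow"/>'), should
report that the page asks not to be indexed.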
Best,
Sebastian
On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
> Hi all.
> I need some help with this problem; sorry if it is a trivial thing.
> I have a small problem with some URLs that have a noindex meta tag but are
> being indexed anyway.
>
> For example, this URL:
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>
> has the noindex meta tag, and for some reason it is not deleted:
> <meta name="robots" content="noindex,follow"/>
>
> I have read that Nutch should delete this document at indexing time, but
> that is not happening.
>
> <property>
> <name>indexer.delete.robots.noindex</name>
> <value>true</value>
> </property>
>
> If I run parsechecker, the output has empty content, but the document is
> not deleted:
>
> fetching: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> robots.txt whitelist not configured.
> parsing: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> contentType: text/html
> date : Wed May 10 14:21:36 CDT 2017
> agent : cubbot
> type : text/html
> type : text
> type : html
> title : 3
> url : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> content :
> tstamp : Wed May 10 14:21:36 CDT 2017
> domain : uci.cu
> digest : 25ed6b1b7be4cbb69a3405f5efe2f8a2
> host : humanos.uci.cu
> name : 3
> id : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> lang : es
>
> Any help or suggestion would be appreciated.
>