Thanks, Sebastian, for your answer.
I am using Nutch 1.12 in local mode and always use the command bin/crawl for a 
complete cycle.
For some reason, all documents with a noindex meta tag are being indexed.
I have tested bin/nutch index and the documents are still indexed.

I have also tested bin/nutch parsechecker and indexchecker with doIndex=true, 
but the problem persists.

It looks as if Nutch never reads the property indexer.delete.robots.noindex in 
nutch-site.xml.

I have read the configure method in the IndexerMapReduce.java class, and it has 
a line for that property (line 97), but I don't understand why those documents 
are still indexed:

this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);
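
To make the expectation concrete, here is a minimal sketch of the behaviour I 
would expect from that flag. It is not the Nutch source: the class 
RobotsNoIndexSketch and the method shouldDelete are hypothetical names, and a 
plain boolean stands in for the value read from the job configuration (false 
by default, as in the line quoted above).

```java
import java.util.Locale;

public class RobotsNoIndexSketch {

    // Hypothetical helper (not part of Nutch): decides whether a document
    // should be deleted from the index rather than added to it.
    public static boolean shouldDelete(boolean deleteRobotsNoIndex, String robotsMeta) {
        if (!deleteRobotsNoIndex) {
            // Flag is off, so the robots meta tag is ignored.
            return false;
        }
        if (robotsMeta == null) {
            // The page has no robots meta tag at all.
            return false;
        }
        // Case-insensitive check for a "noindex" directive.
        return robotsMeta.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        String meta = "noindex,follow"; // value from the page in question
        System.out.println(shouldDelete(true, meta));  // flag set in nutch-site.xml
        System.out.println(shouldDelete(false, meta)); // flag left at its default
    }
}
```

If the flag were really read as true, I would expect the first case to apply 
to my pages; what I observe looks like the second case.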


Please, I really want to solve this problem; any advice or suggestion will be 
appreciated.

----- Original message -----
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Sent: Thursday, May 11, 2017 10:05:35
Subject: [MASSMAIL]Re: problems with documents with noindex meta

Hi,

the indexing job ("bin/nutch index") will delete this document.
But it looks like "bin/nutch indexchecker -DdoIndex=true" does not
(cf. NUTCH-1758).

Please, note that "bin/nutch parsechecker" or "indexchecker" without "doIndex"
will not send anything to the index.

Best,
Sebastian


On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
> Hi all.
> I need some help with this problem; sorry if it is a trivial thing.
> I have a little problem with some URLs that have a noindex meta tag and are 
> being indexed.
> 
> For example, this URL:
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> 
> has the noindex meta tag, and for some reason it is not deleted:
> <meta name="robots" content="noindex,follow"/>
> 
> I have read that Nutch should delete this document at indexing time, but 
> that is not happening.
> 
> <property>
>   <name>indexer.delete.robots.noindex</name>
>   <value>true</value>
> </property>
> 
> If I run parsechecker, the output has an empty content field, but the 
> document is not deleted:
> 
> fetching: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> robots.txt whitelist not configured.
> parsing: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> contentType: text/html
> date :        Wed May 10 14:21:36 CDT 2017
> agent :       cubbot
> type :        text/html
> type :        text
> type :        html
> title :       3
> url : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> content :     
> tstamp :      Wed May 10 14:21:36 CDT 2017
> domain :      uci.cu
> digest :      25ed6b1b7be4cbb69a3405f5efe2f8a2
> host :        humanos.uci.cu
> name :        3
> id :  https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> lang :        es
> 
> Please, any help or suggestion will be appreciated.

