Thanks, Sebastian, for your answer. I am using Nutch 1.12 in local mode and always use the command bin/crawl for a complete cycle. For some reason, all documents with the noindex meta tag are being indexed. I have tested bin/nutch index and the documents are still indexed.
I have also tested bin/nutch parsechecker and indexchecker with doIndex=true, but the problem persists. It looks as if Nutch never reads the property indexer.delete.robots.noindex from nutch-site.xml. I have read the configure method in the IndexerMapReduce.java class and it has a line for that property, but I don't understand why those documents are indexed:

this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX,false); (line 97)

Please, I really want to solve this problem; any advice or suggestion will be appreciated.

----- Original message -----
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Sent: Thursday, May 11, 2017 10:05:35
Subject: [MASSMAIL]Re: problems with documents with noindex meta

Hi,

the indexing job ("bin/nutch index") will delete this document. But it looks like "bin/nutch indexchecker -DdoIndex=true" does not (cf. NUTCH-1758).

Please note that "bin/nutch parsechecker" or "indexchecker" without "doIndex" will not send anything to the index.

Best,
Sebastian

On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
> Hi all.
> I need some help with this problem; sorry if it is a trivial thing.
> I have a little problem with some URLs that have the noindex meta tag and are being
> indexed.
>
> For example, this URL:
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>
> has the noindex meta tag and for some reason it is not deleted:
> <meta name="robots" content="noindex,follow"/>
>
> I have read that Nutch should delete this document at indexing time, and
> it is not occurring correctly.
>
> <property>
> <name>indexer.delete.robots.noindex</name>
> <value>true</value>
> </property>
>
> If I run parsechecker, the output has an empty content field, but the document is
> not deleted:
>
> fetching: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> robots.txt whitelist not configured.
> parsing: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> contentType: text/html
> date : Wed May 10 14:21:36 CDT 2017
> agent : cubbot
> type : text/html
> type : text
> type : html
> title : 3
> url : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> content :
> tstamp : Wed May 10 14:21:36 CDT 2017
> domain : uci.cu
> digest : 25ed6b1b7be4cbb69a3405f5efe2f8a2
> host : humanos.uci.cu
> name : 3
> id : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> lang : es
>
> Please, any help or suggestion will be appreciated.
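
Regarding the suspicion that the property indexer.delete.robots.noindex is never read from nutch-site.xml: a first sanity check is whether the file itself yields the expected boolean. The following is only a minimal sketch using the Java standard library, not Nutch or Hadoop code. The class name NoIndexPropertyCheck and the method readBoolean are hypothetical, and the sketch assumes a lookup similar to getBoolean(name, default): the last definition of the property wins, and anything other than "true" or "false" falls back to the default. It does not merge nutch-default.xml or honor final properties.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NoIndexPropertyCheck {

    // Hypothetical helper (not part of Nutch): reads a boolean property from
    // nutch-site.xml style content and returns defaultValue when the property
    // is missing or its value is neither "true" nor "false".
    static boolean readBoolean(String xml, String name, boolean defaultValue) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        String lastValue = null; // assumption: a later definition overrides an earlier one
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() == 0 || values.getLength() == 0) {
                continue; // skip incomplete <property> entries
            }
            if (name.equals(names.item(0).getTextContent().trim())) {
                lastValue = values.item(0).getTextContent().trim();
            }
        }

        if ("true".equalsIgnoreCase(lastValue)) {
            return true;
        }
        if ("false".equalsIgnoreCase(lastValue)) {
            return false;
        }
        return defaultValue;
    }

    public static void main(String[] args) throws Exception {
        // Same property block as in the quoted message, wrapped in <configuration>.
        String xml = "<configuration>"
                + "<property>"
                + "<name>indexer.delete.robots.noindex</name>"
                + "<value>true</value>"
                + "</property>"
                + "</configuration>";
        System.out.println(readBoolean(xml, "indexer.delete.robots.noindex", false));
    }
}
```

If a check like this returns the default for the real conf/nutch-site.xml, the likely causes are a typo in the property name, a <property> block placed outside <configuration>, or a different conf directory being used at runtime.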

