Thanks Sebastian. I have opened a Jira issue for that:
https://issues.apache.org/jira/browse/NUTCH-2387

Do you think that the responsibility for deleting documents with the robots
noindex meta belongs to the MapReduce class or to indexing filters like
index-basic or index-more?

----- Original Message -----
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Sent: Thursday, May 18, 2017 11:45:43
Subject: Re: [MASSMAIL]Re: problems with documents with noindex meta

Hi,

sorry for the late answer...

> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but
> the problem persists.

That's expected, as indexchecker does not support deletion by robots meta.
Could you open a Jira issue to fix this? Thanks!

> It looks like Nutch never reads the property indexer.delete.robots.noindex
> in nutch-site.xml

The indexer job (IndexerMapReduce.java) does ...

> I have read the configure method in the IndexerMapReduce.java class and it
> has a line for that property, but I don't understand why those documents
> are indexed.
>
> this.deleteRobotsNoIndex =
>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);  (line 97)

Ok, and it should work (tested with 1.13-SNAPSHOT):

% cat > urls.txt
https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
^C
% nutch inject crawldb urls.txt
...
Injector: Total new urls injected: 1
Injector: finished at 2017-05-18 17:31:16, elapsed: 00:00:01
% nutch generate crawldb segments
...
% nutch fetch segments/20170518173127
...
fetching https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/ (queue crawl delay=5000ms)
...
Fetcher: finished at 2017-05-18 17:31:42, elapsed: 00:00:07
% nutch parse segments/20170518173127
...
% nutch updatedb crawldb/ segments/20170518173127
...
% nutch index -Dindexer.delete.robots.noindex=true \
    -Dplugin.includes=indexer-dummy -Ddummy.path=index.txt \
    crawldb/ segments/20170518173127/ -deleteGone
Segment dir is complete: segments/20170518173127.
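The property lookup discussed above can be pictured with a minimal, self-contained sketch. This is plain Java, not the actual Hadoop Configuration class, and `getBoolean` here is a hypothetical stand-in for `job.getBoolean(key, default)`: unless indexer.delete.robots.noindex is set to true in nutch-site.xml or via -D on the command line, the flag keeps its false default and no noindex deletions are attempted.

```java
import java.util.HashMap;
import java.util.Map;

public class NoIndexFlagSketch {
    // Hypothetical stand-in for Hadoop's Configuration.getBoolean(key, default):
    // return the default when the key is absent, otherwise parse the value.
    static boolean getBoolean(Map<String, String> conf, String key, boolean def) {
        String v = conf.get(key);
        return (v == null) ? def : Boolean.parseBoolean(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Property not set anywhere: the indexer keeps noindex documents.
        boolean deleteRobotsNoIndex =
            getBoolean(conf, "indexer.delete.robots.noindex", false);
        System.out.println(deleteRobotsNoIndex); // false

        // Property set via -D or nutch-site.xml: deletions become possible.
        conf.put("indexer.delete.robots.noindex", "true");
        deleteRobotsNoIndex =
            getBoolean(conf, "indexer.delete.robots.noindex", false);
        System.out.println(deleteRobotsNoIndex); // true
    }
}
```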
Indexer: starting at 2017-05-18 17:38:52
Indexer: deleting gone documents: true
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
DummyIndexWriter
    dummy.path : Path of the file to write to (mandatory)

Indexer: number of documents indexed, deleted, or skipped:
Indexer:      1  deleted (robots=noindex)
Indexer: finished at 2017-05-18 17:38:53, elapsed: 00:00:01
% cat index.txt
delete  https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/

Did you use -Dindexer.delete.robots.noindex=true in combination with
-deleteGone? Otherwise no "delete" actions are performed. That's not really
clear, and it's also not handled the same way by all indexer plugins:
indexer-solr does not delete without -deleteGone, but indexer-elastic does.

Best,
Sebastian

On 05/18/2017 02:43 PM, Eyeris Rodriguez Rueda wrote:
> Thanks Sebastian for your answer.
>
> This is my environment:
> I am using Nutch 1.12 and Solr 4.10.3 in local mode and always use the
> command bin/crawl for a complete cycle.
> For some reason all documents with the noindex meta are being indexed.
>
> I have tested bin/nutch index and the documents are indexed.
>
> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but
> the problem persists.
>
> It looks like Nutch never reads the property indexer.delete.robots.noindex
> in nutch-site.xml.
>
> I have read the configure method in the IndexerMapReduce.java class and it
> has a line for that property, but I don't understand why those documents
> are indexed.
>
> this.deleteRobotsNoIndex =
>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);  (line 97)
>
> Please, I really want to solve this situation; any advice or suggestion
> will be appreciated.
>
> ----- Original Message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Thursday, May 11, 2017 10:05:35
> Subject: [MASSMAIL]Re: problems with documents with noindex meta
>
> Hi,
>
> the indexing job ("bin/nutch index") will delete this document.
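Sebastian's point about combining the two switches can be sketched as a small decision function. This is a hypothetical simplification, not the actual plugin code: per the thread, whether a "delete" action for a robots=noindex document actually reaches the backend depends both on the noindex flag and, for some writers such as indexer-solr, on -deleteGone being given.

```java
public class DeleteActionSketch {
    // Hypothetical model: does a robots=noindex document get deleted in the
    // backend index, given the two switches and the plugin's behavior?
    static boolean deleted(boolean deleteRobotsNoIndex,
                           boolean deleteGone,
                           boolean pluginNeedsDeleteGone) {
        if (!deleteRobotsNoIndex) {
            return false; // no delete action is ever emitted for noindex docs
        }
        // Some writers (indexer-solr, per the thread) only act on deletes
        // when -deleteGone is given; others (indexer-elastic) act regardless.
        return pluginNeedsDeleteGone ? deleteGone : true;
    }

    public static void main(String[] args) {
        // indexer-solr without -deleteGone: no deletion -- the surprising case
        System.out.println(deleted(true, false, true));
        // indexer-solr with -deleteGone: deletion happens
        System.out.println(deleted(true, true, true));
        // indexer-elastic without -deleteGone: deletion still happens
        System.out.println(deleted(true, false, false));
    }
}
```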
> But it looks like "bin/nutch indexchecker -DdoIndex=true"
> does not (cf. NUTCH-1758).
>
> Please note that "bin/nutch parsechecker" or "indexchecker" without
> "doIndex" will not send anything to the index.
>
> Best,
> Sebastian
>
>
> On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
>> Hi all.
>> I need some help with this problem; sorry if it is a trivial thing.
>> I have a little problem with some URLs that have the noindex meta and are
>> being indexed.
>>
>> For example this URL:
>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>
>> has the noindex meta, and for some reason it is not deleted:
>> <meta name="robots" content="noindex,follow"/>
>>
>> I have read that Nutch should delete this document at indexing time, but
>> that is not happening.
>>
>> <property>
>>   <name>indexer.delete.robots.noindex</name>
>>   <value>true</value>
>> </property>
>>
>> If I do a parsechecker, the output has an empty content field but the
>> document is not deleted:
>>
>> fetching: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> robots.txt whitelist not configured.
>> parsing: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> contentType: text/html
>> date : Wed May 10 14:21:36 CDT 2017
>> agent : cubbot
>> type : text/html
>> type : text
>> type : html
>> title : 3
>> url : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> content :
>> tstamp : Wed May 10 14:21:36 CDT 2017
>> domain : uci.cu
>> digest : 25ed6b1b7be4cbb69a3405f5efe2f8a2
>> host : humanos.uci.cu
>> name : 3
>> id : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> lang : es
>>
>> Please, any help or suggestion will be appreciated.
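For reference, the noindex detection itself comes down to inspecting the robots meta tag in the fetched HTML. Below is a rough, hypothetical sketch of such a check (Nutch's parse-html plugin does this via the parsed DOM, not a regex); the class and method names are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RobotsMetaSketch {
    // Hypothetical check: does the HTML carry a robots meta with "noindex"?
    static boolean hasNoIndex(String html) {
        Pattern p = Pattern.compile(
            "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() && m.group(1).toLowerCase().contains("noindex");
    }

    public static void main(String[] args) {
        // The meta tag from the problematic page: noindex is present.
        System.out.println(hasNoIndex(
            "<head><meta name=\"robots\" content=\"noindex,follow\"/></head>"));
        // A page that allows indexing: no noindex directive.
        System.out.println(hasNoIndex(
            "<head><meta name=\"robots\" content=\"index,follow\"/></head>"));
    }
}
```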

