> I have opened a jira for that.
>
> https://issues.apache.org/jira/browse/NUTCH-2387
Thanks. Strictly speaking, "not indexing" and "deleting" are different things.
Of course, if "indexer.delete.robots.noindex" is false during even a single
indexing run, then documents with robots=noindex make it into the index.

> Do you think that the responsibility of deleting documents with a noindex
> robots meta tag lies with the MapReduce class or with indexing filters
> (index-basic or index-more)?

I think it's the responsibility of both
- IndexingJob / IndexerMapReduce and
- "indexer" plugins (implementing IndexWriter)

But there may be indexer plugins which do not support deletion of documents.
An "indexing filter" only adds index fields to indexed documents.

Best,
Sebastian

On 05/18/2017 09:01 PM, Eyeris Rodriguez Rueda wrote:
> Thanks Sebastian.
>
> I have opened a jira for that.
>
> https://issues.apache.org/jira/browse/NUTCH-2387
>
> Do you think that the responsibility of deleting documents with a noindex
> robots meta tag lies with the MapReduce class or with indexing filters
> (index-basic or index-more)?
>
> ----- Original Message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Thursday, May 18, 2017 11:45:43
> Subject: Re: [MASSMAIL]Re: problems with documents with noindex meta
>
> Hi,
>
> sorry for the late answer...
>
>> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but
>> the problem persists.
>
> That's expected, as indexchecker does not support deletion by robots meta.
> Could you open a Jira issue to fix this? Thanks!
>
>> It looks like nutch never reads the property indexer.delete.robots.noindex
>> in nutch-site.xml
>
> The indexer job (IndexerMapReduce.java) does ...
>
>> I have read the method configure in the IndexerMapReduce.java class and it
>> has a line for that property, but I don't understand why those documents
>> are indexed.
>>
>> this.deleteRobotsNoIndex =
>>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);  (line 97)
>
> ok, and it should work (tested with 1.13-SNAPSHOT):
>
> % cat > urls.txt
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> ^C
>
> % nutch inject crawldb urls.txt
> ...
> Injector: Total new urls injected: 1
> Injector: finished at 2017-05-18 17:31:16, elapsed: 00:00:01
>
> % nutch generate crawldb segments
> ...
>
> % nutch fetch segments/20170518173127
> ...
> fetching https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>   (queue crawl delay=5000ms)
> ...
> Fetcher: finished at 2017-05-18 17:31:42, elapsed: 00:00:07
>
> % nutch parse segments/20170518173127
> ...
>
> % nutch updatedb crawldb/ segments/20170518173127
> ...
>
> % nutch index -Dindexer.delete.robots.noindex=true \
>     -Dplugin.includes=indexer-dummy -Ddummy.path=index.txt \
>     crawldb/ segments/20170518173127/ -deleteGone
> Segment dir is complete: segments/20170518173127.
> Indexer: starting at 2017-05-18 17:38:52
> Indexer: deleting gone documents: true
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> DummyIndexWriter
>     dummy.path : Path of the file to write to (mandatory)
>
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer:   1  deleted (robots=noindex)
> Indexer: finished at 2017-05-18 17:38:53, elapsed: 00:00:01
>
> % cat index.txt
> delete  https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>
> Did you use -Dindexer.delete.robots.noindex=true in combination with
> -deleteGone? Otherwise no "delete" actions are performed. That's not really
> clear and also not handled the same way by all indexer plugins: indexer-solr
> requires -deleteGone, but indexer-elastic deletes without it.
>
> Best,
> Sebastian
>
> On 05/18/2017 02:43 PM, Eyeris Rodriguez Rueda wrote:
>> Thanks Sebastian for your answer.
>>
>> This is my environment:
>> I am using nutch 1.12 and solr 4.10.3 in local mode, and I always use the
>> command bin/crawl for a complete cycle.
>> For some reason all documents with a noindex meta tag are being indexed.
>>
>> I have tested bin/nutch index and the documents are indexed.
>>
>> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but
>> the problem persists.
>>
>> It looks like nutch never reads the property indexer.delete.robots.noindex
>> in nutch-site.xml
>>
>> I have read the method configure in the IndexerMapReduce.java class and it
>> has a line for that property, but I don't understand why those documents
>> are indexed.
>>
>> this.deleteRobotsNoIndex =
>>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);  (line 97)
>>
>> Please, I really want to solve this situation; any advice or suggestion
>> will be appreciated.
>>
>> ----- Original Message -----
>> From: "Sebastian Nagel" <[email protected]>
>> To: [email protected]
>> Sent: Thursday, May 11, 2017 10:05:35
>> Subject: [MASSMAIL]Re: problems with documents with noindex meta
>>
>> Hi,
>>
>> the indexing job ("bin/nutch index") will delete this document.
>> But it looks like "bin/nutch indexchecker -DdoIndex=true" does not
>> (cf. NUTCH-1758).
>>
>> Please note that "bin/nutch parsechecker" or "indexchecker" without
>> "doIndex" will not send anything to the index.
>>
>> Best,
>> Sebastian
>>
>> On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
>>> Hi all.
>>> I need some help with this problem; sorry if it is a trivial thing.
>>> I have a little problem with some urls that have a noindex meta tag and
>>> are being indexed.
>>>
>>> For example this url:
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>>
>>> has the noindex meta tag, and for some reason it is not deleted:
>>> <meta name="robots" content="noindex,follow"/>
>>>
>>> I have read that nutch should delete this document at indexing time, and
>>> it is not occurring correctly.
>>>
>>> <property>
>>>   <name>indexer.delete.robots.noindex</name>
>>>   <value>true</value>
>>> </property>
>>>
>>> If I do a parsechecker the output has an empty content, but the document
>>> is not deleted:
>>>
>>> fetching:
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> robots.txt whitelist not configured.
>>> parsing:
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> contentType: text/html
>>> date    : Wed May 10 14:21:36 CDT 2017
>>> agent   : cubbot
>>> type    : text/html
>>> type    : text
>>> type    : html
>>> title   : 3
>>> url     : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> content :
>>> tstamp  : Wed May 10 14:21:36 CDT 2017
>>> domain  : uci.cu
>>> digest  : 25ed6b1b7be4cbb69a3405f5efe2f8a2
>>> host    : humanos.uci.cu
>>> name    : 3
>>> id      : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> lang    : es
>>>
>>> Please, any help or suggestion will be appreciated.
>>
>> ****************************************************
>> Text below is autogenerated
>> ****************************************************
>> La @universidad_uci es Fidel. Los jóvenes no fallaremos.
>> #HastaSiempreComandante
>> #HastalaVictoriaSiempre
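[Editor's note] To summarize the interplay Sebastian describes between the
indexer.delete.robots.noindex property and the -deleteGone switch, here is a
minimal sketch in plain Java. The class and method names are illustrative, not
Nutch code; only the property name, the -deleteGone flag, and the plugin
behavior (indexer-solr requiring -deleteGone, indexer-elastic deleting without
it) come from the thread itself.

```java
// Illustrative model of the delete decision described in the thread.
// All names here are hypothetical; only the flag semantics are from Nutch.
public class NoIndexDeleteModel {

    /** Does the indexer job emit a "delete" action for this page?
     *  Mirrors: this.deleteRobotsNoIndex =
     *      job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);
     *  Note the default is false, so noindex pages are indexed unless
     *  the property is explicitly enabled. */
    static boolean emitsDelete(boolean pageHasRobotsNoIndex,
                               boolean deleteRobotsNoIndex) {
        return pageHasRobotsNoIndex && deleteRobotsNoIndex;
    }

    /** Does a given writer actually perform the emitted delete?
     *  Per the thread, some writers (indexer-solr at the time) only
     *  delete when -deleteGone is also given; others (indexer-elastic)
     *  delete without it. */
    static boolean deletePerformed(boolean emitted,
                                   boolean deleteGone,
                                   boolean writerRequiresDeleteGone) {
        if (!emitted) return false;
        return !writerRequiresDeleteGone || deleteGone;
    }

    public static void main(String[] args) {
        boolean emitted = emitsDelete(true, true);
        // solr-like writer without -deleteGone: nothing is deleted
        System.out.println(deletePerformed(emitted, false, true));   // false
        // elastic-like writer without -deleteGone: delete happens
        System.out.println(deletePerformed(emitted, false, false));  // true
        // property left at its default (false): page gets indexed
        System.out.println(emitsDelete(true, false));                // false
    }
}
```

This also explains the symptom reported above: with the property at its
default (false), or with a writer that waits for -deleteGone, the noindex
page simply stays in the index.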

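[Editor's note] Sebastian's point that deletion is the job of "indexer"
plugins, not indexing filters, can be illustrated with a toy writer that
mimics the index.txt output of indexer-dummy shown in the demo above. This is
NOT Nutch's actual IndexWriter interface; it is a self-contained sketch whose
names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an indexer plugin: deletion must be an explicit writer
// operation. An indexing filter (index-basic, index-more) merely adds
// fields to documents and has no way to delete anything.
public class ToyIndexWriter {
    private final List<String> lines = new ArrayList<>();

    public void write(String url)  { lines.add("add\t" + url); }
    public void delete(String url) { lines.add("delete\t" + url); }

    public List<String> lines() { return lines; }

    public static void main(String[] args) {
        ToyIndexWriter w = new ToyIndexWriter();
        // A writer lacking a delete implementation would silently keep
        // noindex documents, which is the behavior reported in this thread.
        w.delete("https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/");
        w.lines().forEach(System.out::println);
    }
}
```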

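[Editor's note] For completeness, detecting the robots meta tag quoted in the
original report can be sketched as below. Nutch's HTML parser uses a proper
DOM to set the noindex flag, not a regex; this approximation (which also
ignores reversed attribute order) is only meant to show what the parser must
recognize before the indexer can delete the page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Approximate check for <meta name="robots" content="...noindex..."/>.
// Hypothetical helper, not Nutch code.
public class RobotsMetaCheck {

    private static final Pattern ROBOTS_META = Pattern.compile(
        "<meta\\s+[^>]*name\\s*=\\s*[\"']robots[\"'][^>]*" +
        "content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    static boolean hasNoIndex(String html) {
        Matcher m = ROBOTS_META.matcher(html);
        while (m.find()) {
            // The content attribute is a comma-separated directive list.
            for (String token : m.group(1).split(",")) {
                if (token.trim().equalsIgnoreCase("noindex")) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasNoIndex(
            "<meta name=\"robots\" content=\"noindex,follow\"/>")); // true
        System.out.println(hasNoIndex(
            "<meta name=\"robots\" content=\"index,follow\"/>"));   // false
    }
}
```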