> I have opened a jira for that.
>
> https://issues.apache.org/jira/browse/NUTCH-2387

Thanks.

Strictly speaking, "not indexing" and "deleting" are different things.
Of course, if "indexer.delete.robots.noindex" is false even a single time, then
documents with robots=noindex make it into the index.
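For reference, deletion on robots=noindex is only triggered when the property is
enabled in the job configuration, typically via conf/nutch-site.xml (assuming a
standard nutch-site.xml layout):

```xml
<property>
  <name>indexer.delete.robots.noindex</name>
  <value>true</value>
</property>
```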

> Do you think that the responsibility of deleting documents with the noindex
> robots meta is for the mapreduce class or for indexing filters
> (index-basic or index-more)?

I think it's the responsibility of both
- IndexingJob / IndexerMapReduce and
- "indexer" plugins (implementing IndexWriter).
But there may be indexer plugins which do not support deletion of documents.
An "indexing filter" only adds index fields to indexed documents.
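As a rough illustration (a simplified model written for this mail, not the
actual Nutch source), the routing decision boils down to:

```java
public class NoindexRouting {

    /**
     * Simplified model of the decision in IndexerMapReduce: a page marked
     * robots=noindex is turned into a "delete" action only when the
     * indexer.delete.robots.noindex property is true; otherwise it is
     * indexed like any other page. Hypothetical helper, not Nutch code.
     */
    static String action(boolean robotsNoindex, boolean deleteRobotsNoIndex) {
        if (robotsNoindex && deleteRobotsNoIndex) {
            return "delete";  // emitted to IndexWriter plugins that support it
        }
        return "index";
    }

    public static void main(String[] args) {
        System.out.println(action(true, true));   // delete
        System.out.println(action(true, false));  // index: flag off, the noindex page slips in
    }
}
```

Whether the "delete" action actually removes the document then depends on the
individual IndexWriter plugin.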

Best,
Sebastian

On 05/18/2017 09:01 PM, Eyeris Rodriguez Rueda wrote:
> Thanks Sebastian.
> 
> I have opened a jira for that.
> 
> https://issues.apache.org/jira/browse/NUTCH-2387
> 
> Do you think that the responsibility of deleting documents with the noindex
> robots meta is for the mapreduce class or for indexing filters
> (index-basic or index-more)?
> 
> 
> 
> ----- Original Message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Thursday, May 18, 2017 11:45:43
> Subject: Re: [MASSMAIL]Re: problems with documents with noindex meta
> 
> Hi,
> 
> sorry for the late answer...
> 
>> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but 
>> the problem persists.
> 
> That's expected as indexchecker does not support deletion by robots meta.
> Could you open a Jira issue to fix this? Thanks!
> 
>> It looks like Nutch never reads the property indexer.delete.robots.noindex 
>> in nutch-site.xml.
> 
> The indexer job (IndexerMapReduce.java) does ...
> 
>> I have read the method configure in the IndexerMapReduce.java class and it
>> has a line for that property, but I don't understand why those documents
>> are indexed.
>>
>> this.deleteRobotsNoIndex =
>>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);   (line 97)
> 
> ok, and it should work (tested with 1.13-SNAPSHOT):
> 
> % cat > urls.txt
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> ^C
> 
> % nutch inject crawldb urls.txt
> ...
> Injector: Total new urls injected: 1
> Injector: finished at 2017-05-18 17:31:16, elapsed: 00:00:01
> 
> % nutch generate crawldb segments
> ...
> 
> % nutch fetch segments/20170518173127
> ...
> fetching https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/ 
> (queue crawl delay=5000ms)
> ...
> Fetcher: finished at 2017-05-18 17:31:42, elapsed: 00:00:07
> 
> % nutch parse segments/20170518173127
> ...
> 
> % nutch updatedb crawldb/ segments/20170518173127
> ...
> 
> % nutch index -Dindexer.delete.robots.noindex=true \
>     -Dplugin.includes=indexer-dummy -Ddummy.path=index.txt \
>     crawldb/ segments/20170518173127/ -deleteGone
> Segment dir is complete: segments/20170518173127.
> Indexer: starting at 2017-05-18 17:38:52
> Indexer: deleting gone documents: true
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> DummyIndexWriter
>         dummy.path : Path of the file to write to (mandatory)
> 
> 
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer:      1  deleted (robots=noindex)
> Indexer: finished at 2017-05-18 17:38:53, elapsed: 00:00:01
> 
> % cat index.txt
> delete  https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> 
> 
> Did you use -Dindexer.delete.robots.noindex=true in combination with 
> -deleteGone?
> Otherwise no "delete" actions are performed.
> That's not really clear and also not handled the same way by all indexer 
> plugins:
> indexer-solr deletes only with -deleteGone, while indexer-elastic deletes
> without it.
> 
> 
> Best,
> Sebastian
> 
> 
> On 05/18/2017 02:43 PM, Eyeris Rodriguez Rueda wrote:
>> Thanks Sebastian for your answer.
>>
>> This is my environment:
>> I am using Nutch 1.12 and Solr 4.10.3 in local mode and always use the
>> command bin/crawl for a complete cycle.
>> For some reason all documents with the noindex meta are being indexed.
>>
>> I have tested bin/nutch index and the documents are indexed.
>>
>> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but 
>> the problem persists.
>>
>> It looks like Nutch never reads the property indexer.delete.robots.noindex 
>> in nutch-site.xml.
>>
>> I have read the method configure in the IndexerMapReduce.java class and it
>> has a line for that property, but I don't understand why those documents
>> are indexed.
>>
>> this.deleteRobotsNoIndex =
>>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);   (line 97)
>>
>>
>> Please, I really want to solve this situation; any advice or suggestion
>> will be appreciated.
>>
>> ----- Original Message -----
>> From: "Sebastian Nagel" <[email protected]>
>> To: [email protected]
>> Sent: Thursday, May 11, 2017 10:05:35
>> Subject: [MASSMAIL]Re: problems with documents with noindex meta
>>
>> Hi,
>>
>> the indexing job ("bin/nutch index") will delete this document.
>> But it looks like "bin/nutch indexchecker -DdoIndex=true" does not
>> (cf. NUTCH-1758).
>>
>> Please, note that "bin/nutch parsechecker" or "indexchecker" without 
>> "doIndex"
>> will not send anything to the index.
>>
>> Best,
>> Sebastian
>>
>>
>> On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
>>> Hi all.
>>> I need some help with this problem; sorry if it is a trivial thing.
>>> I have a little problem: some URLs that have the noindex meta are being 
>>> indexed.
>>>
>>> For example this url:
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>>
>>> has the noindex meta and for some reason it is not deleted, even though it
>>> contains <meta name="robots" content="noindex,follow"/>.
>>>
>>> I have read that Nutch should delete this document at indexing time, but 
>>> that is not happening.
>>>
>>> <property>
>>>   <name>indexer.delete.robots.noindex</name>
>>>   <value>true</value>
>>> </property>
>>>
>>> If I do a parsechecker, the output has empty content but the document is 
>>> not deleted:
>>>
>>> fetching: 
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> robots.txt whitelist not configured.
>>> parsing: 
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> contentType: text/html
>>> date :        Wed May 10 14:21:36 CDT 2017
>>> agent :        cubbot
>>> type :        text/html
>>> type :        text
>>> type :        html
>>> title :        3
>>> url :        
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> content :        
>>> tstamp :        Wed May 10 14:21:36 CDT 2017
>>> domain :        uci.cu
>>> digest :        25ed6b1b7be4cbb69a3405f5efe2f8a2
>>> host :        humanos.uci.cu
>>> name :        3
>>> id :        
>>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>> lang :        es
>>>
>>> Please any help or suggestion will be appreciated.
>>
>>
>>
> 
> 
