Thanks Sebastian. I have opened a Jira issue for that:
https://issues.apache.org/jira/browse/NUTCH-2387

Do you think that the responsibility for deleting documents with the robots
noindex meta belongs to the MapReduce class or to indexing filters like
index-basic or index-more?

----- Original Message -----
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Sent: Thursday, May 18, 2017 11:45:43
Subject: Re: [MASSMAIL]Re: problems with documents with noindex meta

Hi,

sorry for the late answer...

> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but
> the problem persists.

That's expected, as indexchecker does not support deletion by robots meta.
Could you open a Jira issue to fix this? Thanks!

> It looks like Nutch never reads the property indexer.delete.robots.noindex
> in nutch-site.xml

The indexer job (IndexerMapReduce.java) does ...

> I have read the configure method in the IndexerMapReduce.java class and it
> has a line for that property, but I don't understand why those documents
> are indexed.
>
> this.deleteRobotsNoIndex =
>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);  (line 97)

Ok, and it should work (tested with 1.13-SNAPSHOT):

% cat > urls.txt
https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
^C
% nutch inject crawldb urls.txt
...
Injector: Total new urls injected: 1
Injector: finished at 2017-05-18 17:31:16, elapsed: 00:00:01
% nutch generate crawldb segments
...
% nutch fetch segments/20170518173127
...
fetching https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/ (queue crawl delay=5000ms)
...
Fetcher: finished at 2017-05-18 17:31:42, elapsed: 00:00:07
% nutch parse segments/20170518173127
...
% nutch updatedb crawldb/ segments/20170518173127
...
% nutch index -Dindexer.delete.robots.noindex=true \
    -Dplugin.includes=indexer-dummy -Ddummy.path=index.txt \
    crawldb/ segments/20170518173127/ -deleteGone
Segment dir is complete: segments/20170518173127.
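The property lookup discussed above can be pictured with a minimal, self-contained sketch. This is plain Java, not the actual Hadoop Configuration class, and `getBoolean` here is a hypothetical stand-in for `job.getBoolean(key, default)`: unless indexer.delete.robots.noindex is set to true in nutch-site.xml or via -D on the command line, the flag keeps its false default and no noindex deletions are attempted.

```java
import java.util.HashMap;
import java.util.Map;

public class NoIndexFlagSketch {
    // Hypothetical stand-in for Hadoop's Configuration.getBoolean(key, default):
    // return the default when the key is absent, otherwise parse the value.
    static boolean getBoolean(Map<String, String> conf, String key, boolean def) {
        String v = conf.get(key);
        return (v == null) ? def : Boolean.parseBoolean(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Property not set anywhere: the indexer keeps noindex documents.
        boolean deleteRobotsNoIndex =
            getBoolean(conf, "indexer.delete.robots.noindex", false);
        System.out.println(deleteRobotsNoIndex); // false

        // Property set via -D or nutch-site.xml: deletions become possible.
        conf.put("indexer.delete.robots.noindex", "true");
        deleteRobotsNoIndex =
            getBoolean(conf, "indexer.delete.robots.noindex", false);
        System.out.println(deleteRobotsNoIndex); // true
    }
}
```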
Indexer: starting at 2017-05-18 17:38:52
Indexer: deleting gone documents: true
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
DummyIndexWriter
    dummy.path : Path of the file to write to (mandatory)

Indexer: number of documents indexed, deleted, or skipped:
Indexer:      1  deleted (robots=noindex)
Indexer: finished at 2017-05-18 17:38:53, elapsed: 00:00:01
% cat index.txt
delete  https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/

Did you use -Dindexer.delete.robots.noindex=true in combination with
-deleteGone? Otherwise no "delete" actions are performed. That's not really
clear, and it's also not handled the same way by all indexer plugins:
indexer-solr does not delete without -deleteGone, but indexer-elastic does.

Best,
Sebastian

On 05/18/2017 02:43 PM, Eyeris Rodriguez Rueda wrote:
> Thanks Sebastian for your answer.
>
> This is my environment:
> I am using Nutch 1.12 and Solr 4.10.3 in local mode and always use the
> command bin/crawl for a complete cycle.
> For some reason all documents with the noindex meta are being indexed.
>
> I have tested bin/nutch index and the documents are indexed.
>
> I have tested bin/nutch parsechecker and indexchecker with doIndex=true but
> the problem persists.
>
> It looks like Nutch never reads the property indexer.delete.robots.noindex
> in nutch-site.xml.
>
> I have read the configure method in the IndexerMapReduce.java class and it
> has a line for that property, but I don't understand why those documents
> are indexed.
>
> this.deleteRobotsNoIndex =
>     job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);  (line 97)
>
> Please, I really want to solve this situation; any advice or suggestion
> will be appreciated.
>
> ----- Original Message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Thursday, May 11, 2017 10:05:35
> Subject: [MASSMAIL]Re: problems with documents with noindex meta
>
> Hi,
>
> the indexing job ("bin/nutch index") will delete this document.
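Sebastian's point about combining the two switches can be sketched as a small decision function. This is a hypothetical simplification, not the actual plugin code: per the thread, whether a "delete" action for a robots=noindex document actually reaches the backend depends both on the noindex flag and, for some writers such as indexer-solr, on -deleteGone being given.

```java
public class DeleteActionSketch {
    // Hypothetical model: does a robots=noindex document get deleted in the
    // backend index, given the two switches and the plugin's behavior?
    static boolean deleted(boolean deleteRobotsNoIndex,
                           boolean deleteGone,
                           boolean pluginNeedsDeleteGone) {
        if (!deleteRobotsNoIndex) {
            return false; // no delete action is ever emitted for noindex docs
        }
        // Some writers (indexer-solr, per the thread) only act on deletes
        // when -deleteGone is given; others (indexer-elastic) act regardless.
        return pluginNeedsDeleteGone ? deleteGone : true;
    }

    public static void main(String[] args) {
        // indexer-solr without -deleteGone: no deletion -- the surprising case
        System.out.println(deleted(true, false, true));
        // indexer-solr with -deleteGone: deletion happens
        System.out.println(deleted(true, true, true));
        // indexer-elastic without -deleteGone: deletion still happens
        System.out.println(deleted(true, false, false));
    }
}
```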
> But it looks like "bin/nutch indexchecker -DdoIndex=true"
> does not (cf. NUTCH-1758).
>
> Please note that "bin/nutch parsechecker" or "indexchecker" without
> "doIndex" will not send anything to the index.
>
> Best,
> Sebastian
>
>
> On 05/10/2017 09:00 PM, Eyeris Rodriguez Rueda wrote:
>> Hi all.
>> I need some help with this problem; sorry if it is a trivial thing.
>> I have a little problem with some URLs that have the noindex meta and are
>> being indexed.
>>
>> For example this URL:
>> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>>
>> has the noindex meta, and for some reason it is not deleted:
>> <meta name="robots" content="noindex,follow"/>
>>
>> I have read that Nutch should delete this document at indexing time, but
>> that is not happening.
>>
>> <property>
>>   <name>indexer.delete.robots.noindex</name>
>>   <value>true</value>
>> </property>
>>
>> If I do a parsechecker, the output has an empty content field but the
>> document is not deleted:
>>
>> fetching: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> robots.txt whitelist not configured.
>> parsing: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> contentType: text/html
>> date : Wed May 10 14:21:36 CDT 2017
>> agent : cubbot
>> type : text/html
>> type : text
>> type : html
>> title : 3
>> url : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> content :
>> tstamp : Wed May 10 14:21:36 CDT 2017
>> domain : uci.cu
>> digest : 25ed6b1b7be4cbb69a3405f5efe2f8a2
>> host : humanos.uci.cu
>> name : 3
>> id : https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
>> lang : es
>>
>> Please, any help or suggestion will be appreciated.
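For reference, the noindex detection itself comes down to inspecting the robots meta tag in the fetched HTML. Below is a rough, hypothetical sketch of such a check (Nutch's parse-html plugin does this via the parsed DOM, not a regex); the class and method names are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RobotsMetaSketch {
    // Hypothetical check: does the HTML carry a robots meta with "noindex"?
    static boolean hasNoIndex(String html) {
        Pattern p = Pattern.compile(
            "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() && m.group(1).toLowerCase().contains("noindex");
    }

    public static void main(String[] args) {
        // The meta tag from the problematic page: noindex is present.
        System.out.println(hasNoIndex(
            "<head><meta name=\"robots\" content=\"noindex,follow\"/></head>"));
        // A page that allows indexing: no noindex directive.
        System.out.println(hasNoIndex(
            "<head><meta name=\"robots\" content=\"index,follow\"/></head>"));
    }
}
```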

