Hi Michael,

that's actually due to a bug introduced with Nutch 1.12 and already fixed for Nutch 1.14, see https://issues.apache.org/jira/browse/NUTCH-2335
Thanks,
Sebastian

On 09/28/2017 07:26 PM, Michael Coffey wrote:
> If the Inject command does filtering, then the documentation should say so.
> The page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention
> any filtering or normalization. I find it very counter-intuitive that an
> injection operation would delete existing data.
>
> Should I edit that page? Can I?
>
>
> From: Markus Jelsma <markus.jel...@openindex.io>
> To: "user@nutch.apache.org" <user@nutch.apache.org>; User <user@nutch.apache.org>
> Sent: Thursday, September 28, 2017 2:06 AM
> Subject: RE: inject deletes urls from crawldb
>
> filters and/or normalizers come to mind!
>
>
> -----Original message-----
>> From: Michael Coffey <mcof...@yahoo.com.INVALID>
>> Sent: Thursday 28th September 2017 4:40
>> To: User <user@nutch.apache.org>
>> Subject: inject deletes urls from crawldb
>>
>> Perhaps my strangest question yet!
>> Why does Inject delete URLs from the crawldb, and how can I prevent it?
>> I was trying to add 2 new sites to an existing crawldb. According to
>> readdb stats, about 10% of my URLs disappeared in the process.
>>
>> (before injecting)
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 20047
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3465
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone): 402
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 779
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 65
>>
>> (after injecting)
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 19014
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3187
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone): 36
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 28
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 49
>>
>> My command line is like this:
>> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D
>> db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb
>> /crawls/$crawlspace/seeds_nbcnews.txt
>>
>> Does it apply urlfilters as it injects?
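[Editor's note] The behavior discussed above is that Inject ran the existing CrawlDb entries through the configured URL filters and normalizers, silently dropping any entry that failed. As a rough illustration of why entries can disappear, here is a minimal sketch of the first-match-wins "+"/"-" rule evaluation used by regex-urlfilter-style filters. The rule patterns, URLs, and the `passes_filters` helper are hypothetical examples, not Nutch code:

```python
import re

# Hypothetical regex-urlfilter.txt-style rules. As in Nutch's RegexURLFilter,
# the first matching rule wins: '+' keeps the URL, '-' drops it, and a URL
# matching no rule is dropped. These example patterns are illustrative only.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),               # skip non-HTML resources
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*nbcnews\.com/")),  # allow one host
]

def passes_filters(url):
    """Return True if the URL survives the filter chain."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched -> filtered out

# A toy "CrawlDb": with the buggy Inject, every existing entry was re-filtered,
# so URLs failing the current rules vanished on update.
crawldb = [
    "https://www.nbcnews.com/politics",
    "https://www.nbcnews.com/logo.png",
    "https://example.org/page.html",
]
kept = [u for u in crawldb if passes_filters(u)]
print(kept)
```

Run against filter rules that don't cover every host already in the CrawlDb, this is exactly the kind of shrinkage the readdb stats above show.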