If the Inject command does filtering, then the documentation should say so. The
page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention any
filtering or normalization. I find it very counter-intuitive that an injection
operation would delete existing data.
Should I edit that page? Can I?
From: Markus Jelsma <[email protected]>
To: "[email protected]" <[email protected]>; User
<[email protected]>
Sent: Thursday, September 28, 2017 2:06 AM
Subject: RE: inject deletes urls from crawldb
filters and/or normalizers come to mind!
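[Editor's note, for the archives: in Nutch 1.x, URL filtering is driven by rule files under conf/, most commonly regex-urlfilter.txt. Each line starting with `-` rejects matching URLs and each line starting with `+` accepts them, with the first matching rule winning. A URL that no `+` rule accepts is silently dropped, which is one way records can vanish at inject time. The rules below are purely illustrative, not the shipped defaults:]

```
# regex-urlfilter.txt (illustrative sketch, not the stock rules):
# skip common static-asset extensions
-\.(gif|jpg|png|css|js)$
# skip URLs containing session-style characters
-[?*!@=]
# accept this hypothetical site
+^https?://([a-z0-9.-]+\.)?nbcnews\.com/
# reject everything else (URLs matching no '+' rule are dropped anyway)
-.
```

[Any URL rejected here never reaches, or does not survive in, the crawldb, so it pays to test the rules against your seed list before injecting.]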
-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Thursday 28th September 2017 4:40
> To: User <[email protected]>
> Subject: inject deletes urls from crawldb
>
> Perhaps my strangest question yet!
> Why does Inject delete URLs from the crawldb and how can I prevent it?
> I was trying to add 2 new sites to an existing crawldb. According to readdb
> stats, about 10% of my URLs disappeared in the process.
>
> (before injecting)
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 20047
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3465
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone): 402
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 779
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 65
> (after injecting)
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 19014
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3187
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone): 36
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 28
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 49
> My command line is like this:
> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D
> db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb
> /crawls/$crawlspace/seeds_nbcnews.txt
> Does it apply urlfilters as it injects?
>
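[Editor's note: one way to pin down exactly which records disappeared is to dump the crawldb URL keys before and after injecting and diff the two sorted lists. The `readdb -dump` step below is commented out and its paths are hypothetical; check the flags against your Nutch version. The stand-in URL lists just demonstrate the diff itself:]

```shell
# Real dumps would come from something like (hypothetical paths):
#   $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/.../crawldb -dump before_dir
#   ...inject...
#   $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/.../crawldb -dump after_dir
# then extract the URL column with: cut -f1 before_dir/part-* | sort -u

# Stand-in URL lists for illustration:
printf 'http://a.example/\nhttp://b.example/?sid=1\nhttp://c.example/\n' \
  | sort -u > before.txt
printf 'http://a.example/\nhttp://c.example/\n' | sort -u > after.txt

# comm -23 prints lines present only in the first (sorted) file,
# i.e. the URLs that existed before injecting but not after:
comm -23 before.txt after.txt > deleted_urls.txt
cat deleted_urls.txt
```

[If the missing URLs share a pattern, e.g. all contain `?` or `=`, that points straight at a reject rule in the filter or normalizer configuration.]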