URL filters and/or normalizers come to mind! Depending on the Nutch version, inject runs not only the new seeds but also the existing CrawlDb entries through the configured URL normalizers and filters during the merge, and any entry that no longer passes the chain is silently dropped. Your numbers fit that picture: the total falls from 24849 to 22405, i.e. 2444 URLs (about 10%), with db_gone and the redirect statuses hit hardest.
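
A quick way to verify is to dump the CrawlDb and run the URLs through
the configured filter chain to see which ones get rejected. Something
like the sketch below (paths taken from your command; the checker flags
have changed between Nutch versions, so check bin/nutch's usage output
for yours):

  BIN=$NUTCH_HOME/runtime/deploy/bin/nutch
  CRAWLDB=/crawls/$crawlspace/data/crawldb

  # dump the CrawlDb as plain text (the first field of each record is the URL)
  $BIN readdb $CRAWLDB -dump /tmp/crawldb_dump -format normal

  # feed the URLs to the filter chain; rejected ones are reported with a '-'
  cat /tmp/crawldb_dump/part-* | grep '^http' | cut -f1 | $BIN filterchecker -stdin

If filtering during inject turns out to be the culprit, recent Injector
versions (if I remember the usage output correctly) also accept
-noFilter and -noNormalize switches to skip that step for a one-off run.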

 
 
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Thursday 28th September 2017 4:40
> To: User <[email protected]>
> Subject: inject deletes urls from crawldb
> 
> Perhaps my strangest question yet!
> Why does Inject delete URLs from the crawldb and how can I prevent it?
> I was trying to add 2 new sites to an existing crawldb. According to readdb 
> stats, about 10% of my URLs disappeared in the process.
> 
> (before injecting)
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls:              24849
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 20047
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched):    3465
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone):        402
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):  779
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   91
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    65
> 
> (after injecting)
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls:              22405
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 19014
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched):    3187
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone):         36
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   28
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   91
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    49
> My command line is like this:
> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 \
>     -D db.fetch.interval.default=3600 \
>     /crawls/$crawlspace/data/crawldb /crawls/$crawlspace/seeds_nbcnews.txt
> Does it apply urlfilters as it injects?
> 
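
To answer the last question directly: yes, inject runs the URL
normalizers and filters, and depending on the version that pass happens
while merging with the existing CrawlDb, which is exactly where entries
can get dropped. Before touching the real data again, you could
reproduce it on a copy (a sketch, assuming the CrawlDb lives on HDFS
since you run from runtime/deploy; crawldb_test is a scratch path of my
invention):

  BIN=$NUTCH_HOME/runtime/deploy/bin/nutch
  DATA=/crawls/$crawlspace/data

  # work on a copy so the real CrawlDb stays intact
  hadoop fs -cp $DATA/crawldb $DATA/crawldb_test

  # re-run the same inject against the copy ...
  $BIN inject -D db.score.injected=1 -D db.fetch.interval.default=3600 \
      $DATA/crawldb_test /crawls/$crawlspace/seeds_nbcnews.txt

  # ... and compare the stats; if the total drops again, the filter
  # chain is rejecting previously accepted URLs
  $BIN readdb $DATA/crawldb_test -stats

If your regex-urlfilter.txt (or whichever filters are enabled in
plugin.includes) changed since those URLs were first added, that would
explain the drop.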
