URL filters and/or normalizers come to mind! Inject applies both while it writes out the new crawldb, and depending on your version they can be applied not only to the seeds but also to the entries already in the db — so anything the filters reject (or the normalizers rewrite onto another key) disappears.
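If you want to confirm it, dump the crawldb before and after (bin/nutch readdb <crawldb> -dump <dir>), diff the URL lists, and run the missing ones through the checker tools. A rough sketch — the local runtime is assumed, missing_urls.txt is a hypothetical file holding the URLs that disappeared, and option names vary a bit between versions, so check the usage message first:

  # filterchecker reads URLs from stdin and prints "+<url>" if the
  # combined filter chain accepts a URL, "-<url>" if it rejects it
  cat missing_urls.txt | $NUTCH_HOME/runtime/local/bin/nutch filterchecker -allCombined

  # normalizerchecker prints each URL after the activated normalizers
  # have rewritten it (rewrites can merge two crawldb entries into one)
  cat missing_urls.txt | $NUTCH_HOME/runtime/local/bin/nutch normalizerchecker

Any URL that comes back with "-" from the filter chain will be dropped wherever the filters are applied to it.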
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Thursday 28th September 2017 4:40
> To: User <[email protected]>
> Subject: inject deletes urls from crawldb
>
> Perhaps my strangest question yet!
> Why does Inject delete URLs from the crawldb and how can I prevent it?
> I was trying to add 2 new sites to an existing crawldb. According to readdb
> stats, about 10% of my URLs disappeared in the process.
>
> (before injecting)
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 20047
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3465
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone): 402
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 779
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 65
>
> (after injecting)
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 19014
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3187
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone): 36
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 28
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 49
>
> My command line is like this
> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb /crawls/$crawlspace/seeds_nbcnews.txt
>
> Does it apply urlfilters as it injects?
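Regarding the last question: yes, by default inject runs the configured urlfilters and normalizers. If your regex-urlfilter.txt is restricted to particular hosts, any URL that no longer matches an allow rule gets rejected wherever the filters are applied. A sketch of such a restricted file — the host pattern is hypothetical; rules are tried top to bottom and the first match wins:

  # reject non-http schemes
  -^(file|ftp|mailto):
  # allow only the sites you crawl
  +^https?://([a-z0-9-]+\.)*nbcnews\.com/
  # reject everything else
  -.

With a catch-all "-." at the end, every host not explicitly allowed is filtered out, so make sure the rules still cover the URLs already in your crawldb before you inject. Newer 1.x releases also expose inject switches to skip filtering/normalization — run bin/nutch inject without arguments and check the usage message of your version.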

