Hi Michael,

that's actually due to a bug introduced with Nutch 1.12 and already fixed for Nutch 1.14, see https://issues.apache.org/jira/browse/NUTCH-2335
Thanks,
Sebastian

On 09/28/2017 07:26 PM, Michael Coffey wrote:
> If the Inject command does filtering, then the documentation should say so.
> The page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention
> any filtering or normalization. I find it very counter-intuitive that an
> injection operation would delete existing data.
>
> Should I edit that page? Can I?
>
>
> From: Markus Jelsma <markus.jel...@openindex.io>
> To: "user@nutch.apache.org" <user@nutch.apache.org>; User <user@nutch.apache.org>
> Sent: Thursday, September 28, 2017 2:06 AM
> Subject: RE: inject deletes urls from crawldb
>
> filters and/or normalizers come to mind!
>
>
> -----Original message-----
>> From: Michael Coffey <mcof...@yahoo.com.INVALID>
>> Sent: Thursday 28th September 2017 4:40
>> To: User <user@nutch.apache.org>
>> Subject: inject deletes urls from crawldb
>>
>> Perhaps my strangest question yet!
>> Why does Inject delete URLs from the crawldb, and how can I prevent it?
>> I was trying to add 2 new sites to an existing crawldb. According to
>> readdb stats, about 10% of my URLs disappeared in the process.
>>
>> (before injecting)
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 20047
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3465
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone): 402
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 779
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 65
>>
>> (after injecting)
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 19014
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3187
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone): 36
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 28
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 49
>>
>> My command line is like this:
>> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D
>> db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb
>> /crawls/$crawlspace/seeds_nbcnews.txt
>>
>> Does it apply urlfilters as it injects?
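[Editor's note] The behavior discussed above is that Inject ran the existing CrawlDb entries through the configured URL filters and normalizers, silently dropping any entry that failed. As a rough illustration of why entries can disappear, here is a minimal sketch of the first-match-wins "+"/"-" rule evaluation used by regex-urlfilter-style filters. The rule patterns, URLs, and the `passes_filters` helper are hypothetical examples, not Nutch code:

```python
import re

# Hypothetical regex-urlfilter.txt-style rules. As in Nutch's RegexURLFilter,
# the first matching rule wins: '+' keeps the URL, '-' drops it, and a URL
# matching no rule is dropped. These example patterns are illustrative only.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),               # skip non-HTML resources
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*nbcnews\.com/")),  # allow one host
]

def passes_filters(url):
    """Return True if the URL survives the filter chain."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched -> filtered out

# A toy "CrawlDb": with the buggy Inject, every existing entry was re-filtered,
# so URLs failing the current rules vanished on update.
crawldb = [
    "https://www.nbcnews.com/politics",
    "https://www.nbcnews.com/logo.png",
    "https://example.org/page.html",
]
kept = [u for u in crawldb if passes_filters(u)]
print(kept)
```

Run against filter rules that don't cover every host already in the CrawlDb, this is exactly the kind of shrinkage the readdb stats above show.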