If the Inject command does filtering, then the documentation should say so. The
page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention any
filtering or normalization. I find it very counter-intuitive that an injection
operation would delete existing data.
Should I edit that page? Can I?
From: Markus Jelsma <[email protected]>
To: "[email protected]" <[email protected]>; User
<[email protected]>
Sent: Thursday, September 28, 2017 2:06 AM
Subject: RE: inject deletes urls from crawldb
filters and/or normalizers come to mind!
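[Editor's note, for the archives: in Nutch 1.x, URL filtering is driven by rule files under conf/, most commonly regex-urlfilter.txt. Each line starting with `-` rejects matching URLs and each line starting with `+` accepts them, with the first matching rule winning. A URL that no `+` rule accepts is silently dropped, which is one way records can vanish at inject time. The rules below are purely illustrative, not the shipped defaults:]

```
# regex-urlfilter.txt (illustrative sketch, not the stock rules):
# skip common static-asset extensions
-\.(gif|jpg|png|css|js)$
# skip URLs containing session-style characters
-[?*!@=]
# accept this hypothetical site
+^https?://([a-z0-9.-]+\.)?nbcnews\.com/
# reject everything else (URLs matching no '+' rule are dropped anyway)
-.
```

[Any URL rejected here never reaches, or does not survive in, the crawldb, so it pays to test the rules against your seed list before injecting.]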
-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Thursday 28th September 2017 4:40
> To: User <[email protected]>
> Subject: inject deletes urls from crawldb
>
> Perhaps my strangest question yet!
> Why does Inject delete URLs from the crawldb and how can I prevent it?
> I was trying to add 2 new sites to an existing crawldb. According to readdb
> stats, about 10% of my URLs disappeared in the process.
>
> (before injecting)
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 20047
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3465
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone): 402
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 779
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 65
> (after injecting)
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 19014
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3187
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone): 36
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 28
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 49
> My command line is like this:
> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D
> db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb
> /crawls/$crawlspace/seeds_nbcnews.txt
> Does it apply urlfilters as it injects?
>
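[Editor's note: one way to pin down exactly which records disappeared is to dump the crawldb URL keys before and after injecting and diff the two sorted lists. The `readdb -dump` step below is commented out and its paths are hypothetical; check the flags against your Nutch version. The stand-in URL lists just demonstrate the diff itself:]

```shell
# Real dumps would come from something like (hypothetical paths):
#   $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/.../crawldb -dump before_dir
#   ...inject...
#   $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/.../crawldb -dump after_dir
# then extract the URL column with: cut -f1 before_dir/part-* | sort -u

# Stand-in URL lists for illustration:
printf 'http://a.example/\nhttp://b.example/?sid=1\nhttp://c.example/\n' \
  | sort -u > before.txt
printf 'http://a.example/\nhttp://c.example/\n' | sort -u > after.txt

# comm -23 prints lines present only in the first (sorted) file,
# i.e. the URLs that existed before injecting but not after:
comm -23 before.txt after.txt > deleted_urls.txt
cat deleted_urls.txt
```

[If the missing URLs share a pattern, e.g. all contain `?` or `=`, that points straight at a reject rule in the filter or normalizer configuration.]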