I'm trying to figure out how to get the URIs from which the "gone" URIs were linked. At the moment I have two domains crawled and indexed, and I was able to identify 120 gone links.

~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb ${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -stats | grep gone
2024-12-06 21:52:43,333 INFO o.a.n.c.CrawlDbReader [main] status 3 (db_gone): 120

generate CSV export

~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb ${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -dump ./dbdump -format csv

and then grep for "gone"

~$ grep gone dbdump/part-r-00000 | wc -l
120

So how to get the source URIs of those "gones"?

Peter