Get source of gone links

Peter Viskup Fri, 06 Dec 2024 12:56:13 -0800

I'm trying to figure out how to get the URIs from which the "gone" URIs
were linked.
At the moment I have two domains crawled and indexed, and I was able to
identify 120 gone links.


~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb ${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -stats|grep gone
2024-12-06 21:52:43,333 INFO o.a.n.c.CrawlDbReader [main] status 3 (db_gone):   120

Then I generate a CSV export:
~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb ${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -dump ./dbdump -format csv

and then grep for "gone"
~$ grep gone dbdump/part-r-00000 |wc -l
120
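
To make that last step concrete, here is a rough sketch that pulls only the
URLs of the gone entries out of the dump instead of counting them. It assumes
the URL is the first comma-separated field of each CSV row, that the URL
itself contains no comma, and that the status name db_gone appears somewhere
on the row. gone_urls is a made-up helper name, not a Nutch command.

```shell
# Hypothetical helper (not part of Nutch): print the URL of every row in a
# readdb CSV dump whose line mentions db_gone.
# Assumption: the URL is the first comma-separated field, optionally quoted.
gone_urls() {
  # $1: path to a dump file, e.g. dbdump/part-r-00000
  awk -F',' '/db_gone/ { gsub(/"/, "", $1); print $1 }' "$1"
}
```

Something like `gone_urls dbdump/part-r-00000 > gone.txt` would then give a
plain list of the gone URLs, one per line, to look up referrers for.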

So how can I get the source URIs of those "gone" links?

Peter
