Thanks for the reply, Lewis. To be more specific, we want to know all in-links to 404 (db_gone, ...) pages or URLs. As for the stage: ideally once the crawl is over, or, if that is not feasible, at any stage.
-MV

> On Aug 26, 2016, at 3:14 PM, lewis john mcgibbney <[email protected]> wrote:
>
> Hi Manish,
>
> On Fri, Aug 26, 2016 at 2:16 PM, <[email protected]> wrote:
>>
>> From: Manish Verma <[email protected]>
>> To: [email protected]
>> Date: Fri, 26 Aug 2016 14:16:49 -0700
>> Subject: Pull All URL List
>>
>> Hi,
>>
>> Using Nutch 1.12, is there any way to get the URLs referring to a given URL?
>
> Depending on the stage within the crawl cycle at which you wish to do this
> (which you've not mentioned), one possible mechanism is within the
> IndexingFilter:
> http://nutch.apache.org/apidocs/apidocs-1.12/index.html?org/apache/nutch/indexer/IndexingFilter.html
> If you @Override the following filter method, you will have access to the Inlinks:
>
> filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
> http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29
>
> That being said, it is only for one record at one point in time and does
> not account for every other record which may also, at some point in time,
> have an inlink to the current record you are processing and about to index.
>
>> Also, can we pull the list of all URLs crawled by Nutch irrespective of
>> status code?
>
> Do you mean every URL in the CrawlDB?
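[For illustration only, a minimal sketch of the IndexingFilter route Lewis describes, assuming the Nutch 1.12 API. The class name `InlinkReportFilter` and the package are hypothetical; this logs each inlink source for pages whose CrawlDatum status is db_gone. It needs the Nutch and Hadoop jars on the classpath to compile, and the plugin would still have to be registered in plugin.xml and plugin.includes as usual.]

```java
package org.example.nutch; // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import java.util.Iterator;

/** Hypothetical filter: report inlinks pointing at gone (404) pages. */
public class InlinkReportFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Only report pages whose CrawlDb status is db_gone (e.g. HTTP 404).
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE && inlinks != null) {
      Iterator<Inlink> it = inlinks.iterator();
      while (it.hasNext()) {
        System.out.println(it.next().getFromUrl() + " -> " + url);
      }
    }
    return doc; // returning null would drop the document from indexing
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

[Note the caveat from the thread: gone pages are often skipped before indexing, so whether this filter ever sees them depends on your indexing job; a LinkDb-based approach after the crawl may be more reliable for this use case.]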
> If so, I would advise you to take a look at the following documentation:
> http://wiki.apache.org/nutch/bin/nutch%20readdb
> You can also run $NUTCH_HOME/runtime/local/bin/nutch readdb to get a feel
> for the type of Jexl and regex you can use to filter your results.
> hth
> Lewis
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
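[A sketch of post-processing a `readdb` dump to pull every URL regardless of status, e.g. after something like `bin/nutch readdb crawl/crawldb -dump dump -format csv`. The exact header and column layout of the dump are assumptions here, not taken from the thread; the sample lines are illustrative only.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: extract URLs from CSV-style CrawlDb dump lines. */
public class CrawlDbUrls {

  // The URL is assumed to be the first comma-separated field on each line.
  static List<String> extractUrls(List<String> dumpLines) {
    List<String> urls = new ArrayList<>();
    for (String line : dumpLines) {
      if (line.isEmpty() || line.startsWith("Url")) continue; // skip header
      urls.add(line.split(",")[0]);
    }
    return urls;
  }

  public static void main(String[] args) {
    // Illustrative sample, not real readdb output.
    List<String> sample = Arrays.asList(
        "Url,Status code,Status name",
        "http://example.com/,2,db_fetched",
        "http://example.com/missing,3,db_gone");
    System.out.println(extractUrls(sample));
  }
}
```

[From such a dump you can also keep only the db_gone rows, then join them against a LinkDb dump to answer the in-links question from the first message.]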

