Hi Manish,

On Fri, Aug 26, 2016 at 2:16 PM, <[email protected]> wrote:

> From: Manish Verma <[email protected]>
> To: [email protected]
> Cc:
> Date: Fri, 26 Aug 2016 14:16:49 -0700
> Subject: Pull All URL List
>
> Hi,
>
> Using nutch 1.12 is there any way to get urls referring to given url ?

That depends on the stage of the crawl cycle at which you wish to do this (which you've not mentioned). One possible mechanism is the IndexingFilter:

http://nutch.apache.org/apidocs/apidocs-1.12/index.html?org/apache/nutch/indexer/IndexingFilter.html

If you @Override the following filter method, you will have access to the Inlinks for the record:

filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29

The parameter types are documented here:

NutchDocument: http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/indexer/NutchDocument.html
Parse: http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/parse/Parse.html
Text: http://hadoop.apache.org/docs/r2.4.0/api/org/apache/hadoop/io/Text.html?is-external=true
CrawlDatum: http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/crawl/CrawlDatum.html
Inlinks: http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/crawl/Inlinks.html

That being said, this covers only one record at one point in time. It does not account for every other record which may also, at some point in time, have an inlink to the record you are currently processing and about to index.

> Also can we pull all url list crawled by nutch irrespective of status code
> ?

Do you mean every URL in the CrawlDB? If so, I would advise you to take a look at the following documentation:

http://wiki.apache.org/nutch/bin/nutch%20readdb

You can also run

$NUTCH_HOME/runtime/local/bin/nutch readdb

to see its options and get a feel for the type of Jexl and regex expressions you can use to filter your results.
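
On the inlinks caveat above, here is a toy sketch in plain Java of why the inlinks you see for a record are only a snapshot. It is not Nutch code: the class name InlinkSketch, the invert method and the example.com style URLs are all made up for illustration. It only assumes that inlinks are derived by inverting the outlinks of pages crawled so far, so the set of referrers for a URL grows as more pages are crawled.

```java
import java.util.*;

public class InlinkSketch {

    // Invert a "page -> outlinks" map into a "target -> referring pages" map.
    // Hypothetical helper for illustration; not part of the Nutch API.
    static Map<String, Set<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, Set<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                // Record the source page as a referrer of the target.
                inlinks.computeIfAbsent(target, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> crawled = new HashMap<>();

        // After the first round only page "a" has been crawled.
        crawled.put("http://a.example/", Arrays.asList("http://c.example/"));
        System.out.println(invert(crawled).get("http://c.example/"));

        // A later round crawls page "b", which also links to "c".
        crawled.put("http://b.example/", Arrays.asList("http://c.example/"));
        System.out.println(invert(crawled).get("http://c.example/"));
    }
}
```

The first line printed lists one referrer and the second lists two, which is the sense in which the Inlinks passed to the filter method reflect only what has been crawled up to that point.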
hth
Lewis

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney

