Hi Manish,

On Fri, Aug 26, 2016 at 2:16 PM, <[email protected]> wrote:

>
> From: Manish Verma <[email protected]>
> To: [email protected]
> Cc:
> Date: Fri, 26 Aug 2016 14:16:49 -0700
> Subject: Pull All URL List
> Hi,
>
> Using Nutch 1.12, is there any way to get the URLs referring to a given URL?


You have not said at which stage of the crawl cycle you wish to do this, so
I am offering one possible mechanism, which sits within the IndexingFilter
http://nutch.apache.org/apidocs/apidocs-1.12/index.html?org/apache/nutch/indexer/IndexingFilter.html
If you @Override the following filter method, you will have access to the
Inlinks of the document being indexed:

filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
<http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29>

That said, this covers only one record at one point in time. It does not
account for every other record which may also, at some point in time, have
an inlink to the record you are currently processing and about to index.
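
As a rough illustration of what the body of such a filter might do with the
inlinks, here is a minimal sketch. So that it compiles without Nutch on the
classpath, it uses a small stand-in Inlink class that mirrors only the
getFromUrl() and getAnchor() accessors of org.apache.nutch.crawl.Inlink; the
class name InlinkSources and the helper fromUrls are hypothetical names made
up for this example, not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class InlinkSources {

  /** Stand-in for org.apache.nutch.crawl.Inlink (assumption: illustration only). */
  public static final class Inlink {
    private final String fromUrl;
    private final String anchor;

    public Inlink(String fromUrl, String anchor) {
      this.fromUrl = fromUrl;
      this.anchor = anchor;
    }

    public String getFromUrl() { return fromUrl; }

    public String getAnchor() { return anchor; }
  }

  /**
   * Collects the distinct source URLs of the given inlinks, in first-seen order.
   * Inside a real filter(...) you would pass inlinks.iterator(), after checking
   * that the Inlinks argument itself is not null.
   */
  public static List<String> fromUrls(Iterator<Inlink> inlinks) {
    Set<String> seen = new LinkedHashSet<>();
    if (inlinks != null) {
      while (inlinks.hasNext()) {
        seen.add(inlinks.next().getFromUrl());
      }
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    // Hypothetical sample data: two pages link to the current record, one of them twice.
    List<Inlink> sample = Arrays.asList(
        new Inlink("http://a.example/", "first anchor"),
        new Inlink("http://b.example/", "second anchor"),
        new Inlink("http://a.example/", "third anchor"));
    for (String from : fromUrls(sample.iterator())) {
      System.out.println(from);
    }
  }
}
```

In an actual IndexingFilter you would then add each collected URL to the
NutchDocument, for example with doc.add("inlinks", fromUrl), where the field
name "inlinks" is only an assumption for this sketch, and return the doc.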


> Also, can we pull the list of all URLs crawled by Nutch, irrespective of
> status code?
>
>
Do you mean every URL in the CrawlDB? If so, I would advise you to take a
look at the following documentation
http://wiki.apache.org/nutch/bin/nutch%20readdb
You can also run $NUTCH_HOME/runtime/local/bin/nutch readdb to get a feel
for the kinds of Jexl expressions and regexes you can use to filter the
results.
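
If you dump the CrawlDB to plain text (readdb <crawldb> -dump <out_dir>), a
small program can pull out every URL together with its status. The sketch
below assumes the default text layout, in which each record starts with a
line of the form "<url><TAB>Version: N" followed by a "Status: ..." line;
please check that against your own dump before relying on it. The class name
CrawlDbDumpUrls and the sample records are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class CrawlDbDumpUrls {

  // Assumption: the URL is followed by a tab and "Version: " on the first line of a record.
  private static final String RECORD_MARKER = "\tVersion: ";
  // Assumption: the status appears on a later line of the record with this prefix.
  private static final String STATUS_PREFIX = "Status: ";

  /** Returns URL -> status text, in the order the records appear in the dump. */
  public static Map<String, String> parse(BufferedReader in) throws IOException {
    Map<String, String> urls = new LinkedHashMap<>();
    String current = null;
    String line;
    while ((line = in.readLine()) != null) {
      int marker = line.indexOf(RECORD_MARKER);
      if (marker > 0) {
        // Start of a new record: everything before the tab is the URL.
        current = line.substring(0, marker);
        urls.put(current, "unknown");
      } else if (current != null && line.startsWith(STATUS_PREFIX)) {
        // Attach the status to the record we are currently inside.
        urls.put(current, line.substring(STATUS_PREFIX.length()).trim());
        current = null;
      }
    }
    return urls;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical two-record dump; real dumps carry more fields per record.
    String sample =
        "http://a.example/\tVersion: 7\n"
            + "Status: 2 (db_fetched)\n"
            + "\n"
            + "http://b.example/\tVersion: 7\n"
            + "Status: 3 (db_gone)\n";
    Map<String, String> urls = parse(new BufferedReader(new StringReader(sample)));
    for (Map.Entry<String, String> e : urls.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}
```

For a real dump you would open the part files written under <out_dir> with a
BufferedReader instead of the in-memory sample.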
hth
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
