Thanks for the reply, Lewis. To be more specific, we want to know all in-links to 404 (db_gone, ...) pages or URLs. As for the stage: ideally once the crawl is over, or, if that is not feasible, at any stage.
-MV

> On Aug 26, 2016, at 3:14 PM, lewis john mcgibbney <[email protected]> wrote:
>
> Hi Manish,
>
> On Fri, Aug 26, 2016 at 2:16 PM, <[email protected]> wrote:
>>
>> From: Manish Verma <[email protected]>
>> To: [email protected]
>> Date: Fri, 26 Aug 2016 14:16:49 -0700
>> Subject: Pull All URL List
>>
>> Hi,
>>
>> Using Nutch 1.12, is there any way to get the URLs referring to a given URL?
>
> Depending on the stage within the crawl cycle at which you wish to do this
> (which you've not mentioned), one possible mechanism is within the
> IndexingFilter:
> http://nutch.apache.org/apidocs/apidocs-1.12/index.html?org/apache/nutch/indexer/IndexingFilter.html
> If you @Override the following filter method, you will have access to the Inlinks:
>
> filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
> http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29
>
> That being said, it is only for one record at one point in time and does
> not account for every other record which may also, at some point in time,
> have an inlink to the current record you are processing and about to index.
>
>> Also, can we pull the list of all URLs crawled by Nutch irrespective of
>> status code?
>
> Do you mean every URL in the CrawlDB?
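[For illustration only, a minimal sketch of the IndexingFilter route Lewis describes, assuming the Nutch 1.12 API. The class name `InlinkReportFilter` and the package are hypothetical; this logs each inlink source for pages whose CrawlDatum status is db_gone. It needs the Nutch and Hadoop jars on the classpath to compile, and the plugin would still have to be registered in plugin.xml and plugin.includes as usual.]

```java
package org.example.nutch; // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import java.util.Iterator;

/** Hypothetical filter: report inlinks pointing at gone (404) pages. */
public class InlinkReportFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Only report pages whose CrawlDb status is db_gone (e.g. HTTP 404).
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE && inlinks != null) {
      Iterator<Inlink> it = inlinks.iterator();
      while (it.hasNext()) {
        System.out.println(it.next().getFromUrl() + " -> " + url);
      }
    }
    return doc; // returning null would drop the document from indexing
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

[Note the caveat from the thread: gone pages are often skipped before indexing, so whether this filter ever sees them depends on your indexing job; a LinkDb-based approach after the crawl may be more reliable for this use case.]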
> If so, I would advise you to take a look at the following documentation:
> http://wiki.apache.org/nutch/bin/nutch%20readdb
> You can also run $NUTCH_HOME/runtime/local/bin/nutch readdb to get a feel
> for the type of Jexl and regex you can use to filter your results.
> hth
> Lewis
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
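[A sketch of post-processing a `readdb` dump to pull every URL regardless of status, e.g. after something like `bin/nutch readdb crawl/crawldb -dump dump -format csv`. The exact header and column layout of the dump are assumptions here, not taken from the thread; the sample lines are illustrative only.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: extract URLs from CSV-style CrawlDb dump lines. */
public class CrawlDbUrls {

  // The URL is assumed to be the first comma-separated field on each line.
  static List<String> extractUrls(List<String> dumpLines) {
    List<String> urls = new ArrayList<>();
    for (String line : dumpLines) {
      if (line.isEmpty() || line.startsWith("Url")) continue; // skip header
      urls.add(line.split(",")[0]);
    }
    return urls;
  }

  public static void main(String[] args) {
    // Illustrative sample, not real readdb output.
    List<String> sample = Arrays.asList(
        "Url,Status code,Status name",
        "http://example.com/,2,db_fetched",
        "http://example.com/missing,3,db_gone");
    System.out.println(extractUrls(sample));
  }
}
```

[From such a dump you can also keep only the db_gone rows, then join them against a LinkDb dump to answer the in-links question from the first message.]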

