Check this linked wiki page:

http://wiki.apache.org/nutch/NewScoringIndexingExample
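That page covers reading the LinkRank output. As a hedged sketch (tool and flag names as I recall them from Nutch 1.3/1.4 — check `bin/nutch` in your version), the NodeDumper tool prints the per-URL scores stored in the nodes database as plain text, which avoids poking at the binary MapFiles directly:

```shell
# Dump the top-scoring URLs from the webgraph nodes database as
# plain-text "url <TAB> score" lines (flags may differ per version).
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper \
  -webgraphdb crawl/webgraphdb \
  -scores -topn 100 \
  -output crawl/webgraphdb/dump/scores

# Inspect the result.
hadoop fs -cat crawl/webgraphdb/dump/scores/part-00000 | head
```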


On Friday 23 September 2011 12:51:37 Thomas Anderson wrote:
> I solved this problem by re-running it several times.
> 
> nutch generate /path/to/crawldb /path/to/segments -topN 1000
> nutch fetch crawl/to/segments/folder
> nutch updatedb crawldb crawl/to/segments/folder
> nutch parse crawl/to/segments/folder
> 
> nutch generate /path/to/crawldb /path/to/segments -topN 1000
> nutch fetch crawl/to/segments/folder
> nutch updatedb crawldb crawl/to/segments/folder
> nutch parse crawl/to/segments/folder
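The repeated round above can be scripted; a minimal sketch, assuming a local `bin/nutch`, placeholder paths, and a fetcher that does not parse (in which case parse is typically run before updatedb so the parse data exists when the crawldb is updated):

```shell
# One generate/fetch/parse/updatedb round, repeated twice.
# Paths are placeholders; adjust crawldb and segments to your layout.
for round in 1 2; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  seg=$(ls -d crawl/segments/* | tail -1)   # newest (timestamped) segment
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
  bin/nutch updatedb crawl/crawldb "$seg"
done
```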
> 
> But now a new question arises. How can I check/see the scores
> calculated by LinkRank?
> 
> http://wiki.apache.org/nutch/NewScoring#LinkRank states that scores are
> stored in the node database. But when I check path/to/webgraphdb/nodes,
> it contains only some URLs and seemingly random binary values:
> 
> a�'http://xoom.myblog.it/>[a�K;:http://xyz.freeweblogger.com/stats/r/rgm
> decorazionievetri/>�(http://yeniasya.com.tr/>)http://yenimesaj.com.
> tr/>)http://yenisafak.com.tr/>@0/http://ync.ne.jp:8080/cms/html/171
> 10255567.html>[a�,http://youtu.be/IGEhSOsmq50>�0
> http://youtube.com/justthirdway>:��%http://yukichika.jp/>�.https://
> ehrincentives.cms.gov>/�0
> 
>  Is this the right place to check? Or where should I check the output?
> 
> Thanks.
> 
> On Fri, Sep 23, 2011 at 1:55 PM, Thomas Anderson
> 
> <[email protected]> wrote:
> > I re-crawl from the injection stage, but it still throws
> > `webgraph.LinkRank: LinkAnalysis: java.io.IOException: No links to
> > process, is the webgraph empty?'
> > 
> > Checking the source shows that HDFS will read from
> > numLinksPath/part-00000, where numLinksPath is constructed as
> > webGraphDb/NUM_NODES and NUM_NODES is "_num_nodes_". However,
> > listing the HDFS content under webgraphdb, no such path
> > exists.
> > 
> > drwxr-xr-x   - crawler supergroup          0 2011-09-23 13:17
> > /crawl/webgraphdb/inlinks
> > drwxr-xr-x   - crawler supergroup          0 2011-09-23 13:27
> > /crawl/webgraphdb/linkrank
> > drwxr-xr-x   - crawler supergroup          0 2011-09-23 13:25
> > /crawl/webgraphdb/loops
> > drwxr-xr-x   - crawler supergroup          0 2011-09-23 13:18
> > /crawl/webgraphdb/nodes
> > drwxr-xr-x   - crawler supergroup          0 2011-09-23 13:16
> > /crawl/webgraphdb/outlinks
> > drwxr-xr-x   - crawler supergroup          0 2011-09-23 13:24
> > /crawl/webgraphdb/routes
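For what it's worth, the `_num_nodes_` output only appears once the LinkRank counter job has nodes to count, and the nodes come from building the webgraph over parsed segments first. A hedged sketch (class name from Nutch 1.3; the segment path is a placeholder, and `-segment` may be `-segmentDir` in other versions):

```shell
# Build (or rebuild) the webgraph from a parsed segment before LinkRank.
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
  -webgraphdb crawl/webgraphdb \
  -segment crawl/segments/YYYYMMDDHHMMSS   # placeholder segment dir

# Then run the link analysis over the populated webgraph.
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank \
  -webgraphdb crawl/webgraphdb
```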
> > 
> > How can I examine which property values are used when Nutch runs?
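The link.* overrides live in plain Hadoop configuration XML, so one quick check is to parse the file and list its name/value pairs. A self-contained sketch using only the Python standard library (the inline XML stands in for a real nutch-site.xml; this does not reproduce Hadoop's default-merging logic):

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of conf/nutch-site.xml (hypothetical values).
NUTCH_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>link.ignore.internal.host</name>
    <value>true</value>
  </property>
  <property>
    <name>link.ignore.limit.page</name>
    <value>true</value>
  </property>
</configuration>"""

def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

props = read_properties(NUTCH_SITE)
for name, value in sorted(props.items()):
    print(name, "=", value)
```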
> > 
> > Thanks
> > 
> > 
> > On Thu, Sep 22, 2011 at 5:26 PM, lewis john mcgibbney
> > 
> > <[email protected]> wrote:
> >> Hi Thomas,
> >> 
> >> After adding the properties as you mentioned, did you re-start at the
> >> injecting stage or did you just use the webgraph class? If the latter,
> >> then I would try re-starting the whole process, maybe even verifying
> >> that your crawldb is read correctly before executing the webgraph class.
> >> 
> >> Just a quick note on this one: Markus (I think) added the webgraph
> >> commands to the nutch script, which makes for a simpler working
> >> environment from 1.4 onwards.
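With those script additions, the whole pipeline reads roughly like this (subcommand names as added around 1.4; run `bin/nutch` with no arguments to see the exact list in your copy):

```shell
# Webgraph pipeline via the nutch wrapper script (Nutch 1.4+).
bin/nutch webgraph -webgraphdb crawl/webgraphdb -segmentDir crawl/segments
bin/nutch linkrank -webgraphdb crawl/webgraphdb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
bin/nutch nodedumper -webgraphdb crawl/webgraphdb -scores -topn 100 \
  -output crawl/dump/scores
```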
> >> 
> >> On Thu, Sep 22, 2011 at 7:53 AM, Thomas Anderson
> >> 
> >> <[email protected]>wrote:
> >>> I followed the example tutorial at
> >>> http://wiki.apache.org/nutch/NewScoringIndexingExample. Nearly every
> >>> command executes well except the LinkRank command.
> >>> 
> >>> When executing the LinkRank command `nutch
> >>> org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb
> >>> crawl/webgraphdb/`, it throws the following exception.
> >>> 
> >>> 11/09/22 14:44:56 FATAL webgraph.LinkRank: LinkAnalysis:
> >>> java.io.IOException: No links to process, is the webgraph empty?
> >>>        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:131)
> >>>        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:610)
> >>>        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:686)
> >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:656)
> >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>> 
> >>> At the beginning I had not added the following properties to
> >>> hadoop/conf/nutch-site.xml:
> >>> 
> >>> <!-- linkrank scoring properties -->
> >>> <property>
> >>>  <name>link.ignore.internal.host</name>
> >>>  <value>true</value>
> >>>  <description>Ignore outlinks to the same hostname.</description>
> >>> </property>
> >>> 
> >>> <property>
> >>>  <name>link.ignore.internal.domain</name>
> >>>  <value>true</value>
> >>>  <description>Ignore outlinks to the same domain.</description>
> >>> </property>
> >>> 
> >>> <property>
> >>>  <name>link.ignore.limit.page</name>
> >>>  <value>true</value>
> >>>  <description>Limit to only a single outlink to the same
> >>> page.</description>
> >>> </property>
> >>> 
> >>> <property>
> >>>  <name>link.ignore.limit.domain</name>
> >>>  <value>true</value>
> >>>  <description>Limit to only a single outlink to the same
> >>> domain.</description>
> >>> </property>
> >>> 
> >>> But even after adding those properties, the exception remains.
> >>> What may cause such an error?
> >>> 
> >>> Environment: java "1.6.0_26", debian with 2.6.39-2-686-pae kernel,
> >>> nutch 1.3, hadoop 0.20.2
> >>> 
> >>> Thanks
> >> 
> >> --
> >> *Lewis*

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
