I solved this problem by re-running the following commands several times:

nutch generate /path/to/crawldb /path/to/segments -topN 1000
nutch fetch crawl/to/segments/folder
nutch updatedb crawldb crawl/to/segments/folder
nutch parse crawl/to/segments/folder
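The repeated cycle can be wrapped in a small loop. This is only a sketch: the paths mirror the placeholder paths in the commands above, and picking the newest segment assumes segments are named by timestamp (as generate does by default), so adjust to your layout.

```shell
# Repeat the generate/fetch/updatedb/parse cycle a few times.
# CRAWLDB and SEGMENTS are placeholders, as in the commands above.
CRAWLDB=/path/to/crawldb
SEGMENTS=/path/to/segments

for i in 1 2 3; do
  nutch generate "$CRAWLDB" "$SEGMENTS" -topN 1000
  # generate creates a new timestamped segment; take the newest one
  segment=$(ls -d "$SEGMENTS"/* | sort | tail -1)
  nutch fetch "$segment"
  nutch updatedb "$CRAWLDB" "$segment"
  nutch parse "$segment"
done
```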
But now a new question arises: how can I check/see the scores calculated by LinkRank? http://wiki.apache.org/nutch/NewScoring#LinkRank states that scores are stored in the node database. But when checking path/to/webgraphdb/nodes, only some URLs interleaved with unreadable values are recorded:

a�'http://xoom.myblog.it/>[a�K;:http://xyz.freeweblogger.com/stats/r/rgmdecorazionievetri/>�(http://yeniasya.com.tr/>)http://yenimesaj.com.tr/>)http://yenisafak.com.tr/>@0/http://ync.ne.jp:8080/cms/html/17110255567.html>[a�,http://youtu.be/IGEhSOsmq50>�0 http://youtube.com/justthirdway>:��%http://yukichika.jp/>�.https://ehrincentives.cms.gov>/�0

Is this the right place to check? Or where can I find its output? Thanks.

On Fri, Sep 23, 2011 at 1:55 PM, Thomas Anderson <[email protected]> wrote:
> I re-crawled from the injection stage, but it still throws
> `webgraph.LinkRank: LinkAnalysis: java.io.IOException: No links to
> process, is the webgraph empty?'
>
> Checking the source shows that hdfs will read from
> numLinksPath/part-00000, where numLinksPath is constructed from
> webGraphDb/NUM_NODES and NUM_NODES is "_num_nodes_". However, when
> listing the hdfs content under webgraphdb, no such path exists:
>
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:17 /crawl/webgraphdb/inlinks
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:27 /crawl/webgraphdb/linkrank
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:25 /crawl/webgraphdb/loops
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:18 /crawl/webgraphdb/nodes
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:16 /crawl/webgraphdb/outlinks
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:24 /crawl/webgraphdb/routes
>
> How can I examine which property values are used when nutch runs?
>
> Thanks
>
> On Thu, Sep 22, 2011 at 5:26 PM, lewis john mcgibbney
> <[email protected]> wrote:
>> Hi Thomas,
>>
>> After adding the properties as you mentioned, did you re-start at the
>> injecting stage or did you just use the webgraph class? If the latter, then
>> I would try re-starting the whole process, maybe even checking you are
>> reading your crawldb on the way to executing the webgraph class.
>>
>> Just a quick note on this one: Markus (I think) added the webgraph commands
>> to the nutch script, so this creates a simpler working environment from 1.4
>> onwards.
>>
>> On Thu, Sep 22, 2011 at 7:53 AM, Thomas Anderson
>> <[email protected]> wrote:
>>
>>> I followed the example tutorial at
>>> http://wiki.apache.org/nutch/NewScoringIndexingExample. Nearly all
>>> commands execute well except the LinkRank command.
>>>
>>> When executing the LinkRank command `nutch
>>> org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb
>>> crawl/webgraphdb/`, it throws the following exception:
>>>
>>> 11/09/22 14:44:56 FATAL webgraph.LinkRank: LinkAnalysis:
>>> java.io.IOException: No links to process, is the webgraph empty?
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:131)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:610)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:686)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:656)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> At the beginning I had not added the following properties to
>>> hadoop/conf/nutch-site.xml:
>>>
>>> <!-- linkrank scoring properties -->
>>> <property>
>>>   <name>link.ignore.internal.host</name>
>>>   <value>true</value>
>>>   <description>Ignore outlinks to the same hostname.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>link.ignore.internal.domain</name>
>>>   <value>true</value>
>>>   <description>Ignore outlinks to the same domain.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>link.ignore.limit.page</name>
>>>   <value>true</value>
>>>   <description>Limit to only a single outlink to the same page.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>link.ignore.limit.domain</name>
>>>   <value>true</value>
>>>   <description>Limit to only a single outlink to the same domain.</description>
>>> </property>
>>>
>>> But after adding those properties, the exception remains.
>>> What may cause such an error?
>>>
>>> Environment: java "1.6.0_26", debian with 2.6.39-2-686-pae kernel,
>>> nutch 1.3, hadoop 0.20.2
>>>
>>> Thanks
>>
>> --
>> *Lewis*
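On the question of where to read LinkRank scores: the nodes directory is a Hadoop MapFile, so URLs mixed with raw binary are expected when viewing it directly. The NewScoringIndexingExample tutorial linked in this thread dumps readable scores with the NodeDumper tool; a sketch (the webgraphdb path is the one from this thread, the -topn value and output path are arbitrary choices):

```shell
# Dump the top-scored nodes from the webgraph into a plain-text directory.
nutch org.apache.nutch.scoring.webgraph.NodeDumper \
  -webgraphdb crawl/webgraphdb \
  -scores -topn 100 \
  -output crawl/webgraphdb/dump/scores

# The output is readable text, one "url <TAB> score" line per node.
hadoop fs -cat crawl/webgraphdb/dump/scores/part-00000 | head
```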

