Check this linked wiki page: http://wiki.apache.org/nutch/NewScoringIndexingExample
On Friday 23 September 2011 12:51:37 Thomas Anderson wrote:
> I solved this problem by re-running it several times:
>
> nutch generate /path/to/crawldb /path/to/segments -topN 1000
> nutch fetch crawl/to/segments/folder
> nutch updatedb crawldb crawl/to/segments/folder
> nutch parse crawl/to/segments/folder
>
> nutch generate /path/to/crawldb /path/to/segments -topN 1000
> nutch fetch crawl/to/segments/folder
> nutch updatedb crawldb crawl/to/segments/folder
> nutch parse crawl/to/segments/folder
>
> But now a new question arises. How can I check/see the scores
> calculated by LinkRank?
>
> http://wiki.apache.org/nutch/NewScoring#LinkRank states that scores are
> stored in the node database. But when checking path/to/webgraphdb/nodes,
> only some URLs and seemingly random values are recorded:
>
> a�'http://xoom.myblog.it/>[a�K;:http://xyz.freeweblogger.com/stats/r/rgmdecorazionievetri/>
> �(http://yeniasya.com.tr/>)http://yenimesaj.com.tr/>)http://yenisafak.com.tr/>
> @0/http://ync.ne.jp:8080/cms/html/17110255567.html>[a�,http://youtu.be/IGEhSOsmq50>
> �0http://youtube.com/justthirdway>:��%http://yukichika.jp/>�.https://ehrincentives.cms.gov>/�0
>
> Is this the right place to check? Or where should I check its output?
>
> Thanks.
>
> On Fri, Sep 23, 2011 at 1:55 PM, Thomas Anderson
> <[email protected]> wrote:
> > I re-crawled from the injection stage, but it still throws
> > `webgraph.LinkRank: LinkAnalysis: java.io.IOException: No links to
> > process, is the webgraph empty?'
> >
> > Checking the source shows that HDFS will read from
> > numLinksPath/part-00000, where numLinksPath is constructed from
> > webGraphDb/NUM_NODES and NUM_NODES is "_num_nodes_". However,
> > listing the HDFS content under webgraphdb, no such path exists.
> >
> > drwxr-xr-x - crawler supergroup 0 2011-09-23 13:17 /crawl/webgraphdb/inlinks
> > drwxr-xr-x - crawler supergroup 0 2011-09-23 13:27 /crawl/webgraphdb/linkrank
> > drwxr-xr-x - crawler supergroup 0 2011-09-23 13:25 /crawl/webgraphdb/loops
> > drwxr-xr-x - crawler supergroup 0 2011-09-23 13:18 /crawl/webgraphdb/nodes
> > drwxr-xr-x - crawler supergroup 0 2011-09-23 13:16 /crawl/webgraphdb/outlinks
> > drwxr-xr-x - crawler supergroup 0 2011-09-23 13:24 /crawl/webgraphdb/routes
> >
> > How can I examine which property values are used when Nutch runs?
> >
> > Thanks
> >
> > On Thu, Sep 22, 2011 at 5:26 PM, lewis john mcgibbney
> > <[email protected]> wrote:
> >> Hi Thomas,
> >>
> >> After adding the properties as you mentioned, did you re-start at the
> >> injecting stage or did you just use the webgraph class? If the latter,
> >> then I would try re-starting the whole process, maybe even checking that
> >> you are reading your crawldb on the way to executing the webgraph class.
> >>
> >> Just a quick note on this one: Markus (I think) added the webgraph
> >> commands to the nutch script, so this creates a simpler working
> >> environment from 1.4 onwards.
> >>
> >> On Thu, Sep 22, 2011 at 7:53 AM, Thomas Anderson
> >> <[email protected]> wrote:
> >>> I followed the example tutorial at
> >>> http://wiki.apache.org/nutch/NewScoringIndexingExample. Nearly every
> >>> command executes well except the LinkRank command.
> >>>
> >>> When executing the LinkRank command `nutch
> >>> org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb
> >>> crawl/webgraphdb/`, it throws the following exception:
> >>>
> >>> 11/09/22 14:44:56 FATAL webgraph.LinkRank: LinkAnalysis:
> >>> java.io.IOException: No links to process, is the webgraph empty?
> >>> at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:131)
> >>> at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:610)
> >>> at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:686)
> >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>> at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:656)
> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>>
> >>> At the beginning I had not added the following properties to
> >>> hadoop/conf/nutch-site.xml:
> >>>
> >>> <!-- linkrank scoring properties -->
> >>> <property>
> >>>   <name>link.ignore.internal.host</name>
> >>>   <value>true</value>
> >>>   <description>Ignore outlinks to the same hostname.</description>
> >>> </property>
> >>>
> >>> <property>
> >>>   <name>link.ignore.internal.domain</name>
> >>>   <value>true</value>
> >>>   <description>Ignore outlinks to the same domain.</description>
> >>> </property>
> >>>
> >>> <property>
> >>>   <name>link.ignore.limit.page</name>
> >>>   <value>true</value>
> >>>   <description>Limit to only a single outlink to the same page.</description>
> >>> </property>
> >>>
> >>> <property>
> >>>   <name>link.ignore.limit.domain</name>
> >>>   <value>true</value>
> >>>   <description>Limit to only a single outlink to the same domain.</description>
> >>> </property>
> >>>
> >>> But after adding those properties, the exception remains.
> >>> What may cause such an error?
> >>>
> >>> Environment: Java 1.6.0_26, Debian with a 2.6.39-2-686-pae kernel,
> >>> Nutch 1.3, Hadoop 0.20.2
> >>>
> >>> Thanks
> >>
> >> --
> >> *Lewis*

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
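For reference, the scoring sequence the wiki example walks through can be sketched as below. This is a sketch, not the authoritative steps: the class names are the ones quoted in the thread, but the segment path is a placeholder and the flags are assumptions to verify by running each class without arguments on your Nutch version. Note that WebGraph builds its link database from parsed segments, and with the `link.ignore.internal.*` properties set to true, a crawl confined to a single host or domain can legitimately produce a webgraph with no links, which would trigger the "No links to process" error.

```shell
# Hedged sketch of the webgraph/LinkRank pipeline (flags and the segment
# path are assumptions; check the usage output of each class).

# 1. Build the webgraph (inlinks, outlinks, nodes) from a parsed segment.
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
  -webgraphdb crawl/webgraphdb \
  -segment crawl/segments/SEGMENT_DIR

# 2. Run the LinkRank analysis; scores are written into the node database.
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank \
  -webgraphdb crawl/webgraphdb

# 3. Push the LinkRank scores back into the crawldb for indexing.
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater \
  -crawldb crawl/crawldb \
  -webgraphdb crawl/webgraphdb
```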

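On the question of reading the scores: the nodes directory is a binary Hadoop MapFile, which is why reading it directly shows URLs interleaved with raw bytes. The same webgraph package ships a NodeDumper tool that writes the node database out as plain text. A sketch, with the caveat that the flag names are assumptions based on Nutch 1.3 and should be confirmed by invoking the class with no arguments:

```shell
# Hedged sketch: dump the top-scoring nodes as text (flag names assumed).
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper \
  -webgraphdb crawl/webgraphdb \
  -scores \
  -topn 25 \
  -output crawl/webgraphdb/dump

# The result is a plain-text part file with one URL and its score per line.
hadoop fs -cat crawl/webgraphdb/dump/part-00000
```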

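The thread also asks how to examine which property values Nutch actually uses. Since nutch-site.xml overrides nutch-default.xml for any key it sets, one low-tech check is to compare the two files directly. A sketch, assuming the standard conf/ layout (paths may differ in your install):

```shell
# Hedged sketch: show where each link.* property is defined. For any key
# present in both files, the nutch-site.xml value is the one in effect.
for key in link.ignore.internal.host link.ignore.internal.domain \
           link.ignore.limit.page link.ignore.limit.domain; do
  echo "== $key =="
  # -A1 prints the <value> line that follows each matching <name> line.
  grep -A1 "<name>$key</name>" conf/nutch-default.xml conf/nutch-site.xml
done
```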