I solved this problem by re-running the following commands several times:

nutch generate /path/to/crawldb /path/to/segments -topN 1000
nutch fetch crawl/to/segments/folder
nutch updatedb crawldb crawl/to/segments/folder
nutch parse crawl/to/segments/folder
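The repeated cycle can be wrapped in a small loop. This is only a sketch: the paths mirror the placeholder paths in the commands above, and picking the newest segment assumes segments are named by timestamp (as generate does by default), so adjust to your layout.

```shell
# Repeat the generate/fetch/updatedb/parse cycle a few times.
# CRAWLDB and SEGMENTS are placeholders, as in the commands above.
CRAWLDB=/path/to/crawldb
SEGMENTS=/path/to/segments

for i in 1 2 3; do
  nutch generate "$CRAWLDB" "$SEGMENTS" -topN 1000
  # generate creates a new timestamped segment; take the newest one
  segment=$(ls -d "$SEGMENTS"/* | sort | tail -1)
  nutch fetch "$segment"
  nutch updatedb "$CRAWLDB" "$segment"
  nutch parse "$segment"
done
```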
But now a new question arises: how can I check/see the scores calculated by LinkRank? http://wiki.apache.org/nutch/NewScoring#LinkRank states that scores are stored in the node database. But when checking path/to/webgraphdb/nodes, only some URLs interleaved with unreadable values are recorded:

a�'http://xoom.myblog.it/>[a�K;:http://xyz.freeweblogger.com/stats/r/rgmdecorazionievetri/>�(http://yeniasya.com.tr/>)http://yenimesaj.com.tr/>)http://yenisafak.com.tr/>@0/http://ync.ne.jp:8080/cms/html/17110255567.html>[a�,http://youtu.be/IGEhSOsmq50>�0 http://youtube.com/justthirdway>:��%http://yukichika.jp/>�.https://ehrincentives.cms.gov>/�0

Is this the right place to check? Or where can I find its output? Thanks.

On Fri, Sep 23, 2011 at 1:55 PM, Thomas Anderson <[email protected]> wrote:
> I re-crawled from the injection stage, but it still throws
> `webgraph.LinkRank: LinkAnalysis: java.io.IOException: No links to
> process, is the webgraph empty?'
>
> Checking the source shows that hdfs will read from
> numLinksPath/part-00000, where numLinksPath is constructed from
> webGraphDb/NUM_NODES and NUM_NODES is "_num_nodes_". However, when
> listing the hdfs content under webgraphdb, no such path exists:
>
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:17 /crawl/webgraphdb/inlinks
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:27 /crawl/webgraphdb/linkrank
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:25 /crawl/webgraphdb/loops
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:18 /crawl/webgraphdb/nodes
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:16 /crawl/webgraphdb/outlinks
> drwxr-xr-x - crawler supergroup 0 2011-09-23 13:24 /crawl/webgraphdb/routes
>
> How can I examine which property values are used when nutch runs?
>
> Thanks
>
> On Thu, Sep 22, 2011 at 5:26 PM, lewis john mcgibbney
> <[email protected]> wrote:
>> Hi Thomas,
>>
>> After adding the properties as you mentioned, did you re-start at the
>> injecting stage or did you just use the webgraph class? If the latter, then
>> I would try re-starting the whole process, maybe even checking you are
>> reading your crawldb on the way to executing the webgraph class.
>>
>> Just a quick note on this one: Markus (I think) added the webgraph commands
>> to the nutch script, so this creates a simpler working environment from 1.4
>> onwards.
>>
>> On Thu, Sep 22, 2011 at 7:53 AM, Thomas Anderson
>> <[email protected]> wrote:
>>
>>> I followed the example tutorial at
>>> http://wiki.apache.org/nutch/NewScoringIndexingExample. Nearly all
>>> commands execute well except the LinkRank command.
>>>
>>> When executing the LinkRank command `nutch
>>> org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb
>>> crawl/webgraphdb/`, it throws the following exception:
>>>
>>> 11/09/22 14:44:56 FATAL webgraph.LinkRank: LinkAnalysis:
>>> java.io.IOException: No links to process, is the webgraph empty?
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:131)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:610)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:686)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:656)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> At the beginning I had not added the following properties to
>>> hadoop/conf/nutch-site.xml:
>>>
>>> <!-- linkrank scoring properties -->
>>> <property>
>>>   <name>link.ignore.internal.host</name>
>>>   <value>true</value>
>>>   <description>Ignore outlinks to the same hostname.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>link.ignore.internal.domain</name>
>>>   <value>true</value>
>>>   <description>Ignore outlinks to the same domain.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>link.ignore.limit.page</name>
>>>   <value>true</value>
>>>   <description>Limit to only a single outlink to the same page.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>link.ignore.limit.domain</name>
>>>   <value>true</value>
>>>   <description>Limit to only a single outlink to the same domain.</description>
>>> </property>
>>>
>>> But after adding those properties, the exception remains.
>>> What may cause such an error?
>>>
>>> Environment: java "1.6.0_26", debian with 2.6.39-2-686-pae kernel,
>>> nutch 1.3, hadoop 0.20.2
>>>
>>> Thanks
>>
>> --
>> *Lewis*
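On the question of where to read LinkRank scores: the nodes directory is a Hadoop MapFile, so URLs mixed with raw binary are expected when viewing it directly. The NewScoringIndexingExample tutorial linked in this thread dumps readable scores with the NodeDumper tool; a sketch (the webgraphdb path is the one from this thread, the -topn value and output path are arbitrary choices):

```shell
# Dump the top-scored nodes from the webgraph into a plain-text directory.
nutch org.apache.nutch.scoring.webgraph.NodeDumper \
  -webgraphdb crawl/webgraphdb \
  -scores -topn 100 \
  -output crawl/webgraphdb/dump/scores

# The output is readable text, one "url <TAB> score" line per node.
hadoop fs -cat crawl/webgraphdb/dump/scores/part-00000 | head
```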

