The parsechecker simply does not apply any filtering to the outlinks,
that's all. There is also a limit on the number of outlinks per page
which parsechecker does not enforce. To get more outlinks per page,
modify the parameter below in nutch-site.xml:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>
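
For example, going by the description above (a negative value means no
cap), the following sketch of the same property should make the crawldb
keep all outlinks per page:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Negative value: process all outlinks for a page rather
  than capping at a fixed number.</description>
</property>

With this in place, a re-run of the fetch/parse steps should show the
full outlink set in the segment dump rather than just the first 100.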

HTH

Julien


On 21 November 2013 23:33, <[email protected]> wrote:

> Hello,
>
> I use nutch-1.6. I noticed that for some urls nutch fetches/parses far
> fewer inlinks than are in them. However, parsechecker outputs all inlinks
> as outlinks.
> For example, bin/nutch parsechecker http://mydomain.com gives around 500
> outlinks (which are actually inlinks with relative urls). On the contrary,
> if I do
>
> 1.bin/nutch inject crawl/crawldb urls/seed
>
> 2.bin/nutch generate crawl/crawldb  crawl/segments
>
> 3.s1=`ls -d crawl_test/segments/* | tail -1`
>
> 4.bin/nutch fetch $s1
>
> 5.bin/nutch readseg -dump $s1  dumpseg -nocontent
>
> in the file dumpseg/dump I see only 100 outlinks.
>
> So, 400 of inlinks are missing.
>
> Any ideas what is done wrong here?
>
> regex-urlfilter.txt accepts everything and nutch-site.xml has the
> following.
>
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content. Default is false,
> which means
>   that a separate parsing step is required after fetching is
> finished.</description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>false</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
>
> <property>
>   <name>file.content.limit</name>
>   <value>-1</value>
> </property>
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>
> <value>protocol-(http|file)|urlfilter-(regex|suffix)|parse-(html|tika|zip)|index-(basic|anchor|more)|scoring-opic</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
> By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
> enable
>   protocol-httpclient, but be aware of possible intermittent problems with
> the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
