Hello,

I use Nutch 1.6. I noticed that for some URLs Nutch fetches/parses far fewer links 
than the pages actually contain, even though parsechecker reports all of them as outlinks.
For example, bin/nutch parsechecker http://mydomain.com gives around 500 
outlinks (which are actually internal links with relative URLs). By contrast, if I do

1. bin/nutch inject crawl/crawldb urls/seed

2. bin/nutch generate crawl/crawldb crawl/segments

3. s1=`ls -d crawl/segments/* | tail -1`

4. bin/nutch fetch $s1

5. bin/nutch readseg -dump $s1 dumpseg -nocontent

in the file dumpseg/dump I see only 100 outlinks.

So, around 400 of the links are missing.

Any ideas what I am doing wrong here?
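This is how I am counting, in case I am miscounting: I grep the dump for outlink records (the `outlink: toUrl:` prefix is what I see in my readseg output; the excerpt below is a made-up two-line sample just to show the command, the real file is dumpseg/dump).

```shell
# Made-up excerpt mimicking two outlink records from a readseg dump:
cat > /tmp/dump_excerpt <<'EOF'
  outlink: toUrl: http://mydomain.com/page-a anchor: a
  outlink: toUrl: http://mydomain.com/page-b anchor: b
EOF

# Count outlink records (run the same grep on the real dumpseg/dump):
grep -c 'outlink: toUrl' /tmp/dump_excerpt
```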
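Could it be related to db.max.outlinks.per.page? I have not overridden it in my nutch-site.xml, and if I read nutch-default.xml correctly it defaults to 100, which would match the cutoff I am seeing exactly. Would adding something like the following lift the limit?

```xml
<!-- Assumption: not overridden in my nutch-site.xml, so the
     default of 100 from nutch-default.xml currently applies. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks processed per page.
  If this value is nonnegative, at most that many outlinks will be
  kept for a page; if it is -1, all outlinks will be processed.
  </description>
</property>
```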

regex-urlfilter.txt accepts everything, and nutch-site.xml contains the following.

<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.</description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>plugin.includes</name>
<value>protocol-(http|file)|urlfilter-(regex|suffix)|parse-(html|tika|zip)|index-(basic|anchor|more)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
