The parsechecker simply does not apply any filtering to the outlinks, that's all. There is also a per-page limit on the number of outlinks, which the parsechecker does not apply either. To get more outlinks per page, override the following parameter in nutch-site.xml.
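For example, a minimal override sketch that lifts the limit entirely (per the property's own description below, a negative value means all outlinks are processed; a positive value such as 500 would simply raise the cap instead):

<property>
  <name>db.max.outlinks.per.page</name>
  <!-- -1 (any negative value) disables the per-page outlink limit -->
  <value>-1</value>
</property>

The default definition, for reference: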
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>

HTH

Julien

On 21 November 2013 23:33, <[email protected]> wrote:

> Hello,
>
> I use nutch-1.6. I noticed that for some URLs nutch fetches/parses far
> fewer inlinks than are actually in the pages. However, parsechecker
> outputs all of these inlinks as outlinks.
> For example, bin/nutch parsechecker http://mydomain.com gives around 500
> outlinks (which are actually inlinks with relative URLs). On the
> contrary, if I do
>
> 1. bin/nutch inject crawl/crawldb urls/seed
> 2. bin/nutch generate crawl/crawldb crawl/segments
> 3. s1=`ls -d crawl_test/segments/* | tail -1`
> 4. bin/nutch fetch $s1
> 5. bin/nutch readseg -dump $s1 dumpseg -nocontent
>
> then in the file dumpseg/dump I see only 100 outlinks.
>
> So 400 of the inlinks are missing.
>
> Any ideas what is going wrong here?
>
> regex-urlfilter.txt accepts everything, and nutch-site.xml has the
> following:
>
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching
>   is finished.</description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>false</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored. This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
>
> <property>
>   <name>file.content.limit</name>
>   <value>-1</value>
> </property>
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|file)|urlfilter-(regex|suffix)|parse-(html|tika|zip)|index-(basic|anchor|more)|scoring-opic</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

