Hello, I use Nutch 1.6. I noticed that for some URLs Nutch fetches/parses far fewer links than are actually present on the page. However, parsechecker reports all of them as outlinks. For example, bin/nutch parsechecker http://mydomain.com gives around 500 outlinks (most of them internal links with relative URLs). In contrast, if I do
1. bin/nutch inject crawl/crawldb urls/seed
2. bin/nutch generate crawl/crawldb crawl/segments
3. s1=`ls -d crawl/segments/* | tail -1`
4. bin/nutch fetch $s1
5. bin/nutch readseg -dump $s1 dumpseg -nocontent

then in the file dumpseg/dump I see only 100 outlinks, so about 400 links are missing. Any ideas what is going wrong here? regex-urlfilter.txt accepts everything, and nutch-site.xml has the following:

<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.</description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored.
  This is an effective way to limit the crawl to include only initially injected hosts,
  without creating complex URLFilters.</description>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from the same host are
  ignored. This is an effective way to limit the size of the link database, keeping
  only the highest quality links.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|file)|urlfilter-(regex|suffix)|parse-(html|tika|zip)|index-(basic|anchor|more)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to include. Any plugin
  not matching this expression is excluded. In any case you need at least include the
  nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain
  text via HTTP, and basic indexing and search plugins. In order to use HTTPS please
  enable protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.</description>
</property>
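One default that might be relevant (this is only my guess, since I have not overridden it): nutch-default.xml caps the number of outlinks kept per page via db.max.outlinks.per.page, which defaults to 100 in Nutch 1.x, matching the 100 outlinks I see in the dump. If that is the cause, raising the limit in nutch-site.xml would look roughly like this:

```xml
<!-- Not set in my nutch-site.xml, so the nutch-default.xml value (100) applies.
     A negative value means all outlinks on a page are processed. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```

Can anyone confirm whether this property also affects what readseg -dump shows?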

