Hi Sebastian,

I am using Nutch 1.x, built from the source code of the master branch, and the indexer is Elasticsearch (ES) 2.3.5.

Thanks,
Yongyao

On Thu, Apr 20, 2017 at 6:03 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Yongyao,
>
> this looks like a configuration issue of the index.
> In case of Solr (plugin indexer-solr):
>   inlinks and outlinks should be configured as multivalued
>
> That's the default for Solr 5, older versions need to specify
> this in the index configuration schema.
>
> Please, open also an issue on
>   https://issues.apache.org/jira/browse/NUTCH
> to add appropriate values to the default schema.xml
>
> But what Nutch version and what indexer are you using?
>
> Best,
> Sebastian
>
> On 04/18/2017 09:12 PM, Yongyao Jiang wrote:
> > Hi,
> >
> > I have crawled 10K web pages with "index-links" turned on, and
> > "linkdb.ignore.internal.links" set to false. But pretty much all pages I
> > have got only have one outlink and one inlink. This makes me very confused.
> >
> > Here is a sample,
> >
> > {
> >   "inlinks": "http://www.planetary.org/blogs/bruce-betts/",
> >   "tstamp": "2017-04-18T15:45:31.457Z",
> >   "nutch_score": 0.439538,
> >   "segment": "20170418154526",
> >   "digest": "1ef28e97795b40be08d312f630b1728f",
> >   "host": "www.planetary.org",
> >   "boost": "1.0",
> >   "contentLength": "10355",
> >   "outlinks": "http://ajax.googleapis.com/",
> > }
> >
> > Thanks,
> > Yongyao
> >

--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University

