Do you have db.update.max.inlinks set to 1? (The default is 10000.) This property caps the number of inlinks that will be written.
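
To illustrate what such a cap does to an inverted link graph, here is a minimal standalone sketch in Python. It is not Nutch code: the function name invert_outlinks, the max_inlinks parameter, and the example URLs are made up for illustration, and the choice of which inlinks survive the cap (first in sorted source order) is an assumption, not a description of how Nutch selects them.

```python
from collections import defaultdict


def invert_outlinks(outlinks, max_inlinks=10000):
    """Invert {source: [targets]} into {target: [sources]}.

    At most max_inlinks sources are kept per target. Which ones are kept
    (first in sorted source order) is an assumption made for this sketch.
    """
    inlinks = defaultdict(list)
    for source in sorted(outlinks):
        for target in outlinks[source]:
            kept = inlinks[target]
            # Skip duplicate links and anything beyond the cap.
            if source not in kept and len(kept) < max_inlinks:
                kept.append(source)
    return dict(inlinks)


# Hypothetical outlink data: three pages all link to c.example/.
pages = {
    "a.example/": ["c.example/", "b.example/"],
    "b.example/": ["c.example/"],
    "d.example/": ["c.example/"],
}

uncapped = invert_outlinks(pages)
capped = invert_outlinks(pages, max_inlinks=1)
```

With the cap at 1, every target keeps a single inlink even though the uncapped inversion finds three for c.example/. That would match the symptom of seeing only one inLink entry while an inversion of the outlinks shows many more.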

On Mon, Oct 29, 2012 at 7:07 PM, Thilina Gunarathne <[email protected]> wrote:

> Hi Ferdy,
> I ran the updatedb few times, but still I see only one inLink entry.
> However, I see hundreds of inLinks if I invert the outLink data. Does nutch
> do any sort of filtering (eg: ignoring inLinks from same domain, etc) when
> calculating the inLinks? Or am I doing something wrong by generating the
> inLinks using the outLinks data.
>
> thanks,
> Thilina
>
> On Mon, Oct 29, 2012 at 5:57 AM, Ferdy Galema <[email protected]> wrote:
>
> > Hi,
> >
> > The inlinks are populated with the DbUpdaterJob, which does a couple of
> > other things too. (Such as updating scores, fetchtime etc.)
> >
> > On Mon, Oct 29, 2012 at 4:31 AM, Thilina Gunarathne <[email protected]> wrote:
> >
> > > Dear all,
> > > I'm trying to extract the InLinks data from a not-so-large Nutch crawl
> > > which uses HBase as the data store. First, I tried the 'il' column family,
> > > but found only one page with inLinks listed in it. Then I used a simple
> > > MapReduce program to invert the outlinks data in 'ol" column family and
> > > found many more pages with inLinks.
> > > I would like to know when the 'il' family get's populated? Also whether
> > > using a simple MapReduce program to invert the outlinks data is the correct
> > > way to extract any inLink information?
> > >
> > > thanks a lot in advance,
> > > Thilina
> > >
> > > --
> > > https://www.cs.indiana.edu/~tgunarat/
> > > http://www.linkedin.com/in/thilina
> > > http://thilina.gunarathne.org
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org

