Hi Ferdy,

I ran updatedb a few times, but I still see only one inLink entry. However, I see hundreds of inLinks if I invert the outLink data. Does Nutch do any sort of filtering (e.g., ignoring inLinks from the same domain) when calculating the inLinks? Or am I doing something wrong by generating the inLinks from the outLinks data?

thanks,
Thilina

On Mon, Oct 29, 2012 at 5:57 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> The inlinks are populated with the DbUpdaterJob, which does a couple of
> other things too. (Such as updating scores, fetchtime etc.)
>
> On Mon, Oct 29, 2012 at 4:31 AM, Thilina Gunarathne <[email protected]> wrote:
>
> > Dear all,
> > I'm trying to extract the InLinks data from a not-so-large Nutch crawl
> > which uses HBase as the data store. First, I tried the 'il' column family,
> > but found only one page with inLinks listed in it. Then I used a simple
> > MapReduce program to invert the outlinks data in the 'ol' column family and
> > found many more pages with inLinks.
> > I would like to know when the 'il' family gets populated? Also whether
> > using a simple MapReduce program to invert the outlinks data is the correct
> > way to extract any inLink information?
> >
> > thanks a lot in advance,
> > Thilina
> >
> > --
> > https://www.cs.indiana.edu/~tgunarat/
> > http://www.linkedin.com/in/thilina
> > http://thilina.gunarathne.org

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
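
For reference, the outlink inversion discussed in this thread can be sketched in a few lines. This is only an in-memory illustration in Python, not the MapReduce program mentioned above and not Nutch's own code. The function name invert_outlinks, the ignore_same_host flag, and the sample URLs are all hypothetical. The flag mimics the kind of same-host filtering asked about in the question; whether DbUpdaterJob applies any such filter is not established here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def invert_outlinks(outlinks, ignore_same_host=False):
    """Turn {source_url: {target_url: anchor}} into {target_url: {source_url: anchor}}.

    ignore_same_host is a hypothetical switch: when True, links whose source
    and target share a hostname are dropped, so they never become inlinks.
    """
    inlinks = defaultdict(dict)
    for source, targets in outlinks.items():
        source_host = urlsplit(source).hostname
        for target, anchor in targets.items():
            if ignore_same_host and urlsplit(target).hostname == source_host:
                continue  # skip internal (same-host) links
            inlinks[target][source] = anchor
    return dict(inlinks)


# Made-up sample data standing in for rows of the 'ol' column family.
sample_outlinks = {
    "http://a.example.com/": {
        "http://a.example.com/about": "About",
        "http://b.example.org/": "B",
    },
    "http://b.example.org/": {
        "http://a.example.com/about": "About A",
    },
}

all_inlinks = invert_outlinks(sample_outlinks)
external_inlinks = invert_outlinks(sample_outlinks, ignore_same_host=True)
```

Comparing all_inlinks with external_inlinks on real data is one way to check whether a same-host filter alone could explain a gap between the inverted outlinks and what the 'il' family contains.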

