Dear all, I'm trying to extract the inlinks data from a not-so-large Nutch crawl that uses HBase as the data store. First, I looked at the 'il' column family, but found only one page with inlinks listed in it. Then I used a simple MapReduce program to invert the outlinks data in the 'ol' column family and found many more pages with inlinks. I would like to know when the 'il' family gets populated. Also, is inverting the outlinks data with a simple MapReduce program the correct way to extract inlink information?
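The inversion idea can be sketched in plain Java, leaving out Hadoop and HBase entirely. This is only an illustration under assumptions: the class name `InlinkInverter` and the in-memory map from page URL to outlink URLs are hypothetical stand-ins for rows of the 'ol' family, not Nutch or HBase APIs. The loop plays the role of the map phase (emit target, source for every edge) and the grouping map plays the role of the shuffle and reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: inverts an outlink graph into an inlink graph in memory.
final class InlinkInverter {

    // outlinks: page URL -> URLs that page links to (assumed shape of the 'ol' data).
    // Returns: page URL -> URLs of pages that link to it.
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> row : outlinks.entrySet()) {
            String source = row.getKey();
            for (String target : row.getValue()) {
                // "Map" step: emit (target, source); "reduce" step: group by target.
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(source);
            }
        }
        return inlinks;
    }
}
```

Pages that nobody links to simply do not appear in the result. The order of sources within each list follows the iteration order of the input map, so a sorted input map gives sorted inlink lists.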
Thanks a lot in advance, Thilina -- https://www.cs.indiana.edu/~tgunarat/ http://www.linkedin.com/in/thilina http://thilina.gunarathne.org

