Dear all,
I'm trying to extract the InLinks data from a not-so-large Nutch crawl
which uses HBase as the data store. First, I tried the 'il' column family,
but found only one page with inLinks listed in it. Then I used a simple
MapReduce program to invert the outlinks data in the 'ol' column family and
found many more pages with inLinks.
I would like to know when the 'il' family gets populated. Also, is using
a simple MapReduce program to invert the outlinks data the correct way to
extract inLink information?
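
To make the inversion concrete, here is a minimal in-memory sketch in plain
Python. It is not the actual MapReduce job and uses no Nutch or HBase API. The
function name invert_outlinks, the dict layout (page URL -> list of
(target URL, anchor text) pairs), and the example URLs are assumptions made
up for illustration. The "map" step emits one (target, source) record per
outlink, and the "reduce" step groups the sources by target.

```python
from collections import defaultdict


def invert_outlinks(outlinks):
    """Invert a page -> outlinks mapping into a page -> inlinks mapping.

    outlinks: dict mapping a source URL to an iterable of
              (target URL, anchor text) pairs. This layout is an assumption,
              not the real on-disk format of the 'ol' family.
    Returns a dict mapping each target URL to {source URL: anchor text}.
    """
    # "Map" phase: emit one (target, (source, anchor)) record per outlink.
    emitted = []
    for source, links in outlinks.items():
        for target, anchor in links:
            emitted.append((target, (source, anchor)))

    # "Reduce" phase: group the emitted records by target URL.
    inlinks = defaultdict(dict)
    for target, (source, anchor) in emitted:
        inlinks[target][source] = anchor
    return dict(inlinks)


if __name__ == "__main__":
    # Hypothetical crawl data, for illustration only.
    sample = {
        "http://a.example/": [("http://b.example/", "to b"),
                              ("http://c.example/", "to c")],
        "http://b.example/": [("http://c.example/", "also to c")],
    }
    for target, sources in sorted(invert_outlinks(sample).items()):
        print(target, "<-", sorted(sources))
```

Pages that nothing links to simply do not appear in the result, so a page
missing from the output has no known inlinks in the crawled set.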

thanks a lot in advance,
Thilina

-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
