Hi Ferdy,
I ran updatedb a few times, but I still see only one inLink entry.
However, I see hundreds of inLinks if I invert the outLink data. Does Nutch
do any sort of filtering (e.g. ignoring inLinks from the same domain) when
calculating the inLinks? Or am I doing something wrong by generating the
inLinks from the outLinks data?
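
To make the comparison concrete, here is a minimal in-memory sketch of the
inversion described above. It is not the actual MapReduce job and not Nutch
code: the class InlinkInverter, its methods, the skipSameHost flag, and the
sample URLs are all hypothetical. The flag only illustrates the kind of
same-host filtering asked about; whether Nutch applies such a filter is
exactly the open question.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, for illustration only (not part of Nutch).
class InlinkInverter {

    // Turns a "source page -> outlinks" map into a "target page -> inlinks" map.
    // When skipSameHost is true, links whose source and target share a host
    // are dropped before inversion.
    static Map<String, Set<String>> invert(Map<String, List<String>> outlinks,
                                           boolean skipSameHost) {
        Map<String, Set<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            String source = entry.getKey();
            for (String target : entry.getValue()) {
                if (skipSameHost && sameHost(source, target)) {
                    continue;
                }
                inlinks.computeIfAbsent(target, k -> new TreeSet<>()).add(source);
            }
        }
        return inlinks;
    }

    // Compares the host parts of two absolute URLs, ignoring case.
    static boolean sameHost(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the 'ol' column family.
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("http://a.example/1",
                Arrays.asList("http://a.example/2", "http://b.example/"));
        outlinks.put("http://b.example/",
                Arrays.asList("http://a.example/2"));

        System.out.println("all links:      " + invert(outlinks, false));
        System.out.println("cross-host only: " + invert(outlinks, true));
    }
}
```

If the counts from a plain inversion and from a filtered one differ as much
as the 'ol' and 'il' families do, that would point towards filtering rather
than a mistake in the inversion itself.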

thanks,
Thilina

On Mon, Oct 29, 2012 at 5:57 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> The inlinks are populated by the DbUpdaterJob, which also does a couple of
> other things (such as updating scores, fetch times, etc.).
>
> On Mon, Oct 29, 2012 at 4:31 AM, Thilina Gunarathne <[email protected]> wrote:
>
> > Dear all,
> > I'm trying to extract the inLinks data from a not-so-large Nutch crawl
> > that uses HBase as the data store. First, I tried the 'il' column family,
> > but found only one page with inLinks listed in it. Then I used a simple
> > MapReduce program to invert the outlinks data in the 'ol' column family
> > and found many more pages with inLinks.
> > I would like to know when the 'il' family gets populated. Also, is
> > using a simple MapReduce program to invert the outlinks data the correct
> > way to extract inLink information?
> >
> > thanks a lot in advance,
> > Thilina
> >
> > --
> > https://www.cs.indiana.edu/~tgunarat/
> > http://www.linkedin.com/in/thilina
> > http://thilina.gunarathne.org
> >
>



-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
