Is db.update.max.inlinks set to 1 in your configuration? (The default is
10000.) This property caps the number of inlinks that will be written.
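
To illustrate how such a cap would play out, here is a minimal Python sketch, not Nutch's actual DbUpdaterJob code. It inverts an outlink map into an inlink map and keeps at most max_inlinks sources per target. The function name, the toy URLs, and the "keep the first sources seen" rule are all assumptions made for illustration; which inlinks Nutch retains when the cap is hit may differ.

```python
from collections import defaultdict


def invert_outlinks(outlinks, max_inlinks=10000):
    """Turn {source: [targets]} into {target: [sources]}.

    Keeps at most max_inlinks sources per target. Sources are visited in
    sorted order, so the first ones seen are the ones kept (an assumption
    for this sketch, not necessarily what Nutch does).
    """
    inlinks = defaultdict(list)
    for source in sorted(outlinks):
        for target in outlinks[source]:
            sources = inlinks[target]
            # Skip duplicates and stop adding once the cap is reached.
            if source not in sources and len(sources) < max_inlinks:
                sources.append(source)
    return dict(inlinks)


# Hypothetical toy data standing in for the 'ol' column family.
outlinks = {
    "http://a.example/": ["http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
}

uncapped = invert_outlinks(outlinks)
capped = invert_outlinks(outlinks, max_inlinks=1)
print(uncapped)
print(capped)
```

With the cap at 1, every target in this toy data ends up with a single inlink, whereas a plain inversion of the same outlinks reports all of them.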

On Mon, Oct 29, 2012 at 7:07 PM, Thilina Gunarathne <[email protected]> wrote:

> Hi Ferdy,
> I ran updatedb a few times, but I still see only one inLink entry.
> However, I see hundreds of inLinks if I invert the outLink data. Does Nutch
> do any sort of filtering (e.g. ignoring inLinks from the same domain) when
> calculating the inLinks? Or am I doing something wrong by generating the
> inLinks from the outLinks data?
>
> thanks,
> Thilina
>
> On Mon, Oct 29, 2012 at 5:57 AM, Ferdy Galema <[email protected]> wrote:
>
> > Hi,
> >
> > The inlinks are populated by the DbUpdaterJob, which also does a couple
> > of other things (such as updating scores, fetch times, etc.).
> >
> > On Mon, Oct 29, 2012 at 4:31 AM, Thilina Gunarathne <[email protected]> wrote:
> >
> > > Dear all,
> > > I'm trying to extract the inLinks data from a not-so-large Nutch crawl
> > > that uses HBase as the data store. First, I tried the 'il' column
> > > family, but found only one page with inLinks listed in it. Then I used
> > > a simple MapReduce program to invert the outlinks data in the 'ol'
> > > column family and found many more pages with inLinks.
> > > I would like to know when the 'il' family gets populated. Also, is
> > > using a simple MapReduce program to invert the outlinks data the
> > > correct way to extract inLink information?
> > >
> > > thanks a lot in advance,
> > > Thilina
> > >
> > > --
> > > https://www.cs.indiana.edu/~tgunarat/
> > > http://www.linkedin.com/in/thilina
> > > http://thilina.gunarathne.org
> > >
> >
>
>
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org
>
