Re: Nutch trunk IndexWriter Plugin

Julien Nioche Wed, 29 May 2013 01:35:28 -0700

Hi Alex


On 28 May 2013 06:29, AC Nutch <[email protected]> wrote:

> Hi All,
>
> I'm using Nutch 1.7 (trunk) and writing a plugin to index to HBase (using
> Nutch 2.1 is not an option - I had to use 1.7 and write an indexer myself).
> I believe I'm well on my way, but I had a few questions.
>
> So my first step in the process was to make sure that the NutchDocument
> held the fields that I need. For me, I'm trying to keep track of the
> Inlinks and Outlinks of individual pages. With that in mind I am adding
> (with *.add()*) to the document two fields:
>
> - An Inlinks iterator with something like the following: *inlks =
> inlinks.iterator();*
>
> - An array of Outlinks with something like the following: *outlinks =
> parse.getData().getOutlinks();*
>
> All seems to go well, and later I am able to access the inlinks and
> outlinks of the page from the NutchDocument as needed for the IndexWriter
> plugin. Great (as an aside - this really is a wonderful plugin structure).
> So here are my questions:
>
> (1) When I run the indexer with *bin/nutch index crawl/crawldb -linkdb
> crawl/linkdb -dir crawl/segments* what documents will this index? Put in
> another way, if I have the IndexWriter plugin indexing to a database with
> information read from a NutchDocument object, will there be one
> NutchDocument for every website in my crawldb? Or will there be an
> instance only for the new sites that have been fetched? How does that work?
> I'm not familiar enough with the internals of Nutch to understand exactly
> which NutchDocuments are built and when... Any clarification would
> be much appreciated. I'm trying to make sure I'm efficiently writing to the
> db while keeping my outlinks and inlinks up to date in HBase.
>

The indexer will generate a NutchDocument for every page in all the
segments (since you used -dir). It has no knowledge of whether a page is
already known or not.
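
Since every page in the input segments is emitted again on each run, one way to keep the target table consistent is to make the write idempotent by keying each row on the page URL. The following is only a minimal sketch of that idea: the class UrlKeyedStore and its methods are hypothetical, and an in-memory map stands in for the HBase table. It does not use the Nutch or HBase APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for an HBase-backed writer: an in-memory map keyed by
 * URL plays the role of the table. None of these names are Nutch or HBase APIs.
 */
public class UrlKeyedStore {
    // Row key (page URL) -> stored field values for that page.
    private final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    /** Writes a document; a second write for the same URL replaces the first. */
    public void write(String url, Map<String, String> fields) {
        rows.put(url, new LinkedHashMap<>(fields));
    }

    /** Returns the stored fields for a URL, or null if the URL was never written. */
    public Map<String, String> get(String url) {
        return rows.get(url);
    }

    /** Number of distinct URLs stored. */
    public int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        UrlKeyedStore store = new UrlKeyedStore();
        // First indexing run over a segment.
        store.write("http://example.com/", Map.of("title", "old"));
        // A later run re-emits the same page; the row is overwritten, not duplicated.
        store.write("http://example.com/", Map.of("title", "new"));
        System.out.println(store.size() + " row(s), title="
                + store.get("http://example.com/").get("title"));
    }
}
```

With a URL row key, re-running the indexer over segments that were already indexed costs extra writes but does not create duplicate rows.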


>
> (2) When I add inlinks to my NutchDocument, I'm doing it from a filter
> method like:
> *NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum
> datum, Inlinks inlinks)*
> and simply using the inlinks instance passed into this method. However, I'm
> not entirely clear on where those inlinks 'come from' at execution - are
> they being read from the linkdb and then passed in to the method? Is there
> a better way to read inlinks? I'm concerned that every time I run
> "invertlinks" the links in my HBase instance will no longer be up to
> date. Is this correct? And if so is there a way to resolve that?
>

The inlinks do indeed come from the linkdb, and you get them only for the
docs in the segments passed as input to the indexer. The indexer will update
your HBase tables with these inlinks.
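
Because the inlinks handed to the indexer reflect the linkdb at indexing time, it may be simplest to replace the stored inlinks for a page on each run instead of appending to them. A Java iterator can also be consumed only once, so copying its contents into a list before storing avoids surprises later. The sketch below illustrates both points under stated assumptions: Link is a hypothetical stand-in for a linkdb entry (source URL plus anchor text), and a plain map stands in for the HBase table. It does not use the Nutch or HBase APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class InlinkSnapshot {
    /** Hypothetical stand-in for a linkdb entry: source URL plus anchor text. */
    public static final class Link {
        final String fromUrl;
        final String anchor;

        public Link(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    /** Drains a single-use iterator into a list that can be stored and re-read. */
    public static List<String> snapshot(Iterator<Link> inlinks) {
        List<String> out = new ArrayList<>();
        while (inlinks.hasNext()) {
            Link link = inlinks.next();
            out.add(link.fromUrl + "\t" + link.anchor);
        }
        return out;
    }

    /** Replaces (never appends to) the stored inlinks for one page. */
    public static void store(Map<String, List<String>> table, String url,
                             Iterator<Link> inlinks) {
        table.put(url, snapshot(inlinks));
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new HashMap<>();
        // First run: two inlinks known for the page.
        store(table, "http://example.com/", Arrays.asList(
                new Link("http://a.example/", "home"),
                new Link("http://b.example/", "ref")).iterator());
        // Later run after the link inversion was redone: the stored list is replaced.
        store(table, "http://example.com/", Arrays.asList(
                new Link("http://c.example/", "new")).iterator());
        System.out.println(table.get("http://example.com/"));
    }
}
```

Pages that are not in the segments passed to the indexer keep whatever inlinks were stored for them on an earlier run.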

HTH

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
