Nutch trunk IndexWriter Plugin

AC Nutch Mon, 27 May 2013 22:30:32 -0700

Hi All,

I'm using Nutch 1.7 (trunk) and writing a plugin to index to HBase (using
Nutch 2.1 is not an option - I have to use 1.7 and write an indexer myself).
I believe I'm well on my way, but I have a few questions.


So my first step in the process was to make sure that the NutchDocument
held the fields that I need. For me, I'm trying to keep track of the
Inlinks and Outlinks of individual pages. With that in mind I am adding
(with *.add()*) to the document two fields:

- An Inlinks iterator with something like the following: *inlks =
inlinks.iterator();*

- An array of Outlinks with something like the following: *outlinks =
parse.getData().getOutlinks();*
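
A rough, self-contained sketch of that step follows. The *Doc* and *Link* classes and the field names "inlinks" and "outlinks" are stand-ins invented so the snippet runs on its own. They are not the real Nutch classes, and the actual plugin code may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Stand-in types only: none of these are the real Nutch classes.
final class LinkFieldsSketch {

    // Hypothetical stand-in for a link: a URL plus its anchor text.
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    // Hypothetical stand-in for the document: field name -> list of values.
    static final class Doc {
        private final Map<String, List<Object>> fields = new HashMap<>();

        void add(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        List<Object> get(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    // Adds the two link fields: the inlinks as an iterator, the outlinks as an array.
    // The field names are assumptions made for this sketch.
    static Doc addLinkFields(Doc doc, Iterator<Link> inlinks, Link[] outlinks) {
        doc.add("inlinks", inlinks);
        doc.add("outlinks", outlinks);
        return doc;
    }

    public static void main(String[] args) {
        List<Link> in = Arrays.asList(
                new Link("http://a.example/", "from a"),
                new Link("http://b.example/", "from b"));
        Link[] out = { new Link("http://c.example/", "to c") };

        Doc doc = addLinkFields(new Doc(), in.iterator(), out);

        // Later (e.g. in the writer), read the outlinks array back from the document.
        Link[] stored = (Link[]) doc.get("outlinks").get(0);
        System.out.println("outlinks stored: " + stored.length);
    }
}
```

One thing the sketch makes visible: a Java Iterator can only be walked once, so whichever code reads the "inlinks" field first consumes it.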

All seems to go well, and later I am able to access the inlinks and
outlinks of the page from the NutchDocument as needed for the IndexWriter
plugin. Great (as an aside - this really is a wonderful plugin structure).
So here are my questions:

(1) When I run the indexer with *bin/nutch index crawl/crawldb -linkdb
crawl/linkdb -dir crawl/segments*, which documents will it index? Put
another way, if my IndexWriter plugin writes to a database using
information read from a NutchDocument object, will there be one
NutchDocument for every site in my crawldb, or only for the new sites that
have been fetched? How does that work? I'm not familiar enough with the
internals of Nutch to understand exactly which NutchDocuments are built
and when. Any clarification would be much appreciated. I'm trying to make
sure I'm writing to the db efficiently while keeping my outlinks and
inlinks up to date in HBase.

(2) When I add inlinks to my NutchDocument, I'm doing it from a filter
method like:
*NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum
datum, Inlinks inlinks)*
and simply using the inlinks instance passed into this method. However, I'm
not entirely clear on where those inlinks 'come from' at execution - are
they being read from the linkdb and then passed in to the method? Is there
a better way to read inlinks? I'm concerned that every time I run
"invertlinks", the links in my HBase instance will no longer be up to
date. Is this correct? And if so, is there a way to resolve that?

Thanks,

Alex
