Re: Nutch trunk IndexWriter Plugin

AC Nutch Wed, 29 May 2013 14:07:35 -0700

Hi Julien,

Thanks a lot for the response. That clarifies everything!


Alex


On Wed, May 29, 2013 at 4:33 AM, Julien Nioche <[email protected]> wrote:

> Hi Alex
>
>
> On 28 May 2013 06:29, AC Nutch <[email protected]> wrote:
>
> > Hi All,
> >
> > I'm using Nutch 1.7 (trunk) and writing a plugin to index to HBase (using
> > Nutch 2.1 is not an option - I had to use 1.7 and write an indexer
> > myself). I believe I'm well on my way, but I had a few questions.
> >
> > So my first step in the process was to make sure that the NutchDocument
> > held the fields that I need. For me, I'm trying to keep track of the
> > Inlinks and Outlinks of individual pages. With that in mind I am adding
> > (with *.add()*) to the document two fields:
> >
> > - An Inlinks iterator with something like the following: *inlks =
> > inlinks.iterator();*
> >
> > - An array of Outlinks with something like the following: *outlinks =
> > parse.getData().getOutlinks();*
> >
> > All seems to go well, and later I am able to access the inlinks and
> > outlinks of the page from the NutchDocument as needed for the IndexWriter
> > plugin. Great (as an aside - this really is a wonderful plugin
> > structure). So here are my questions:
> >
> > (1) When I run the indexer with *bin/nutch index crawl/crawldb -linkdb
> > crawl/linkdb -dir crawl/segments* what documents will this index? Put
> > another way, if I have the IndexWriter plugin indexing to a database with
> > information read from a NutchDocument object, will there be one
> > NutchDocument for every website in my crawldb? Or will there be an
> > instance only for the new sites that have been fetched? How does that
> > work? I'm not familiar enough with the internals of Nutch to understand
> > exactly which NutchDocuments are built and when... Any clarification
> > would be much appreciated. I'm trying to make sure I'm writing to the
> > db efficiently while keeping my outlinks and inlinks up to date in HBase.
> >
>
> The indexer will generate a NutchDocument for every page in all the
> segments (since you used -dir). It has no knowledge of whether a page is
> already known or not.
>
>
> >
> > (2) When I add inlinks to my NutchDocument, I'm doing it from a filter
> > method like:
> > *NutchDocument filter(NutchDocument doc, Parse parse, Text url,
> > CrawlDatum datum, Inlinks inlinks)*
> > and simply using the inlinks instance passed into this method. However,
> > I'm not entirely clear on where those inlinks 'come from' at execution -
> > are they being read from the linkdb and then passed in to the method? Is
> > there a better way to read inlinks? I'm concerned that every time I run
> > "invertlinks" the links in my HBase instance will no longer be up to
> > date. Is this correct? And if so, is there a way to resolve that?
> >
>
> The inlinks do indeed come from the linkdb, and you get them only for the
> docs in the segments passed as input to the indexer. The indexer will
> update your HBase tables with these inlinks.
>
> HTH
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
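
Editorial sketch (not from either poster): since the indexer emits a NutchDocument for every page in all the segments passed to it, one way to keep re-indexing harmless is to key each write by the page URL so that a later run overwrites the earlier row instead of duplicating it. The example below is plain Java with an in-memory map standing in for an HBase table; the class name, the write() helper, and the example.com URLs are all hypothetical, and no Nutch or HBase API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LinkUpsertSketch {
    // Hypothetical stand-in for an HBase table: row key (page URL) -> inlink URLs.
    static final Map<String, List<String>> table = new HashMap<>();

    // Replaces the whole row for this URL, so writing the same page again
    // (e.g. because it appears in several segments) leaves a single row
    // holding the most recently written inlinks.
    static void write(String url, List<String> inlinks) {
        table.put(url, new ArrayList<>(inlinks));
    }

    public static void main(String[] args) {
        String page = "http://example.com/a";

        // First indexing run: one inlink known for the page.
        write(page, Arrays.asList("http://example.com/b"));

        // Later run after invertlinks: the same page is written again
        // with the refreshed set of inlinks.
        write(page, Arrays.asList("http://example.com/b", "http://example.com/c"));

        System.out.println(table.size() + " row(s); inlinks for " + page + ": " + table.get(page));
    }
}
```

With a real HBase table the same idea would presumably map to using the URL as the row key; whether stale inlinks need an explicit delete depends on how the columns are laid out, which this sketch does not model.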
