Hi Julien,

Thanks a lot for the response. That clarifies everything!

Alex

On Wed, May 29, 2013 at 4:33 AM, Julien Nioche <[email protected]> wrote:

> Hi Alex
>
> On 28 May 2013 06:29, AC Nutch <[email protected]> wrote:
>
> > Hi All,
> >
> > I'm using Nutch 1.7 (trunk) and writing a plugin to index to HBase
> > (using Nutch2.1 is not an option - I had to use 1.7 and write an
> > indexer myself). I believe I'm well on my way, but I had a few
> > questions.
> >
> > So my first step in the process was to make sure that the
> > NutchDocument held the fields that I need. For me, I'm trying to keep
> > track of the Inlinks and Outlinks of individual pages. With that in
> > mind I am adding (with *.add()*) to the document two fields:
> >
> > - An Inlinks iterator with something like the following:
> >   *inlks = inlinks.iterator();*
> >
> > - An array of Outlinks with something like the following:
> >   *outlinks = parse.getData().getOutlinks();*
> >
> > All seems to go well, and later I am able to access the inlinks and
> > outlinks of the page from the NutchDocument as needed for the
> > IndexWriter plugin. Great (as an aside - this really is a wonderful
> > plugin structure). So here are my questions:
> >
> > (1) When I run the indexer with *bin/nutch index crawl/crawldb
> > -linkdb crawl/linkdb -dir crawl/segments* what documents will this
> > index? Put another way, if I have the IndexWriter plugin indexing to
> > a database with information read from a NutchDocument object, will
> > there be one NutchDocument for every website in my crawldb? Or will
> > there be an instance only for the new sites that have been fetched?
> > How does that work? I'm not familiar enough with the internals of
> > Nutch to understand exactly which NutchDocuments are built and
> > when... Any clarification would be much appreciated. I'm trying to
> > make sure I'm efficiently writing to the db while keeping my outlinks
> > and inlinks up to date in HBase.
>
> The indexer will generate a NutchDocument for every page in all the
> segments (since you used -dir). It has no knowledge of whether a page
> is already known or not.
>
> > (2) When I add inlinks to my NutchDocument, I'm doing it from a
> > filter method like:
> > *NutchDocument filter(NutchDocument doc, Parse parse, Text url,
> > CrawlDatum datum, Inlinks inlinks)*
> > and simply using the inlinks instance passed into this method.
> > However, I'm not entirely clear on where those inlinks 'come from' at
> > execution - are they being read from the linkdb and then passed in to
> > the method? Is there a better way to read inlinks? I'm concerned that
> > every time I run "invertlinks" the links in my HBase instance will no
> > longer be up to date. Is this correct? And if so is there a way to
> > resolve that?
>
> The inlinks come from the linkdb indeed and you get them only for the
> docs in the segments passed as input to the indexer. The indexer will
> update your HBase tables with these inlinks.
>
> HTH
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

