Hi All,

I'm using Nutch 1.7 (trunk) and writing a plugin to index to HBase (using Nutch 2.1 is not an option - I had to use 1.7 and write an indexer myself). I believe I'm well on my way, but I have a few questions.
So my first step in the process was to make sure that the NutchDocument held the fields that I need. I'm trying to keep track of the Inlinks and Outlinks of individual pages. With that in mind, I am adding (with *.add()*) two fields to the document:

- An Inlinks iterator, with something like the following: *inlks = inlinks.iterator();*
- An array of Outlinks, with something like the following: *outlinks = parse.getData().getOutlinks();*

All seems to go well, and later I am able to access the inlinks and outlinks of the page from the NutchDocument as needed for the IndexWriter plugin. Great (as an aside - this really is a wonderful plugin structure).

So here are my questions:

(1) When I run the indexer with

*bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments*

what documents will this index? Put another way, if I have the IndexWriter plugin indexing to a database with information read from a NutchDocument object, will there be one NutchDocument for every website in my crawldb? Or will there be an instance only for the new sites that have been fetched? How does that work? I'm not familiar enough with the internals of Nutch to understand exactly which NutchDocuments are built and when. Any clarification would be much appreciated. I'm trying to make sure I'm writing to the db efficiently while keeping my outlinks and inlinks up to date in HBase.

(2) When I add inlinks to my NutchDocument, I'm doing it from a filter method like:

*NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)*

and simply using the inlinks instance passed into this method. However, I'm not entirely clear on where those inlinks 'come from' at execution - are they being read from the linkdb and then passed in to the method? Is there a better way to read inlinks? I'm concerned that every time I run "invertlinks", the links in my HBase instance will no longer be up to date. Is this correct? And if so, is there a way to resolve that?

Thanks,
Alex

