Wow, really glad to see this moving forward. With Manning, I'm guessing?

My top advice:

* Debugging, Debugging, DEBUGGING!!!!!!
  I imagine you'd have a lot of ideas on this. In addition, I'd suggest:
  * Systematically break different parts of the system and record the symptoms, error messages, etc.

Also:

* I agree with Alex about incremental indexing
* Setting up spidering for a lot of specific sites: how do you handle rules for hundreds of sites?
* As above, but also deduping www vs non-www prefix URLs from the same site
* Detailed setup on Windows, including an outline of the Cygwin install and the different path syntaxes
* AND/or perhaps a rewrite in Windows CMD
* Integrating with Solr. Yes, Nutch 1.0 had some prelim integration, and Lucid Imagination has an article on it. HOWEVER there needs to be a lot more info, like metadata fields, etc. Tradeoffs.
* Managing from a web GUI
* A WARNING to always carefully check the Nutch matches Google brings back. For some reason it obsesses about 0.7 pages, but of course things changed quite a bit in 0.8.
* A complete walkthrough of setting up a debugging environment with Eclipse. To do real work you'd need 3 Eclipse projects set up (Lucene, Solr and Nutch), with project linkages and sync'd source versions. And when you check out Java code from ASF you can't just use the ant file to import into Eclipse; it doesn't work right.
* Also a bit about using patches and the patch submission process, again assuming Eclipse and covering any differences on Windows and Linux
* Integrating with Open Pipeline or UIMA or whatever other flexible pipeline you like
* Complex encoded URLs
* Spider traps
* Automatic restart on reboot, for Linux, Windows and Mac
* Integrating filter packs for old and new MS Office and PDF files
* CACHING with Squid or Apache or something, so that when you need to re-run over and over and over again to debug your document processing, you don't have to keep hitting the sites. I've seen two instances where this was attempted but it didn't seem to work as expected, though I never found out why.
* Benchmarking approximations: assuming decent Internet connectivity, how much can you do with a single Nutch box (pick some stock configuration)?
* It'd be nice if you could include benchmarks comparing stock SATA drives to fast SCSI / RAID / Fiber. My advice is to stick with stock drives unless a project demonstrates that it needs caviar-level storage, BUT I could be wrong, and this is certainly counter to what some of the enterprise search vendors advise. Some projects are small enough that they actually DON'T need high scalability - maybe they only need to index 10,000 pages on a LAN.
* SATA RAID vs non-RAIDed SATA drives - some RAID can actually slow down writes.
* Overhead (or benefit) of NAS, SAN, iSCSI
* What is the initial hit in performance when going from a single box to a multibox configuration? In other words, there is some overhead in distributing work. In one system I was consulted on it actually seemed to be *VERY* high, though I didn't have access to that system so never got to the bottom of it. In many setups the client reported MUCH FASTER spidering with a 1 box setup than a 3 box setup - this seemed pretty consistent for them. I'm not suggesting you debug this per se, I'm just suggesting you do actual benchmarks in your book. Believe no one!
* Setting up Nutch in the Amazon Cloud, and specific issues with the various temp directories, "local drives" and persistent drives. Benchmarks.
* Issues (if any) with VMware, Xen and Microsoft Hyper-V virtual machines
* Grabbing enough security metadata at spider/index time to do early-binding security. Basically fetching the ACL info and injecting it in. This needs coordination on the client side, like from Solr.
* Aging of your Nutch segments. When do you really need to blow away everything and start from scratch?
* How do you recover from an interrupted / crashed spider / index run that took days or weeks to run (so you don't want to "just start over")?

--
Mark Bennett / New Idea Engineering, Inc. / [email protected]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Sun, May 16, 2010 at 9:18 PM, Alex Basa <[email protected]> wrote:

> Dennis,
>
> One topic that had taken me a long time to figure out and lots of people
> have been having issues with is doing an incremental index. I don't think
> it was documented anywhere and it would be great if you could cover it.
>
> Thanks,
>
> Alex
>
> --- On Sun, 5/16/10, Dennis Kubes <[email protected]> wrote:
>
> > From: Dennis Kubes <[email protected]>
> > Subject: Writing a Book on Nutch
> > To: [email protected]
> > Date: Sunday, May 16, 2010, 8:27 PM
> > Hi Everyone,
> >
> > It has been a long time coming but I have finally started
> > to write a book on Nutch. It will be self published
> > and should be available in PDF / paperback form in less than
> > a month hopefully.
> >
> > A while back we discussed a Nutch training seminar on the
> > list. I am not ready to do a full on seminar yet but I
> > will be putting up some training and tutorial videos in the
> > next few weeks. I will update the list as those become
> > available.
> >
> > I already have a general outline but it would help me to
> > know the following:
> >
> > 1) What types of things you would want explained in a book
> > / videos on Nutch?
> > 2) What are the biggest problems you face using Nutch?
> > 3) Anything special you would like answered or explained?
> >
> > Thanks in advance for any responses.
> >
> > Dennis
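
To make the "www vs non-www" item in the list above a bit more concrete, here is a minimal sketch in plain Python. It is NOT Nutch's own URL normalizer and does not use any Nutch API; the function names canonical_url and dedupe are made up for illustration. It also assumes that the www and bare hostnames really do serve the same content, which you would want to verify per site before collapsing them.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Build a dedup key for a URL (hypothetical helper, not a Nutch API).

    Assumption: 'www.example.com' and 'example.com' are the same site.
    The key lowercases the host, strips a leading 'www.', drops the
    fragment and any userinfo, and treats an empty path as '/'.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""          # urlsplit already lowercases hostname
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port, since a different port may be a different site.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    sample = [
        "http://www.example.com/x",
        "http://example.com/x",
        "http://example.com/y",
    ]
    for url in dedupe(sample):
        print(url)
```

With hundreds of sites you would probably want the "strip www" decision to be a per-site rule rather than a global one, which is exactly the kind of rule management the list above asks about.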

