Nice to hear: this book can be very helpful. I totally agree with the points that Mark shared. I especially feel the urgency of the point about "Grabbing enough security metadata at spider/index time to do early binding", and possibly of describing what the extension points are for writing to a different index (not Lucene/Solr). That brings up the topic of configuring a development environment with a proper Eclipse setup.
Good news.

On Mon, May 17, 2010 at 7:42 AM, Mark Bennett <[email protected]> wrote:
> Wow, really glad to see this moving forward.  With Manning I'm guessing??
>
> My top advice:
> * Debugging, Debugging, DEBUGGING!!!!!!
>
> I imagine you'd have a lot of ideas on this.  In addition, I'd suggest:
> * Systematically break different parts of the system and record the
> symptoms, error messages, etc.
>
> Also:
> * I agree with Alex about incremental indexing
> * Setting up spidering for a lot of specific sites, how do you handle rules
> for hundreds of sites
> * As above, but also deduping www vs non-www prefix URLs from the same site
> * Detailed setup on Windows, including an outline of Cygwin install and
> different path syntaxes
> * AND/or perhaps a rewrite in Windows CMD
> * Integrating with Solr.  Yes, Nutch 1.0 had some prelim integration.  And
> Lucid Imagination has an article on it.  HOWEVER there needs to be a lot
> more info, like meta data fields, etc.  Tradeoffs
> * Managing from a web GUI
> * A WARNING to always carefully check the Nutch matches Google brings back.
> For some reason it obsesses about 0.7 pages, but of course things changed
> quite a bit in 0.8.
> * A complete walk through setting up a debugging environment with Eclipse.
> To do real work you'd need 3 Eclipse projects setup, so Lucene, Solr and
> Nutch, with project linkages and sync'd source versions.  And when you
> checkout Java code from ASF you can't just use the ant file to import into
> Eclipse, it doesn't work right.
> * Also a bit about using patches and the patch submission process, again
> assuming Eclipse and covering any differences on Windows and Linux
> * Integrating with Open Pipeline or UIMA or whatever other flexible pipeline
> you like
> * Complex encoded URLs
> * Spider traps
> * Automatic restart on reboot, for Linux, Windows and Mac
> * Integrating filter packs for old and new MS Office and PDF files
> * CACHING with Squid or Apache or something, so that when you need to re-run
> over and over and over again to debug your document processing, you don't
> have to keep hitting the sites.  I've seen two instances of where this was
> attempted but it didn't seem to work as expected, though I never found out
> why.
> * Benchmarking approximations: Assuming decent Internet connectivity, how
> much can you do with a single Nutch box (pick some stock configuration)
> * It'd be nice if you could include benchmarks comparing stock SATA drives
> to fast SCSI / RAID / Fiber.  My advice is to stick with stock drives unless
> a project demonstrates that it needs caviar level storage, BUT I could be
> wrong, and this is certainly counter to what some of the enterprise search
> vendors advise.  Some projects are small enough that they actually DON'T
> need high scalability - maybe they only need to index 10,000 pages on a LAN.
> * SATA RAID vs non-RAIDed SATA drives - some RAID can actually slow down
> writes.
> * Overhead (or benefit) of NAS, SAN, iSCSI
> * What is the initial hit in performance when going from a single box to a
> multibox configuration.  In other words there is some overhead in
> distributing work - in one system I was consulted on it actually seemed to
> be *VERY* high, though I didn't have access to that system so never got to
> the bottom of it, but in many setups the client reported MUCH FASTER
> spidering with a 1 box setup than a 3 box setup - this seemed pretty
> consistent for them.  I'm not suggesting you debug this per se, I'm just
> suggesting you do actual benchmarks in your book, believe no one!
> * Setting up Nutch in the Amazon Cloud, and specific issues with the various
> temp directories, "local drives" and persistent drives.  Benchmarks.
> * Issues (if any) with VMware, Xen and Microsoft Hyper-V virtual machines
> * Grabbing enough security metadata at spider/index time to do early binding
> security.  Basically fetching the ACL info and injecting it in.  This needs
> coordination on the client side, like from Solr.
> * Aging of your Nutch segments.  When do you really need to blow away
> everything and start from scratch.
> * How do you recover from an interrupted / crashed spider / index run that
> took days or weeks to run (so you don't want to "just start over")
>
> --
> Mark Bennett / New Idea Engineering, Inc. / [email protected]
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Sun, May 16, 2010 at 9:18 PM, Alex Basa <[email protected]> wrote:
>
>> Dennis,
>>
>> One topic that had taken me a long time to figure out and lots of people
>> have been having issues with is doing an incremental index.  I don't think
>> it was documented anywhere and it would be great if you could cover it.
>>
>> Thanks,
>>
>> Alex
>>
>> --- On Sun, 5/16/10, Dennis Kubes <[email protected]> wrote:
>>
>> > From: Dennis Kubes <[email protected]>
>> > Subject: Writing a Book on Nutch
>> > To: [email protected]
>> > Date: Sunday, May 16, 2010, 8:27 PM
>> > Hi Everyone,
>> >
>> > It has been a long time coming but I have finally started
>> > to write a book on Nutch.  It will be self-published
>> > and should be available in PDF / paperback form in less than
>> > a month hopefully.
>> >
>> > A while back we discussed a Nutch training seminar on the
>> > list.  I am not ready to do a full-on seminar yet but I
>> > will be putting up some training and tutorial videos in the
>> > next few weeks.  I will update the list as those become
>> > available.
>> >
>> > I already have a general outline but it would help me to
>> > know the following:
>> >
>> > 1) What types of things you would want explained in a book
>> > / videos on Nutch?
>> > 2) What are the biggest problems you face using Nutch?
>> > 3) Anything special you would like answered or explained?
>> >
>> > Thanks in advance for any responses.
>> >
>> > Dennis

