"re-crawling and controlling that process seems like an issue in need of covering to me"
I am also very interested in understanding that better, as well as better
strategies for crawling a single site and some benchmarks linking
configuration to performance.

"... configuring a development environment with a proper Eclipse set up"
"Automatic restart on reboot..."

Those also interest me.

Looking forward to it.

--
Emmanuel de Castro Santana

2010/5/17 Davide Del Vecchio <[email protected]>

> Nice to hear: this book can be very helpful.
> I totally agree with the points that Mark shared. I especially feel the
> urgency of the point about describing "Grabbing enough security metadata at
> spider/index time to do early binding"
> and possibly what the extension points are for writing to a different
> index (not Lucene/Solr).
> That brings up the topic of configuring a development environment with a
> proper Eclipse set up.
>
> good news
>
> On Mon, May 17, 2010 at 7:42 AM, Mark Bennett <[email protected]>
> wrote:
> > Wow, really glad to see this moving forward. With Manning I'm guessing??
> >
> > My top advice:
> > * Debugging, Debugging, DEBUGGING!!!!!!
> >
> > I imagine you'd have a lot of ideas on this. In addition, I'd suggest:
> > * Systematically break different parts of the system and record the
> > symptoms, error messages, etc.
> >
> > Also:
> > * I agree with Alex about incremental indexing
> > * Setting up spidering for a lot of specific sites, how do you handle rules
> > for hundreds of sites
> > * As above, but also deduping www vs non-www prefix URLs from the same site
> > * Detailed setup on Windows, including an outline of cygwin install and
> > different path syntaxes
> > * AND/or perhaps a rewrite in Windows CMD
> > * Integrating with Solr. Yes, Nutch 1.0 had some prelim integration. And
> > Lucid Imagination has an article on it. HOWEVER there needs to be a lot
> > more info, like meta data fields, etc. Tradeoffs
> > * Managing from a web GUI
> > * A WARNING to always carefully check the Nutch matches Google brings back.
> > For some reason it obsesses about 0.7 pages, but of course things changed
> > quite a bit in 0.8.
> > * A complete walk through setting up a debugging environment with Eclipse.
> > To do real work you'd need 3 Eclipse projects set up, so Lucene, Solr and
> > Nutch, with project linkages and sync'd source versions. And when you
> > check out Java code from ASF you can't just use the ant file to import into
> > Eclipse, it doesn't work right.
> > * Also a bit about using patches and the patch submission process, again
> > assuming Eclipse and covering any differences on Windows and Linux
> > * Integrating with Open Pipeline or UIMA or whatever other flexible pipeline
> > you like
> > * Complex encoded URLs
> > * Spider traps
> > * Automatic restart on reboot, for Linux, Windows and Mac
> > * Integrating filter packs for old and new MS Office and PDF files
> > * CACHING with Squid or Apache or something, so that when you need to re-run
> > over and over and over again to debug your document processing, you don't
> > have to keep hitting the sites. I've seen two instances where this was
> > attempted but it didn't seem to work as expected, though I never found out
> > why.
> > * Benchmarking approximations: Assuming decent Internet connectivity, how
> > much can you do with a single Nutch box (pick some stock configuration)
> > * It'd be nice if you could include benchmarks comparing stock SATA drives
> > to fast SCSI / RAID / Fiber. My advice is to stick with stock drives unless
> > a project demonstrates that it needs caviar level storage, BUT I could be
> > wrong, and this is certainly counter to what some of the enterprise search
> > vendors advise. Some projects are small enough that they actually DON'T
> > need high scalability - maybe they only need to index 10,000 pages on a LAN.
> > * SATA RAID vs non-RAIDed SATA drives - some RAID can actually slow down
> > writes.
> > * Overhead (or benefit) of NAS, SAN, iSCSI
> > * What is the initial hit in performance when going from a single box to a
> > multibox configuration. In other words there is some overhead in
> > distributing work - in one system I was consulted on it actually seemed to
> > be *VERY* high, though I didn't have access to that system so never got to
> > the bottom of it, but in many setups the client reported MUCH FASTER
> > spidering with a 1 box setup than a 3 box setup - this seemed pretty
> > consistent for them. I'm not suggesting you debug this per se, I'm just
> > suggesting you do actual benchmarks in your book, believe no one!
> > * Setting up Nutch in the Amazon Cloud, and specific issues with the various
> > temp directories, "local drives" and persistent drives. Benchmarks.
> > * Issues (if any) with VMware, Xen and Microsoft Hyper-V virtual machines
> > * Grabbing enough security metadata at spider/index time to do early binding
> > security. Basically fetching the ACL info and injecting it in. This needs
> > coordination on the client side, like from Solr.
> > * Aging of your Nutch segments. When do you really need to blow away
> > everything and start from scratch.
> > * How do you recover from an interrupted / crashed spider / index run that
> > took days or weeks to run (so you don't want to "just start over")
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / [email protected]
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >
> >
> > On Sun, May 16, 2010 at 9:18 PM, Alex Basa <[email protected]> wrote:
> >
> >> Dennis,
> >>
> >> One topic that had taken me a long time to figure out and lots of people
> >> have been having issues with is doing an incremental index. I don't think
> >> it was documented anywhere and it would be great if you could cover it.
> >>
> >> Thanks,
> >>
> >> Alex
> >>
> >> --- On Sun, 5/16/10, Dennis Kubes <[email protected]> wrote:
> >>
> >> > From: Dennis Kubes <[email protected]>
> >> > Subject: Writing a Book on Nutch
> >> > To: [email protected]
> >> > Date: Sunday, May 16, 2010, 8:27 PM
> >> > Hi Everyone,
> >> >
> >> > It has been a long time coming but I have finally started
> >> > to write a book on Nutch. It will be self published
> >> > and should be available in PDF / paperback form in less than
> >> > a month hopefully.
> >> >
> >> > A while back we discussed a Nutch training seminar on the
> >> > list. I am not ready to do a full on seminar yet but I
> >> > will be putting up some training and tutorial videos in the
> >> > next few weeks. I will update the list as those become
> >> > available.
> >> >
> >> > I already have a general outline but it would help me to
> >> > know the following:
> >> >
> >> > 1) What types of things you would want explained in a book
> >> > / videos on Nutch?
> >> > 2) What are the biggest problems you face using Nutch?
> >> > 3) Anything special you would like answered or explained?
> >> >
> >> > Thanks in advance for any responses.
> >> >
> >> > Dennis
> >> >
> >> >
> >>
> >>
> >>
> >>
> >
>

--
Emmanuel de Castro Santana

