Re: Writing a Book on Nutch

Mark Bennett Mon, 17 May 2010 01:08:43 -0700

Wow, really glad to see this moving forward.  With Manning I'm guessing??

My top advice:
* Debugging, Debugging, DEBUGGING!!!!!!

I imagine you'd have a lot of ideas on this.  In addition, I'd suggest:
* Systematically break different parts of the system and record the
symptoms, error messages, etc.

Also:
* I agree with Alex about incremental indexing
* Setting up spidering for a lot of specific sites: how do you handle rules
for hundreds of sites?
* As above, but also deduping www vs non-www prefix URLs from the same site
* Detailed setup on Windows, including an outline of cygwin install and
different path syntaxes
* AND/or perhaps a rewrite in Windows CMD
* Integrating with Solr.  Yes, Nutch 1.0 had some prelim integration.  And
Lucid Imagination has an article on it.  HOWEVER there needs to be a lot
more info, like metadata fields, tradeoffs, etc.
* Managing from a web GUI
* A WARNING to always carefully check the Nutch matches Google brings back.
For some reason it obsesses about 0.7 pages, but of course things changed
quite a bit in 0.8.
* A complete walk through setting up a debugging environment with Eclipse.
To do real work you'd need 3 Eclipse projects set up (Lucene, Solr and
Nutch), with project linkages and sync'd source versions.  And when you
check out Java code from ASF you can't just use the ant file to import into
Eclipse; it doesn't work right.
* Also a bit about using patches and the patch submission process, again
assuming Eclipse and covering any differences on Windows and Linux
* Integrating with Open Pipeline or UIMA or whatever other flexible pipeline
you like
* Complex encoded URLs
* Spider traps
* Automatic restart on reboot, for Linux, Windows and Mac
* Integrating filter packs for old and new MS Office and PDF files
* CACHING with Squid or Apache or something, so that when you need to re-run
over and over and over again to debug your document processing, you don't
have to keep hitting the sites. I've seen two instances where this was
attempted but it didn't seem to work as expected, though I never found out
why.
* Benchmarking approximations: Assuming decent Internet connectivity, how
much can you do with a single Nutch box (pick some stock configuration)?
* It'd be nice if you could include benchmarks comparing stock SATA drives
to fast SCSI / RAID / Fiber.  My advice is to stick with stock drives unless
a project demonstrates that it needs caviar level storage, BUT I could be
wrong, and this is certainly counter to what some of the enterprise search
vendors advise.  Some projects are small enough that they actually DON'T
need high scalability - maybe they only need to index 10,000 pages on a LAN.
* SATA RAID vs non-RAID SATA drives - some RAID can actually slow down
writes.
* Overhead (or benefit) of NAS, SAN, iSCSI
* What is the initial hit in performance when going from a single box to a
multibox configuration?  In other words, there is some overhead in
distributing work - in one system I was consulted on it actually seemed to
be *VERY* high, though I didn't have access to that system so never got to
the bottom of it.  In many setups the client reported MUCH FASTER
spidering with a 1 box setup than a 3 box setup - this seemed pretty
consistent for them.  I'm not suggesting you debug this per se, I'm just
suggesting you do actual benchmarks in your book; believe no one!
* Setting up Nutch in the Amazon Cloud, and specific issues with the various
temp directories, "local drives" and persistent drives.  Benchmarks.
* Issues (if any) with VMware, Xen and Microsoft Hyper-V virtual machines
* Grabbing enough security metadata at spider/index time to do early binding
security.  Basically fetching the ACL info and injecting it in.  This needs
coordination on the client side, like from Solr.
* Aging of your Nutch segments.  When do you really need to blow away
everything and start from scratch?
* How do you recover from an interrupted / crashed spider / index run that
took days or weeks to run (so you don't want to "just start over")?
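
To make the www vs non-www item above concrete, here is a rough sketch of
the kind of deduping I mean.  It is plain Python and NOT Nutch code: the
function names are made up for illustration, and it assumes the www and
non-www hosts really do serve the same pages, which you'd want to verify
per site before collapsing them.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url):
    """Build a comparison key that treats www.host and host as the same site.

    Hypothetical helper for illustration only; assumes both host forms
    serve identical content.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port, since a different port may be a different server.
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    path = parts.path or "/"
    # Drop the fragment; it never reaches the server.
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))


def dedupe_urls(urls):
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

In a real crawl you'd want this applied before URLs get into the crawl db
rather than after the fact, but the idea is the same: pick one form per
site and map everything else onto it.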

--
Mark Bennett / New Idea Engineering, Inc. / [email protected]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Sun, May 16, 2010 at 9:18 PM, Alex Basa <[email protected]> wrote:

> Dennis,
>
> One topic that had taken me a long time to figure out and lots of people
> have been having issues with is doing an incremental index.  I don't think
> it was documented anywhere and it would be great if you could cover it.
>
> Thanks,
>
> Alex
>
> --- On Sun, 5/16/10, Dennis Kubes <[email protected]> wrote:
>
> > From: Dennis Kubes <[email protected]>
> > Subject: Writing a Book on Nutch
> > To: [email protected]
> > Date: Sunday, May 16, 2010, 8:27 PM
> > Hi Everyone,
> >
> > It has been a long time coming but I have finally started
> > to write a book on Nutch.  It will be self published
> > and should be available in PDF / paperback form in less than
> > a month hopefully.
> >
> > A while back we discussed a Nutch training seminar on the
> > list.  I am not ready to do a full on seminar yet but I
> > will be putting up some training and tutorial videos in the
> > next few weeks.  I will update the list as those become
> > available.
> >
> > I already have a general outline but it would help me to
> > know the following:
> >
> > 1) What types of things you would want explained in a book
> > / videos on Nutch?
> > 2) What are the biggest problems you face using Nutch?
> > 3) Anything special you would like answered or explained?
> >
> > Thanks in advance for any responses.
> >
> > Dennis
> >
> >
