Re: Writing a Book on Nutch

Davide Del Vecchio Mon, 17 May 2010 03:00:57 -0700

Nice to hear: this book could be very helpful.
I totally agree with the points that Mark shared. I especially feel the
urgency of the point about "Grabbing enough security metadata at
spider/index time to do early binding",
and possibly what the extension points are for writing to a different
index (not Lucene/Solr).
That also brings up the topic of configuring a development environment with a
proper Eclipse setup.


good news

On Mon, May 17, 2010 at 7:42 AM, Mark Bennett <[email protected]> wrote:
> Wow, really glad to see this moving forward.  With Manning I'm guessing??
>
> My top advice:
> * Debugging, Debugging, DEBUGGING!!!!!!
>
> I imagine you'd have a lot of ideas on this.  In addition, I'd suggest:
> * Systematically break different parts of the system and record the
> symptoms, error messages, etc.
>
> Also:
> * I agree with Alex about incremental indexing
> * Setting up spidering for a lot of specific sites: how do you handle rules
> for hundreds of sites?
> * As above, but also deduping www vs non-www prefix URLs from the same site
> * Detailed setup on Windows, including an outline of cygwin install and
> different path syntaxes
> * AND/or perhaps a rewrite in Windows CMD
> * Integrating with Solr.  Yes, Nutch 1.0 had some prelim integration.  And
> Lucid Imagination has an article on it.  HOWEVER there needs to be a lot
> more info, like meta data fields, etc.  Tradeoffs
> * Managing from a web GUI
> * A WARNING to always carefully check the Nutch matches Google brings back.
> For some reason it obsesses about 0.7 pages, but of course things changed
> quite a bit in 0.8.
> * A complete walk through setting up a debugging environment with Eclipse.
> To do real work you'd need 3 Eclipse projects set up (Lucene, Solr and
> Nutch), with project linkages and sync'd source versions.  And when you
> check out Java code from ASF you can't just use the ant file to import into
> Eclipse; it doesn't work right.
> * Also a bit about using patches and the patch submission process, again
> assuming Eclipse and covering any differences on Windows and Linux
> * Integrating with Open Pipeline or UIMA or whatever other flexible pipeline
> you like
> * Complex encoded URLs
> * Spider traps
> * Automatic restart on reboot, for Linux, Windows and Mac
> * Integrating filter packs for old and new MS Office and PDF files
> * CACHING with Squid or Apache or something, so that when you need to re-run
> over and over and over again to debug your document processing, you don't
> have to keep hitting the sites. I've seen two instances where this was
> attempted but it didn't seem to work as expected, though I never found out
> why.
> * Benchmarking approximations: Assuming decent Internet connectivity, how
> much can you do with a single Nutch box (pick some stock configuration)?
> * It'd be nice if you could include benchmarks comparing stock SATA drives
> to fast SCSI / RAID / Fiber.  My advice is to stick with stock drives unless
> a project demonstrates that it needs caviar level storage, BUT I could be
> wrong, and this is certainly counter to what some of the enterprise search
> vendors advise.  Some projects are small enough that they actually DON'T
> need high scalability - maybe they only need to index 10,000 pages on a LAN.
> * SATA RAID vs non-RAID SATA drives - some RAID can actually slow down
> writes.
> * Overhead (or benefit) of NAS, SAN, iSCSI
> * What is the initial hit in performance when going from a single box to a
> multibox configuration?  In other words there is some overhead in
> distributing work - in one system I consulted on it actually seemed to
> be *VERY* high, though I didn't have access to that system so never got to
> the bottom of it, but in many setups the client reported MUCH FASTER
> spidering with a 1 box setup than a 3 box setup - this seemed pretty
> consistent for them.  I'm not suggesting you debug this per se, I'm just
> suggesting you do actual benchmarks in your book; believe no one!
> * Setting up Nutch in the Amazon Cloud, and specific issues with the various
> temp directories, "local drives" and persistent drives.  Benchmarks.
> * Issues (if any) with VMware, Xen and Microsoft Hyper-V virtual machines
> * Grabbing enough security metadata at spider/index time to do early binding
> security.  Basically fetching the ACL info and injecting it in.  This needs
> coordination on the client side, like from Solr.
> * Aging of your Nutch segments.  When do you really need to blow away
> everything and start from scratch?
> * How do you recover from an interrupted / crashed spider / index run that
> took days or weeks to run (so you don't want to "just start over")?
>
> --
> Mark Bennett / New Idea Engineering, Inc. / [email protected]
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Sun, May 16, 2010 at 9:18 PM, Alex Basa <[email protected]> wrote:
>
>> Dennis,
>>
>> One topic that took me a long time to figure out, and that lots of people
>> have been having issues with, is doing an incremental index.  I don't think
>> it was documented anywhere and it would be great if you could cover it.
>>
>> Thanks,
>>
>> Alex
>>
>> --- On Sun, 5/16/10, Dennis Kubes <[email protected]> wrote:
>>
>> > From: Dennis Kubes <[email protected]>
>> > Subject: Writing a Book on Nutch
>> > To: [email protected]
>> > Date: Sunday, May 16, 2010, 8:27 PM
>> > Hi Everyone,
>> >
>> > It has been a long time coming but I have finally started
>> > to write a book on Nutch.  It will be self published
>> > and should be available in PDF / paperback form in less than
>> > a month hopefully.
>> >
>> > A while back we discussed a Nutch training seminar on the
>> > list.  I am not ready to do a full on seminar yet but I
>> > will be putting up some training and tutorial videos in the
>> > next few weeks.  I will update the list as those become
>> > available.
>> >
>> > I already have a general outline but it would help me to
>> > know the following:
>> >
>> > 1) What types of things you would want explained in a book
>> > / videos on Nutch?
>> > 2) What are the biggest problems you face using Nutch?
>> > 3) Anything special you would like answered or explained?
>> >
>> > Thanks in advance for any responses.
>> >
>> > Dennis
>> >
>> >
>
