"re-crawling and controlling that process seems like an issue in need of
covering to me"

I am also very interested in understanding that better.
I would also like to see better strategies for crawling a single site, plus
some benchmarks linking configuration to performance.

"... configuring a development environment  with a
proper Eclipse set up"

"Automatic restart on reboot..."

Those also interest me


Looking forward to it
--

Emmanuel de Castro Santana

2010/5/17 Davide Del Vecchio <[email protected]>

> Nice to hear: this book can be very helpful.
> I totally agree with the points that Mark shared. I especially feel the
> urgency of the point about describing "Grabbing enough security metadata
> at spider/index time to do early binding"
> and possibly what the extension points are for writing to a different
> index (not Lucene/Solr).
> That brings up the topic of configuring a development environment with a
> proper Eclipse set up
>
> good news
>
> On Mon, May 17, 2010 at 7:42 AM, Mark Bennett <[email protected]>
> wrote:
> > Wow, really glad to see this moving forward.  With Manning I'm guessing??
> >
> > My top advice:
> > * Debugging, Debugging, DEBUGGING!!!!!!
> >
> > I imagine you'd have a lot of ideas on this.  In addition, I'd suggest:
> > * Systematically break different parts of the system and record the
> > symptoms, error messages, etc.
> >
> > Also:
> > * I agree with Alex about incremental indexing
> > * Setting up spidering for a lot of specific sites, how do you handle
> > rules for hundreds of sites
> > * As above, but also deduping www vs non-www prefix URLs from the same
> > site
> > * Detailed setup on Windows, including an outline of cygwin install and
> > different path syntaxes
> > * AND/or perhaps a rewrite in Windows CMD
> > * Integrating with Solr.  Yes, Nutch 1.0 had some prelim integration.
> > And Lucid Imagination has an article on it.  HOWEVER there needs to be
> > a lot more info, like meta data fields, etc.  Tradeoffs
> > * Managing from a web GUI
> > * A WARNING to always carefully check the Nutch matches Google brings
> > back.  For some reason it obsesses about 0.7 pages, but of course things
> > changed quite a bit in 0.8.
> > * A complete walk through setting up a debugging environment with
> > Eclipse.  To do real work you'd need 3 Eclipse projects set up, so
> > Lucene, Solr and Nutch, with project linkages and sync'd source
> > versions.  And when you check out Java code from ASF you can't just use
> > the ant file to import into Eclipse, it doesn't work right.
> > * Also a bit about using patches and the patch submission process, again
> > assuming Eclipse and covering any differences on Windows and Linux
> > * Integrating with Open Pipeline or UIMA or whatever other flexible
> > pipeline you like
> > * Complex encoded URLs
> > * Spider traps
> > * Automatic restart on reboot, for Linux, Windows and Mac
> > * Integrating filter packs for old and new MS Office and PDF files
> > * CACHING with Squid or Apache or something, so that when you need to
> > re-run over and over and over again to debug your document processing,
> > you don't have to keep hitting the sites. I've seen two instances of
> > where this was attempted but it didn't seem to work as expected, though
> > I never found out why.
> > * Benchmarking approximations: Assuming decent Internet connectivity, how
> > much can you do with a single Nutch box (pick some stock configuration)
> > * It'd be nice if you could include benchmarks comparing stock SATA
> > drives to fast SCSI / Raid / Fiber.  My advice is to stick with stock
> > drives unless a project demonstrates that it needs caviar level storage,
> > BUT I could be wrong, and this is certainly counter to what some of the
> > enterprise search vendors advise.  Some projects are small enough that
> > they actually DON'T need high scalability - maybe they only need to
> > index 10,000 pages on a LAN.
> > * SATA RAID vs non-raided sata drives - some RAID can actually slow down
> > writes.
> > * Overhead (or benefit) of NAS, SAN, iSCSI
> > * What is the initial hit in performance when going from a single box to
> > a multibox configuration.  In other words there is some overhead in
> > distributing work - in one system I was consulted on it actually seemed
> > to be *VERY* high, though I didn't have access to that system so never
> > got to the bottom of it, but in many setups the client reported MUCH
> > FASTER spidering with a 1 box setup than a 3 box setup - this seemed
> > pretty consistent for them.  I'm not suggesting you debug this per se,
> > I'm just suggesting you do actual benchmarks in your book, believe no
> > one!
> > * Setting up Nutch in the Amazon Cloud, and specific issues with the
> > various temp directories, "local drives" and persistent drives.
> > Benchmarks.
> > * Issues (if any) with VMware, Xen and Microsoft Hyper-V virtual machines
> > * Grabbing enough security metadata at spider/index time to do early
> > binding security.  Basically fetching the ACL info and injecting it in.
> > This needs coordination on the client side, like from Solr.
> > * Aging of your Nutch segments.  When do you really need to blow away
> > everything and start from scratch.
> > * How do you recover from an interrupted / crashed spider / index run
> > that took days or weeks to run (so you don't want to "just start over")
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / [email protected]
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >
> >
> > On Sun, May 16, 2010 at 9:18 PM, Alex Basa <[email protected]> wrote:
> >
> >> Dennis,
> >>
> >> One topic that had taken me a long time to figure out and lots of people
> >> have been having issues with is doing an incremental index.  I don't
> >> think it was documented anywhere and it would be great if you could
> >> cover it.
> >>
> >> Thanks,
> >>
> >> Alex
> >>
> >> --- On Sun, 5/16/10, Dennis Kubes <[email protected]> wrote:
> >>
> >> > From: Dennis Kubes <[email protected]>
> >> > Subject: Writing a Book on Nutch
> >> > To: [email protected]
> >> > Date: Sunday, May 16, 2010, 8:27 PM
> >> > Hi Everyone,
> >> >
> >> > It has been a long time coming but I have finally started
> >> > to write a book on Nutch.  It will be self published
> >> > and should be available in PDF / paperback form in less than
> >> > a month hopefully.
> >> >
> >> > A while back we discussed a Nutch training seminar on the
> >> > list.  I am not ready to do a full on seminar yet but I
> >> > will be putting up some training and tutorial videos in the
> >> > next few weeks.  I will update the list as those become
> >> > available.
> >> >
> >> > I already have a general outline but it would help me to
> >> > know the following:
> >> >
> >> > 1) What types of things you would want explained in a book
> >> > / videos on Nutch?
> >> > 2) What are the biggest problems you face using Nutch?
> >> > 3) Anything special you would like answered or explained?
> >> >
> >> > Thanks in advance for any responses.
> >> >
> >> > Dennis
> >> >
> >> >
> >>
> >>
> >>
> >>
> >>
> >
>



