Dear Piet,

First, you're absolutely right about the state of the documentation. We have to 
deal with this in the near future.
Although nutchgora is currently developed on a branch, it is very much alive
and kicking. Moreover, a first Nutch 2.0 release, based on the nutchgora
branch, is in the pipeline.
We are using the nutchgora branch with HBase on a multi-node Hadoop cluster.
This is not an experiment; it runs for a real-world customer.

If you had good experiences with nutch+gora+mysql in the past, just check out
the head of the nutchgora branch, or wait for the release.

Mathijs



 

On Mar 9, 2012, at 16:19 , Piet van Remortel wrote:

> Hi all,
> 
> Pretty new to nutch.  Trying to create a setup where nutch repeatedly
> crawls a selected set of webpages, to feed the content into a pipeline for
> text analysis etc. (e.g. Nutch, Tika, GATE, ...)
> 
> We are unclear about what setup/version/approach to use for this.   To be
> honest, the plethora of snippets of (outdated?) docs don't help in getting
> a clear view on things.
> 
> The major hurdle seems to be the flexible access to the crawled content.
> Both from a search (mentions of certain words) as from a systematic (e.g.
> database queries to process pages in batch) point of view.
> Next to solr queries, the only way seems dumping the segments with the
> SegmentReader, and processing those.
> But access to the segments seems cumbersome and not very flexible to
> integrate into a larger setup.  And slow.
> 
> I was happy to see the GORA access to e.g. MySQL in Nutch 2.0, but now that
> seems to all have been side-tracked.  I got crawled pages in MySQL in 15
> minutes, which is great !  I don't see what the alternative for a setup
> like that is in Nutch 1.4 ?
> 
> Alternatives to write to MySQL from Nutch 1.4 seem less straightforward as
> mentioned (extending nutch where the NutchPage gets written to SOLR and
> diverting to MySQL .. ?  There must be a better way.)
> 
> Could somebody with some experience in these kinds of setups advise in what
> direction we should consider going ?
> 
> I would like a flexible setup, where nutch can run continuously, being fed
> with new seed URLs through time, and flexible and efficient access to the
> crawled results to integrate this in a larger setup.
> 
> thanks !
> 
> pvremort
