Behemoth [1] consumes Nutch 1.x segments and can push them to, among others, GATE. Nutch 
comes with its own Tika parser.

[1]: https://github.com/jnioche/behemoth
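For the "dumping the segments" part, a rough sketch of what that looks like on the command line (the segment path and output dirs are examples; the Behemoth class name is taken from the repo and may differ between versions, so check your checkout):

```shell
# Nutch 1.x: dump a crawled segment to plain text with SegmentReader.
# The -no* flags suppress parts you don't need, e.g. keep only parsed text.
bin/nutch readseg -dump crawl/segments/20120309123456 segdump \
  -nocontent -nofetch -nogenerate -noparse -noparsedata

# Behemoth: convert a Nutch segment into a Behemoth corpus on HDFS,
# which downstream modules (Tika, GATE, ...) can then process.
# Class name as in the Behemoth sources; verify against your version.
hadoop jar behemoth-core-*-job.jar \
  com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob \
  crawl/segments/20120309123456 behemoth-corpus
```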

cheers

On Friday 09 March 2012 16:19:03 Piet van Remortel wrote:
> Hi all,
> 
> Pretty new to nutch.  Trying to create a setup where nutch repeatedly
> crawls a selected set of webpages, to feed the content into a pipeline for
> text analysis etc. (e.g. Nutch, Tika, GATE, ...)
> 
> We are unclear about what setup/version/approach to use for this.  To be
> honest, the plethora of snippets of (outdated?) docs doesn't help in getting
> a clear view on things.
> 
> The major hurdle seems to be flexible access to the crawled content,
> both from a search perspective (mentions of certain words) and from a
> systematic one (e.g. database queries to process pages in batch).
> Next to solr queries, the only way seems dumping the segments with the
> SegmentReader, and processing those.
> But access to the segments seems cumbersome and not very flexible to
> integrate into a larger setup.  And slow.
> 
> I was happy to see the GORA access to e.g. MySQL in Nutch 2.0, but now that
> seems to all have been side-tracked.  I got crawled pages into MySQL in 15
> minutes, which is great!  I don't see what the alternative to such a setup
> is in Nutch 1.4.
> 
> Alternatives for writing to MySQL from Nutch 1.4 seem less straightforward,
> as mentioned (extending Nutch so that where the NutchPage gets written to
> Solr, it is diverted to MySQL instead?  There must be a better way.)
> 
> Could somebody with some experience in these kinds of setups advise in what
> direction we should consider going ?
> 
> I would like a flexible setup where Nutch can run continuously, being fed
> new seed URLs over time, with flexible and efficient access to the crawled
> results so it can be integrated into a larger setup.
> 
> thanks !
> 
> pvremort

-- 
Markus Jelsma - CTO - Openindex
