Behemoth [1] eats Nutch 1.x segments and can push them, among other targets, to GATE. Nutch comes with its own Tika parser.
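As a rough sketch, converting a Nutch 1.x segment into a Behemoth corpus looks something like the following (the jar name, class path and segment timestamp are illustrative; check them against your Behemoth build):

```shell
# Convert one Nutch segment into a Behemoth sequence file corpus.
# Paths and the job jar name are placeholders; adjust to your build.
hadoop jar behemoth-io-*-job.jar \
  com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob \
  crawl/segments/20120309123456 \
  behemoth-corpus
```

The resulting corpus can then be fed to Behemoth's GATE or Tika modules downstream.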
[1]: https://github.com/jnioche/behemoth

cheers

On Friday 09 March 2012 16:19:03 Piet van Remortel wrote:
> Hi all,
>
> Pretty new to Nutch. Trying to create a setup where Nutch repeatedly
> crawls a selected set of webpages, to feed the content into a pipeline for
> text analysis etc. (e.g. Nutch, Tika, GATE, ...)
>
> We are unclear about what setup/version/approach to use for this. To be
> honest, the plethora of snippets of (outdated?) docs don't help in getting
> a clear view of things.
>
> The major hurdle seems to be flexible access to the crawled content,
> both from a search (mentions of certain words) and from a systematic (e.g.
> database queries to process pages in batch) point of view.
> Next to Solr queries, the only way seems to be dumping the segments with the
> SegmentReader and processing those.
> But access to the segments seems cumbersome, slow, and not very flexible to
> integrate into a larger setup.
>
> I was happy to see the Gora access to e.g. MySQL in Nutch 2.0, but now that
> seems to all have been side-tracked. I got crawled pages into MySQL in 15
> minutes, which is great! I don't see what the alternative for a setup
> like that is in Nutch 1.4?
>
> Alternatives to write to MySQL from Nutch 1.4 seem less straightforward,
> as mentioned (extending Nutch so the NutchPage gets written to Solr and
> diverting to MySQL...? There must be a better way.)
>
> Could somebody with experience in these kinds of setups advise in what
> direction we should consider going?
>
> I would like a flexible setup where Nutch can run continuously, being fed
> with new seed URLs over time, with flexible and efficient access to the
> crawled results so we can integrate it into a larger setup.
>
> thanks!
>
> pvremort

-- 
Markus Jelsma - CTO - Openindex
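Regarding the SegmentReader dump route in the quoted message above, a minimal sketch of dumping one segment in Nutch 1.x looks like this (the segment timestamp and output directory are placeholders):

```shell
# Dump parsed text from a single segment to a text file; the -no* flags
# skip the raw fetch/generate data so the dump stays smaller.
bin/nutch readseg -dump crawl/segments/20120309123456 segdump \
  -nocontent -nofetch -nogenerate
```

This produces a plain-text dump under segdump/, which confirms the point above that segment access works but is cumbersome to integrate into a larger pipeline.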

