Have a look at Behemoth [https://github.com/jnioche/behemoth]. It can take Nutch segments as input, process the documents with UIMA on Hadoop, and generate vectors for Mahout.
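For concreteness, here is a rough sketch of how the Behemoth stages could be chained on the command line. The jar names, driver class names, and HDFS paths below are assumptions from memory and should be checked against the Behemoth README before use:

```shell
# Sketch only: jar names, driver classes and paths are placeholders; check
# them against the Behemoth documentation before running anything.

# 1. Convert Nutch segments into a Behemoth corpus (a SequenceFile of
#    BehemothDocuments that keeps the raw content of each document).
hadoop jar behemoth-io-*-job.jar \
  com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob \
  crawl/segments/20120120000000 behemoth/corpus

# 2. Annotate the documents with a UIMA pipeline packaged as a PEAR file.
hadoop jar behemoth-uima-*-job.jar \
  com.digitalpebble.behemoth.uima.UIMADriver \
  behemoth/corpus behemoth/annotated /path/to/pipeline.pear

# 3. Generate sparse vectors that Mahout's classifiers and clusterers
#    can consume for training and classification.
hadoop jar behemoth-mahout-*-job.jar \
  com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth \
  -i behemoth/annotated -o behemoth/vectors
```

From there the vectors can be fed into Mahout training jobs, and the annotated corpus can be pushed to Solr for indexing, which matches the annotate / classify / index pipeline described below.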
We know Mahout. But I think we still need the content of each document. We
would like to annotate the documents retrieved by Nutch using UIMA, then
classify them (probably using Mahout) and index them. So I think we need to
take the content of each document, annotate it, classify it and index it.
Am I wrong?

2012/1/21 Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>

> Hi Adriana,
>
> Well, you may wish to fetch the docs with Nutch, store them in HDFS, then
> run Mahout [1] jobs on them. This would be a more logical pipeline, using
> interoperable technologies, and I would hope that Mahout would be able to
> sort out your classification as well. Maybe you can get in touch with the
> Mahout guys; they love to hear about new problems in this area, and they
> also have some classification algorithms which you can use out of the
> box. You should then be able to get the data into Solr for indexing and
> searching.
>
> [1] http://mahout.apache.org/
>
> On Sat, Jan 21, 2012 at 11:02 AM, Adriana Farina <
> adriana.farin...@gmail.com> wrote:
>
> > Yes, you understood perfectly. :)
> >
> > Your question is absolutely reasonable, but the company I work for
> > wants both things: to index the documents and also to have them stored
> > on a hard drive. The reason for this choice is that before indexing the
> > documents, we need to classify them in a certain way with certain
> > algorithms. In broad terms, the pipeline we want to realize is:
> > retrieve the documents with Nutch, annotate and classify them, and
> > then index them. There is probably a smarter way to do this than
> > storing the documents on a hard disk, but at the moment I can't figure
> > out what else I could do. Can you help me?
> >
> > Thank you very much!
> >
> > 2012/1/20 Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> >
> > > I'm not sure if I'm understanding you here. You are not wanting to
> > > index the documents, but merely wanting to have the documents stored
> > > on your hard disk? What is the reasoning behind this?
> > >
> > > Thanks
> > >
> > > On Fri, Jan 20, 2012 at 9:48 AM, Adriana Farina
> > > <adriana.farin...@gmail.com> wrote:
> > >
> > > > I forgot to write that I'm using Nutch 1.3.
> > > >
> > > > 2012/1/20 Adriana Farina <adriana.farin...@gmail.com>
> > > >
> > > > > Hello,
> > > > >
> > > > > I have a problem I'm not able to solve, though I've googled
> > > > > around. I have crawled a set of web pages containing documents
> > > > > of different types (pdf, doc, ...) and I've configured Nutch to
> > > > > parse all the documents it finds. Now I would like to store the
> > > > > documents contained in the segments on my hard disk. How can I
> > > > > do that?
> > > > >
> > > > > Thank you.
>
> --
> *Lewis*

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
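On the original question in the thread (getting the documents out of the segments and onto local disk): Nutch 1.x ships a `readseg` tool that can dump a segment's contents. A minimal sketch, with the segment path as a placeholder:

```shell
# Dump everything a segment holds (fetched content, parse text, parse data)
# into a plain-text file under dump_dir/dump. The segment path below is a
# placeholder; substitute your own segment directory.
bin/nutch readseg -dump crawl/segments/20120120000000 dump_dir

# Restrict the dump to the raw fetched content only by suppressing the
# other record types:
bin/nutch readseg -dump crawl/segments/20120120000000 dump_dir \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

Note that `readseg -dump` writes a text rendering of the records, not the original binary files. To recover the raw bytes of each PDF or DOC, one would instead read the segment's `content` SequenceFile directly (e.g. with Hadoop's `SequenceFile.Reader` and Nutch's `Content` class) and write each record's content bytes out to a separate file.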