Have a look at Behemoth [https://github.com/jnioche/behemoth]. It can take Nutch segments as input, process the documents with UIMA on Hadoop, and generate vectors for Mahout.
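For concreteness, here is a rough sketch of how the Behemoth stages could be chained on the command line. The jar names, driver class names, and HDFS paths below are assumptions from memory and should be checked against the Behemoth README before use:

```shell
# Sketch only: jar names, driver classes and paths are placeholders; check
# them against the Behemoth documentation before running anything.

# 1. Convert Nutch segments into a Behemoth corpus (a SequenceFile of
#    BehemothDocuments that keeps the raw content of each document).
hadoop jar behemoth-io-*-job.jar \
  com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob \
  crawl/segments/20120120000000 behemoth/corpus

# 2. Annotate the documents with a UIMA pipeline packaged as a PEAR file.
hadoop jar behemoth-uima-*-job.jar \
  com.digitalpebble.behemoth.uima.UIMADriver \
  behemoth/corpus behemoth/annotated /path/to/pipeline.pear

# 3. Generate sparse vectors that Mahout's classifiers and clusterers
#    can consume for training and classification.
hadoop jar behemoth-mahout-*-job.jar \
  com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth \
  -i behemoth/annotated -o behemoth/vectors
```

From there the vectors can be fed into Mahout training jobs, and the annotated corpus can be pushed to Solr for indexing, which matches the annotate / classify / index pipeline described below.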
We know Mahout. But I think we still need the content of each document. We
would like to annotate the documents retrieved by Nutch using UIMA, then
classify them (probably using Mahout) and index them. So I think we need to
take the content of each document, annotate it, classify it and index it.
Am I wrong?

2012/1/21 Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>

> Hi Adriana,
>
> Well, you may wish to fetch the docs with Nutch, store them in HDFS, then
> run Mahout [1] jobs on them. This would be a more logical pipeline, using
> interoperable technologies, and I would hope that Mahout would be able to
> sort out your classification as well. Maybe you can get in touch with the
> Mahout guys; they love to hear about new problems in this area, and they
> also have some classification algorithms which you can use out of the
> box. You should then be able to get the data into Solr for indexing and
> searching.
>
> [1] http://mahout.apache.org/
>
> On Sat, Jan 21, 2012 at 11:02 AM, Adriana Farina <
> adriana.farin...@gmail.com> wrote:
>
> > Yes, you understood perfectly. :)
> >
> > Your question is absolutely reasonable, but the company I work for
> > wants both things: to index the documents and also to have them stored
> > on a hard drive. The reason for this choice is that before indexing the
> > documents, we need to classify them in a certain way with certain
> > algorithms. In broad terms, the pipeline we want to realize is:
> > retrieve the documents with Nutch, annotate and classify them, and
> > then index them. There is probably a smarter way to do this than
> > storing the documents on a hard disk, but at the moment I can't figure
> > out what else I could do. Can you help me?
> >
> > Thank you very much!
> >
> > 2012/1/20 Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> >
> > > I'm not sure if I'm understanding you here. You are not wanting to
> > > index the documents, but merely wanting to have the documents stored
> > > on your hard disk? What is the reasoning behind this?
> > >
> > > Thanks
> > >
> > > On Fri, Jan 20, 2012 at 9:48 AM, Adriana Farina
> > > <adriana.farin...@gmail.com> wrote:
> > >
> > > > I forgot to write that I'm using Nutch 1.3.
> > > >
> > > > 2012/1/20 Adriana Farina <adriana.farin...@gmail.com>
> > > >
> > > > > Hello,
> > > > >
> > > > > I have a problem I'm not able to solve, though I've googled
> > > > > around. I have crawled a set of web pages containing documents
> > > > > of different types (pdf, doc, ...) and I've configured Nutch to
> > > > > parse all the documents it finds. Now I would like to store the
> > > > > documents contained in the segments on my hard disk. How can I
> > > > > do that?
> > > > >
> > > > > Thank you.
>
> --
> *Lewis*

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
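On the original question in the thread (getting the documents out of the segments and onto local disk): Nutch 1.x ships a `readseg` tool that can dump a segment's contents. A minimal sketch, with the segment path as a placeholder:

```shell
# Dump everything a segment holds (fetched content, parse text, parse data)
# into a plain-text file under dump_dir/dump. The segment path below is a
# placeholder; substitute your own segment directory.
bin/nutch readseg -dump crawl/segments/20120120000000 dump_dir

# Restrict the dump to the raw fetched content only by suppressing the
# other record types:
bin/nutch readseg -dump crawl/segments/20120120000000 dump_dir \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

Note that `readseg -dump` writes a text rendering of the records, not the original binary files. To recover the raw bytes of each PDF or DOC, one would instead read the segment's `content` SequenceFile directly (e.g. with Hadoop's `SequenceFile.Reader` and Nutch's `Content` class) and write each record's content bytes out to a separate file.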