Re: Apache Nutch Output structure

Mattmann, Chris A (3980) Thu, 01 Oct 2015 23:33:49 -0700

Please see:

http://wiki.apache.org/nutch/NutchFileFormats
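
As a quick orientation before reading that page, here is a minimal Python sketch that lists the six subdirectories of a segment with a one-line meaning for each, and reports which ones are present. The descriptions are paraphrased from the Nutch tutorial and should be checked against the wiki page above; the helper name `describe_segment` and the default path `crawl/segments/merged` are assumptions for illustration, not part of Nutch.

```python
import os
import sys

# One-line summaries paraphrased from the Nutch tutorial; a rough guide only.
SEGMENT_DIRS = {
    "crawl_generate": "set of URLs selected to be fetched",
    "crawl_fetch": "fetch status of each URL",
    "content": "raw content retrieved from each URL",
    "parse_text": "parsed text of each URL",
    "parse_data": "outlinks and metadata parsed from each URL",
    "crawl_parse": "outlink URLs, used to update the crawldb",
}


def describe_segment(segment_path):
    """Return (name, present, description) for each expected subdirectory."""
    return [
        (name, os.path.isdir(os.path.join(segment_path, name)), description)
        for name, description in SEGMENT_DIRS.items()
    ]


if __name__ == "__main__":
    # Hypothetical default path; pass your merged segment directory instead.
    path = sys.argv[1] if len(sys.argv) > 1 else "crawl/segments/merged"
    for name, present, description in describe_segment(path):
        marker = "found  " if present else "missing"
        print(f"{marker}  {name:15s} {description}")
```

To look at the actual records rather than just the layout, Nutch's own `readseg` command can dump a segment to plain text.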


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: sanjay singh <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, October 1, 2015 at 11:22 PM
To: "[email protected]" <[email protected]>
Subject: Apache Nutch Output structure

>Hi,
>I am trying to crawl a set of websites using Apache Nutch. I
>configured Nutch with the required parameters. After crawling I got several
>segments as output, which I merged into one segment.
>However, I still cannot make sense of the file structure of the
>output or what each part of it means.
>The merged segment contains the following directories:
>content
>crawl_fetch
>crawl_generate
>crawl_parse
>parse_data
>parse_text
>
>Can someone please explain the significance of these directories, or point
>me to documentation that explains them in detail?
>
>
>-- 
>Regards,
>Sanjay Singh, PICT Pune
