Please see: http://wiki.apache.org/nutch/NutchFileFormats
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: sanjay singh <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, October 1, 2015 at 11:22 PM
To: "[email protected]" <[email protected]>
Subject: Apache Nutch Output structure

>Hi,
>I am trying to crawl a certain set of websites using Apache Nutch. I
>configured Nutch with the required parameters. After crawling I got various
>segments as output, which I merged into one segment.
>But I am still unable to make sense of the file structure in the
>output and the meaning associated with it.
>In the merged segment I got the following directories:
>content
>crawl_fetch
>crawl_generate
>crawl_parse
>parse_data
>parse_text
>
>Can someone please explain the significance of these directories or point
>me to documentation that explains them in detail?
>
>
>--
>Regards,
>Sanjay Singh, PICT Pune

