Yes. You can build a plugin that implements a parser. Check the wiki [1] to get started. If you intend to write a parser for an exotic MIME type, consider contributing to Apache Tika.
What exactly are you trying to accomplish? There may be an easier method.

[1]: http://wiki.apache.org/nutch/PluginCentral

> I would like to access it and run my own parser / analyzer if
> necessary. can I read this segment data?
>
> Best
>
> On Sun, Jul 10, 2011 at 9:08 PM, Markus Jelsma
> > <[email protected]> wrote:
> > Well, the raw data is stored inside the segment. Without it there would
> > be nothing to parse. What do you want to do with it.
> >
> >> Hi C.B.,
> >>
> >> Can you please expand on this description?
> >>
> >> On Sun, Jul 10, 2011 at 11:52 AM, Cam Bazz <[email protected]> wrote:
> >> > Hello All,
> >> >
> >> > Is there a way to save the plain htmls from the crawl? Or is this
> >> > already stored in segments dir?
> >> >
> >> > Best Regards,
> >> > -C.B.

