Re: html of the crawled pages.

Markus Jelsma Sun, 10 Jul 2011 14:55:22 -0700

Yes. You can build a plugin that implements a parser. Check the wiki [1] to
get started. If you intend to write a parser for an exotic MIME type, consider
contributing it to Apache Tika.


What exactly are you trying to accomplish? There may be an easier method.

[1]: http://wiki.apache.org/nutch/PluginCentral
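
If you only want to look at the stored raw content outside of Nutch, a
lighter route than a plugin may be to dump a segment to text with the segment
reader (in Nutch 1.x, something like `bin/nutch readseg -dump <segment_dir>
<output_dir>`) and post-process the dump yourself. Below is a minimal Python
sketch of that post-processing step. The record layout it expects (a
"Recno:: N" line followed by a "URL:: ..." line) is an assumption about the
dump format, and `split_records` and `SAMPLE` are hypothetical names used only
for illustration, so check a real dump before relying on it.

```python
import re

# ASSUMPTION: each record in the text dump starts with a "Recno:: N" line
# and carries a "URL:: ..." line. Verify this against a real dump file.
RECORD_START = re.compile(r"^Recno:: \d+\s*$", re.MULTILINE)
URL_LINE = re.compile(r"^URL:: (.+)$", re.MULTILINE)


def split_records(dump_text):
    """Split dump text into a list of (url, record_text) pairs."""
    starts = [m.start() for m in RECORD_START.finditer(dump_text)]
    records = []
    for i, start in enumerate(starts):
        # A record runs until the next record marker or the end of the text.
        end = starts[i + 1] if i + 1 < len(starts) else len(dump_text)
        chunk = dump_text[start:end]
        url_match = URL_LINE.search(chunk)
        url = url_match.group(1).strip() if url_match else None
        records.append((url, chunk))
    return records


# Hypothetical two-record sample standing in for a real dump file.
SAMPLE = """Recno:: 0
URL:: http://example.com/

Content::
<html><body>first</body></html>

Recno:: 1
URL:: http://example.com/a

Content::
<html><body>second</body></html>
"""

if __name__ == "__main__":
    for url, text in split_records(SAMPLE):
        print(url, len(text))
```

Each record's text can then be handed to whatever parser or analyzer you
like. This avoids writing a plugin, at the cost of depending on the text dump
layout rather than on the Nutch API.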


> I would like to access it and run my own parser / analyzer if
> necessary. Can I read this segment data?
> 
> Best
> 
> On Sun, Jul 10, 2011 at 9:08 PM, Markus Jelsma <[email protected]> wrote:
> > Well, the raw data is stored inside the segment. Without it there would
> > be nothing to parse. What do you want to do with it?
> > 
> >> Hi C.B.,
> >> 
> >> Can you please expand on this description?
> >> 
> >> On Sun, Jul 10, 2011 at 11:52 AM, Cam Bazz <[email protected]> wrote:
> >> > Hello All,
> >> > 
> >> > Is there a way to save the plain HTML from the crawl? Or is it
> >> > already stored in the segments dir?
> >> > 
> >> > Best Regards,
> >> > -C.B.
