As I said, use the Tika parser and implement your own HtmlParseFilter - it
will get called on the XHTML representation of the doc, whatever its mimetype.
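
A rough sketch of the URL-field part, in case it helps. It assumes the field
you want is a query parameter, which the thread does not actually say, and
UrlFieldExtractor / extract are made-up names, not Nutch API. In a real plugin
you would call something like this from the filter method of your
HtmlParseFilter implementation, passing it the document URL, and store the
result in the parse metadata.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper (not part of Nutch): pulls one query parameter out of a URL.
final class UrlFieldExtractor {

    private UrlFieldExtractor() {}

    // Returns the decoded value of the first query parameter named `field`,
    // or Optional.empty() if the URL is malformed or has no such parameter.
    static Optional<String> extract(String url, String field) {
        try {
            String rawQuery = URI.create(url).getRawQuery();
            if (rawQuery == null) {
                return Optional.empty();
            }
            for (String pair : rawQuery.split("&")) {
                int eq = pair.indexOf('=');
                String name = eq < 0 ? pair : pair.substring(0, eq);
                if (decode(name).equals(field)) {
                    // A parameter without '=' is treated as having an empty value.
                    return Optional.of(eq < 0 ? "" : decode(pair.substring(eq + 1)));
                }
            }
            return Optional.empty();
        } catch (IllegalArgumentException e) {
            // Thrown by URI.create on bad syntax and by URLDecoder on bad % escapes.
            return Optional.empty();
        }
    }

    // The Charset overload of URLDecoder.decode needs Java 10 or later.
    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "id" is only an example field name.
        System.out.println(
            extract("http://example.com/view?id=42&lang=en", "id").orElse("(none)"));
    }
}
```

Keeping the extraction in a plain static method means it can be tested without
a Nutch runtime, and the filter itself stays a thin wrapper around it.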

J.

On 8 March 2012 07:29, nutch buddy <[email protected]> wrote:

> I looked at HtmlParseFilter.
> I think that's exactly what I need, but for other file types as well,
> not just html.
> Any reason why this behaviour was implemented only for html files?
>
> I'm thinking of extending this implementation so it would be available for
> other types. Any advice on that?
>
> On Wed, Mar 7, 2012 at 11:14 PM, Ferdy Galema <[email protected]
> >wrote:
>
> > Hi,
> >
> > Do you mean running multiple parsers in a single parse action? That is
> > currently only possible for html types. Take a look at HtmlParseFilter for
> > that. You can chain multiple parsers for a single url, in addition to
> > regular html parsing. For other types it's not possible.
> >
> > If this is about running a parse implementation on all urls regardless of
> > mimetype, you have to change the parser mappings in parse-plugins.xml
> > and the parser's plugin.xml. But again there is only support for running
> > one Parser on a single document.
> >
> > Ferdy.
> >
> > On Wed, Mar 7, 2012 at 2:34 PM, [email protected] <
> > [email protected]
> > > wrote:
> >
> > > Hi
> > > I've looked at nutch's code in ParseUtil and it seems that it was designed
> > > so only one parser is eventually activated on a single url.
> > > What's the reason for this?
> > > What should I do if I want, in addition to the existing parsers, to add a
> > > parser that will get a certain field out of the url, and run this behaviour
> > > on all the urls?
> > > Do I have to add this code to all the parsers?
> > >
> > >
> > > thanks.
> > >
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Multiple-parsers-tp3806721p3806721.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
