Guys

Just to make sure there is no misunderstanding: the detection of the
MimeType is done BEFORE the parsing step, and it is what allows the parsing
step to determine which parser to use. The mimetype detection uses Tika, *and*
there is also a universal parser, parse-tika (a.k.a. the Tika
wrapper). These are two different things: you don't need to use
parse-tika and can rely on other plugins instead.

Now what you can do is write a Parser and associate it with the
mime-types of your choice: see conf/parse-plugins.xml and how to override
parse-tika for a given mimetype. Another approach is to implement an
HtmlParseFilter, which will be called by parse-tika (assuming it is
activated) and from which you can access the Content and store the base64 in
the parse-metadata (which you can index with the plugin index-metadata).
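
Either way, the resize-and-encode step itself does not depend on Nutch or
Tika. Below is a minimal sketch of just that step, not a definitive
implementation. Assumptions: Java 8+ (for java.util.Base64), PNG as the
output format, and the names ThumbnailEncoder and toBase64Thumbnail, which
are made up for illustration. From a Parser or an HtmlParseFilter you would
presumably pass it the raw bytes of the fetched Content and put the returned
string in the parse-metadata.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper: not part of Nutch or Tika.
final class ThumbnailEncoder {

    private ThumbnailEncoder() {
    }

    /**
     * Decodes raw image bytes, scales them to width x height (no aspect-ratio
     * handling in this sketch) and returns the thumbnail as a Base64 PNG string.
     */
    static String toBase64Thumbnail(byte[] raw, int width, int height) throws IOException {
        BufferedImage source = ImageIO.read(new ByteArrayInputStream(raw));
        if (source == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IOException("Unsupported or corrupt image data");
        }

        BufferedImage thumb = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = thumb.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Draw the source image scaled into the thumbnail bounds.
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(thumb, "png", out);
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory PNG so the example needs no external file.
        BufferedImage img = new BufferedImage(64, 48, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ImageIO.write(img, "png", buf);

        String encoded = toBase64Thumbnail(buf.toByteArray(), 16, 12);
        System.out.println("Base64 length: " + encoded.length());
    }
}
```

Keeping this logic in a plain static method means it can be tested on its
own, outside of a crawl, before it is wired into a plugin.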

HTH

Julien

On 27 June 2012 22:07, Jorge Luis Betancourt Gonzalez
<jlbetanco...@uci.cu> wrote:

> Hi,
>
> I agree with you, and it is a great idea to rely on Tika to parse the files,
> but in this particular case, when all I want to do is encode the content
> into base64, should I write a custom parser for Tika and rely on the
> parse-tika plugin to do its magic?
>
> Jorge
>
> ----- Original message -----
> From: "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
> To: user@nutch.apache.org
> Sent: Wednesday, 27 June 2012 16:55:12
> Subject: Re: Problema with NullPointerException on custom Parser
>
> Hi,
>
> I think you are partly correct.
>
> The core Nutch code itself doesn't do any parsing as such. All parsing
> is delegated to external parsing libraries.
>
> Basically we need to define a parser to do the parsing. Using Tika as
> a wrapper for mimeType detection and subsequent parsing saves us a bit
> of overhead.
>
> Lewis
>
> On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez
> <jlbetanco...@uci.cu> wrote:
> > Hi Lewis, thank you for the reply. Is it mandatory to write a wrapper around
> > Tika? I thought this was optional, since I don't really parse the content
> > looking for anything: I only get the content, transform it into an Image
> > object, resize it, and then encode it with base64 to store in the Solr
> > backend.
> >
> > So I thought that all this processing could be done in the getParse method.
> >
> > Is my assumption correct, or is it mandatory to write my desired logic using
> > Tika?
> >
> > ----- Original message -----
> > From: "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
> > To: user@nutch.apache.org
> > Sent: Wednesday, 27 June 2012 16:33:01
> > Subject: Re: Problema with NullPointerException on custom Parser
> >
> > Hi Jorge,
> >
> > It doesn't look like you're actually using Tika as a wrapper for your
> > custom parser at all...
> >
> > You would need to specify the correct Tika config by calling
> > tikaConfig.getParser
> >
> > hth
> >
> > On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez
> > <jlbetanco...@uci.cu> wrote:
> >> Hi all:
> >>
> >> I'm working on a custom parser plugin to generate thumbnails from
> >> images fetched with Nutch 1.4. I'm doing this because the thumbnails will be
> >> converted into a base64-encoded string and stored in a Solr backend.
> >>
> >> So I basically wrote a custom parser (to which I send all PNG images,
> >> for example). I enabled the plugin (image-thumbnail) in nutch-site.xml and
> >> set some custom properties to load the width and height of the thumbnail.
> >> I also set the alias in parse-plugins.xml and set the plugin to handle
> >> image/png files, also in that file.
> >>
> >> The plugin is being loaded, but every time I get a PNG image to parse I
> >> get this:
> >>
> >> Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException
> >>        at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388)
> >>        at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397)
> >>        at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296)
> >>        at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262)
> >>        at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234)
> >>        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:119)
> >>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
> >>        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:86)
> >>        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:42)
> >>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>
> >> The thing is that I have put some log messages inside the getParse()
> >> method, but none of these messages are being logged in the hadoop.log file, so
> >> as far as I can tell the method is not being executed.
> >>
> >> Does anyone have any idea what I'm doing wrong?
> >>
> >> P.S: I've attached the source of the ImageThumbnailParser.
> >>
> >> Greetings!
> >>
> >>
> >> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
> >> INFORMATICAS...
> >> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
> >>
> >> http://www.uci.cu
> >> http://www.facebook.com/universidad.uci
> >> http://www.flickr.com/photos/universidad_uci
> >
> >
> >
> > --
> > Lewis
>
>
>
> --
> Lewis
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
