Guys,

Just to make sure there is no misunderstanding: the detection of the MIME type is done BEFORE the parsing step, and it is what allows the parsing step to determine which parser to use. The MIME type detection uses Tika, *and* there is a universal parser, parse-tika (a.k.a. the Tika wrapper). These are two different things: you don't need to use parse-tika and can rely on other plugins.
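To make the two steps concrete, here is a minimal, self-contained Java sketch of "detect first, then pick a parser". It is an illustration only and does not use Nutch or Tika classes: MimeDispatchSketch, SimpleParser, detectMimeType, register and select are hypothetical names, the detection is a toy check of the PNG signature standing in for Tika, and the "*" entry plays the role of a catch-all parser such as parse-tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MimeDispatchSketch {

    /** Hypothetical stand-in for a parser plugin; NOT the Nutch Parser interface. */
    public interface SimpleParser {
        String parse(byte[] content);
    }

    // Every PNG file starts with this fixed 8-byte signature.
    private static final byte[] PNG_MAGIC =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Step 1: detect the MIME type from the content (toy replacement for Tika detection). */
    public static String detectMimeType(byte[] content) {
        if (content.length >= PNG_MAGIC.length) {
            boolean isPng = true;
            for (int i = 0; i < PNG_MAGIC.length; i++) {
                if (content[i] != PNG_MAGIC[i]) {
                    isPng = false;
                    break;
                }
            }
            if (isPng) {
                return "image/png";
            }
        }
        return "application/octet-stream";
    }

    // MIME type -> parser, loosely analogous to the mappings in parse-plugins.xml.
    private final Map<String, SimpleParser> parsers = new LinkedHashMap<>();

    public void register(String mimeType, SimpleParser parser) {
        parsers.put(mimeType, parser);
    }

    /** Step 2: choose the parser for the detected type, falling back to the "*" entry. */
    public SimpleParser select(String mimeType) {
        SimpleParser parser = parsers.get(mimeType);
        return parser != null ? parser : parsers.get("*");
    }

    public static void main(String[] args) {
        MimeDispatchSketch dispatch = new MimeDispatchSketch();
        dispatch.register("image/png", content -> "custom thumbnail parser");
        dispatch.register("*", content -> "catch-all parser");

        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        String type = detectMimeType(png);                   // detection happens first
        System.out.println(type + " -> " + dispatch.select(type).parse(png));
    }
}
```

The point of the sketch is only the ordering: the parser never decides the type, it is selected because of the type that was already detected.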
Now what you can do is write a Parser and associate it with the MIME types of your choice: see conf/parse-plugins.xml and how to override parse-tika for a given MIME type. Another approach is to implement an HtmlParseFilter, which will be called by parse-tika (assuming it is activated) and from which you can access the Content and store the base64 in the parse metadata (which you can index with the index-metadata plugin).

HTH

Julien

On 27 June 2012 22:07, Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu> wrote:

> Hi,
>
> I agree with you, and it is a great idea to rely on Tika to parse the files,
> but in this particular case, when all I want to do is encode the content
> into base64, should I write a custom parser for Tika and rely on the
> parse-tika plugin to do its magic?
>
> Jorge
>
> ----- Original Message -----
> From: "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
> To: user@nutch.apache.org
> Sent: Wednesday, 27 June 2012 16:55:12
> Subject: Re: Problema with NullPointerException on custom Parser
>
> Hi,
>
> I think you are partly correct.
>
> The core Nutch code itself doesn't do any parsing as such. All parsing
> is delegated to external parsing libraries.
>
> Basically we need to define a parser to do the parsing; using Tika as
> a wrapper for mimeType detection and subsequent parsing saves us a bit
> of overhead.
>
> Lewis
>
> On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez
> <jlbetanco...@uci.cu> wrote:
> > Hi Lewis, thank you for the reply. Is it mandatory to write a wrapper around
> > Tika? I thought this was optional, since I don't really parse the content
> > searching for anything; I only get the content, transform it into an Image
> > object, resize it, and then encode it with base64 to store in the Solr
> > backend.
> >
> > So I thought that all this processing could be done in the getParse method.
> >
> > Is my assumption correct, or is it mandatory to write my desired logic using
> > Tika?
> >
> > ----- Original Message -----
> > From: "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
> > To: user@nutch.apache.org
> > Sent: Wednesday, 27 June 2012 16:33:01
> > Subject: Re: Problema with NullPointerException on custom Parser
> >
> > Hi Jorge,
> >
> > It doesn't look like you're actually using Tika as a wrapper for your
> > custom parser at all...
> >
> > You would need to specify the correct Tika config by calling
> > tikaConfig.getParser
> >
> > hth
> >
> > On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez
> > <jlbetanco...@uci.cu> wrote:
> >> Hi all:
> >>
> >> I'm working on a custom parser plugin to generate thumbnails from
> >> images fetched with Nutch 1.4. I'm doing this because the thumbnails will be
> >> converted into a base64-encoded string and stored in a Solr backend.
> >>
> >> So I basically wrote a custom parser (to which I send all png images,
> >> for example). I enabled the plugin (image-thumbnail) in nutch-site.xml and
> >> set some custom properties to load the width and height of the thumbnail.
> >> I also set the alias in parse-plugins.xml and set the plugin to handle
> >> image/png files, also in this file.
> >>
> >> The plugin is being loaded, but every time I get a png image to parse I
> >> get this:
> >>
> >> Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException
> >>         at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388)
> >>         at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397)
> >>         at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296)
> >>         at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262)
> >>         at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234)
> >>         at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:119)
> >>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
> >>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:86)
> >>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:42)
> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>
> >> The thing is that I have put some log messages inside the getParse()
> >> method, but none of these messages are being logged in the hadoop.log file, so
> >> as far as I can tell the method is not being executed.
> >>
> >> Does anyone have any idea what I'm doing wrong?
> >>
> >> P.S.: I've attached the source of the ImageThumbnailParser.
> >>
> >> Greetings!
> >>
> >> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
> >> INFORMATICAS...
> >> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
> >>
> >> http://www.uci.cu
> >> http://www.facebook.com/universidad.uci
> >> http://www.flickr.com/photos/universidad_uci
> >
> > --
> > Lewis
>
> --
> Lewis

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble