You can use: image/(bmp|gif|jpeg|png|tiff)
in your plugin.xml; this will cover most, if not all, images.

On Jun 28, 2012, at 19:40, Jorge Luis Betancourt Gonzalez wrote:

> Hi Julien!
>
> Thank you for your explanation. I realize that Tika does indeed do
> mimetype detection. I was just looking for a way to ensure that the
> plugin I'm developing only processes images, as a fail-safe in case
> conf/parse-plugins.xml is misconfigured. I'm now thinking that perhaps
> I can read the image types allowed to be fetched from the Nutch
> configuration and use them as a filter. Do you think this would be
> possible?
>
> ----- Original message -----
> From: "Julien Nioche" <[email protected]>
> To: [email protected]
> Sent: Thursday, June 28, 2012 12:37:29
> Subject: Re: Problema with NullPointerException on custom Parser
>
> Guys
>
> Just to make sure there is no misunderstanding: the detection of the
> MimeType is done BEFORE the parsing step, and it is what allows the
> parsing step to determine which parser to use. The mimetype detection
> uses Tika *and* there is a universal parser, parse-tika (a.k.a. the
> Tika Wrapper). These are two different things: you don't need to use
> parse-tika and can rely on other plugins.
>
> Now what you can do is write a Parser and associate it with the
> mime-types of your choice: see conf/parse-plugins.xml and how to
> override parse-tika for a given mimetype.
> Another approach is to implement an
> HtmlParseFilter, which will be called by parse-tika (assuming it is
> activated), from where you can access the Content and store the base64
> in the parse-metadata (which you can index with the index-metadata
> plugin).
>
> HTH
>
> Julien
>
> On 27 June 2012 22:07, Jorge Luis Betancourt Gonzalez
> <[email protected]> wrote:
>
>> Hi,
>>
>> I agree with you, and relying on Tika to parse the files is a great
>> idea, but in this particular case, when all I want to do is encode the
>> content into base64, should I write a custom parser for Tika and rely
>> on the parse-tika plugin to do its magic?
>>
>> Jorge
>>
>> ----- Original message -----
>> From: "Lewis John Mcgibbney" <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, June 27, 2012 16:55:12
>> Subject: Re: Problema with NullPointerException on custom Parser
>>
>> Hi,
>>
>> I think you are partly correct.
>>
>> The core Nutch code itself doesn't do any parsing as such. All parsing
>> is delegated to external parsing libraries.
>>
>> Basically we need to define a parser to do the parsing; using Tika as
>> a wrapper for mimeType detection and subsequent parsing saves us a bit
>> of overhead.
>>
>> Lewis
>>
>> On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez
>> <[email protected]> wrote:
>>> Hi Lewis, thank you for the reply. Is it mandatory to write a wrapper
>>> around Tika? I thought this was optional, since I don't really parse
>>> the content looking for anything: I only get the content, transform
>>> it into an Image object, resize it, and then encode it with base64 to
>>> store it in the Solr backend.
>>>
>>> So I thought that all this processing could be done in the getParse
>>> method.
>>>
>>> Is my assumption correct, or is it mandatory to write my desired
>>> logic using Tika?
>>>
>>> ----- Original message -----
>>> From: "Lewis John Mcgibbney" <[email protected]>
>>> To: [email protected]
>>> Sent: Wednesday, June 27, 2012 16:33:01
>>> Subject: Re: Problema with NullPointerException on custom Parser
>>>
>>> Hi Jorge,
>>>
>>> It doesn't look like you're actually using Tika as a wrapper for
>>> your custom parser at all...
>>>
>>> You would need to specify the correct Tika config by calling
>>> tikaConfig.getParser
>>>
>>> hth
>>>
>>> On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez
>>> <[email protected]> wrote:
>>>> Hi all:
>>>>
>>>> I'm working on a custom parser plugin to generate thumbnails from
>>>> images fetched with Nutch 1.4. I'm doing this because the thumbnails
>>>> will be converted into a base64-encoded string and stored in a Solr
>>>> backend.
>>>>
>>>> So I basically wrote a custom parser (to which I send all png
>>>> images, for example). I enabled the plugin (image-thumbnail) in
>>>> nutch-site.xml and set some custom properties to load the width and
>>>> height of the thumbnail. I also set the alias in parse-plugins.xml
>>>> and configured the plugin to handle image/png files, also in that
>>>> file.
>>>>
>>>> The plugin is being loaded, but every time I get a png image to
>>>> parse I get this:
>>>>
>>>> Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException
>>>>     at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388)
>>>>     at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397)
>>>>     at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296)
>>>>     at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262)
>>>>     at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234)
>>>>     at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:119)
>>>>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
>>>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:86)
>>>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:42)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>
>>>> The thing is that I have put some log messages inside the getParse()
>>>> method, but none of these messages are being logged in the
>>>> hadoop.log file, so as far as I can tell the method is not being
>>>> executed.
>>>>
>>>> Does anyone have any idea what I'm doing wrong?
>>>>
>>>> P.S.: I've attached the source of the ImageThumbnailParser.
>>>>
>>>> Greetings!
>>>>
>>>> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
>>>> INFORMATICAS...
>>>> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>>>>
>>>> http://www.uci.cu
>>>> http://www.facebook.com/universidad.uci
>>>> http://www.flickr.com/photos/universidad_uci
>>>
>>> --
>>> Lewis
>>
>> --
>> Lewis
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
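
For the fail-safe Jorge asks about, here is a minimal sketch of a defensive check that a parser could run before touching the bytes. It reuses the image/(bmp|gif|jpeg|png|tiff) expression from the top of this thread. The class and method names (ImageMimeGuard, isSupportedImage) are hypothetical and are not part of Nutch. In a real plugin the string would presumably come from the fetched Content, for example via Content.getContentType(), but that wiring is an assumption and is not shown here.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: a defensive MIME-type check
// that a parser plugin could run before doing any image processing.
class ImageMimeGuard {
    // Same expression as the one suggested for plugin.xml in this thread.
    private static final Pattern IMAGE_TYPES =
            Pattern.compile("image/(bmp|gif|jpeg|png|tiff)");

    // Returns true only when the content type, with any parameters such as
    // "; charset=..." stripped, matches one of the listed image types.
    static boolean isSupportedImage(String contentType) {
        if (contentType == null) {
            return false;
        }
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // matches() requires the whole string to match, not just a prefix.
        return IMAGE_TYPES.matcher(base).matches();
    }
}
```

A parser could call ImageMimeGuard.isSupportedImage(...) first and return early for anything that is not an image, so that a mistake in conf/parse-plugins.xml does not send other content through the thumbnail code.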

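
The processing Jorge describes (take the raw content, turn it into an image, resize it, and encode it as base64) can be sketched with the standard library alone. This is not the attached ImageThumbnailParser, which is not reproduced in this thread. ThumbnailEncoder and toBase64Thumbnail are hypothetical names, the output format (PNG) is an assumption, and java.util.Base64 assumes Java 8 or later.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper, not taken from the attached ImageThumbnailParser.
class ThumbnailEncoder {
    // Decodes raw image bytes, scales them to width x height, re-encodes
    // the result as PNG and returns it as a base64 string.
    static String toBase64Thumbnail(byte[] imageBytes, int width, int height)
            throws IOException {
        BufferedImage source = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (source == null) {
            // ImageIO.read returns null when no registered reader
            // understands the data.
            throw new IOException("Unsupported or corrupt image data");
        }
        BufferedImage thumb =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = thumb.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Draws the source scaled to fill the whole thumbnail; the
            // aspect ratio is not preserved in this sketch.
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(thumb, "png", out);
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }
}
```

The returned string is what would be stored in the parse metadata, as Julien suggests. The width and height would come from the custom properties Jorge mentions setting in nutch-site.xml.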
