You can use:

image/(bmp|gif|jpeg|png|tiff)

in your plugin.xml; this will cover most common image types.
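
If you also want a fail-safe inside the plugin itself, the same expression can be checked in code before doing any image processing. This is only a minimal sketch using java.util.regex; the class and method names (ImageMimeFilter, isSupportedImage) are made up for illustration and are not part of Nutch, and the content type is simply passed in as a string rather than read from the fetched content.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class ImageMimeFilter {
    // Same expression as in plugin.xml; matches() applies it to the whole string.
    private static final Pattern IMAGE_TYPES =
        Pattern.compile("image/(bmp|gif|jpeg|png|tiff)");

    /** Returns true only for the image MIME types listed in the pattern. */
    static boolean isSupportedImage(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop optional parameters such as "; charset=..." and normalise case.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return IMAGE_TYPES.matcher(base).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSupportedImage("image/png")); // true
        System.out.println(isSupportedImage("text/html")); // false
    }
}
```

A parser could call such a check first and return early for anything that is not an image, so a wrong mapping in conf/parse-plugins.xml does no harm.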

On Jun 28, 2012, at 19:40, Jorge Luis Betancourt Gonzalez wrote:

> Hi Julien!
> 
> Thank you for your explanation. I realize that Tika does indeed perform
> mimetype detection. I was just looking for a way to ensure that the plugin
> I'm developing only processes images, as a fail-safe in case
> conf/parse-plugins.xml is misconfigured. I'm now thinking that perhaps I
> can read the image types allowed to be fetched from the Nutch configuration
> and use them as a filter. Do you think this would be possible?
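
The configuration-driven filter asked about here could look roughly like the sketch below. It is an assumption-laden illustration: the property name image.thumbnail.mimetypes is hypothetical (not a real Nutch setting), and java.util.Properties stands in for the Hadoop Configuration object that a Nutch plugin would normally be given, so that the example runs on its own.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

class AllowedImageTypes {
    // Hypothetical property name, chosen only for this example.
    static final String KEY = "image.thumbnail.mimetypes";

    /** Parses a comma-separated list of MIME types into a lookup set. */
    static Set<String> load(Properties conf) {
        String raw = conf.getProperty(KEY, "");
        return Arrays.stream(raw.split(","))
            .map(s -> s.trim().toLowerCase(Locale.ROOT))
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "image/png, image/jpeg");
        Set<String> allowed = load(conf);
        System.out.println(allowed.contains("image/png")); // true
        System.out.println(allowed.contains("image/gif")); // false
    }
}
```

With an empty or missing property the set is empty, so nothing is processed unless it has been explicitly allowed.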
> 
> ----- Original Message -----
> From: "Julien Nioche" <[email protected]>
> To: [email protected]
> Sent: Thursday, June 28, 2012 12:37:29
> Subject: Re: Problema with NullPointerException on custom Parser
> 
> Guys
> 
> Just to make sure there is no misunderstanding: the detection of the
> mimetype is done BEFORE the parsing step, and it is what allows the parsing
> step to determine which parser to use. The mimetype detection uses Tika,
> *and* there is a universal parser, parse-tika (a.k.a. the Tika wrapper).
> These are two different things: you don't need to use parse-tika and can
> rely on other plugins.
> 
> Now, what you can do is write a Parser and associate it with the
> mime-types of your choice: see conf/parse-plugins.xml and how to override
> parse-tika for a given mimetype. Another approach is to implement an
> HtmlParseFilter, which will be called by parse-tika (assuming it is
> activated) and from which you can access the Content and store the base64
> in the parse-metadata (which you can index with the index-metadata plugin).
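
Whichever of the two approaches is used, the image-specific step is the same: decode the raw bytes, scale them, and base64-encode the result. The sketch below shows only that step with JDK classes (javax.imageio.ImageIO, java.util.Base64, which needs Java 8 or later). It deliberately uses no Nutch API; ThumbnailEncoder and toBase64Thumbnail are illustrative names, and how the bytes are obtained and where the string is stored is left to the plugin.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import javax.imageio.ImageIO;

class ThumbnailEncoder {
    /** Decodes image bytes, scales them to width x height, returns a base64 PNG. */
    static String toBase64Thumbnail(byte[] imageBytes, int width, int height)
            throws IOException {
        BufferedImage source = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (source == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IOException("not a readable image");
        }
        BufferedImage thumb =
            new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = thumb.createGraphics();
        try {
            // Draw the source scaled into the thumbnail-sized canvas.
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(thumb, "png", out);
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory PNG so the example needs no files.
        BufferedImage img = new BufferedImage(64, 48, BufferedImage.TYPE_INT_ARGB);
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        ImageIO.write(img, "png", png);
        String encoded = toBase64Thumbnail(png.toByteArray(), 16, 12);
        System.out.println(encoded.isEmpty() ? "empty" : "encoded thumbnail");
    }
}
```

This version stretches the image to the requested size; preserving the aspect ratio would need the target height computed from the source dimensions first.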
> 
> HTH
> 
> Julien
> 
> On 27 June 2012 22:07, Jorge Luis Betancourt Gonzalez
> <[email protected]> wrote:
> 
>> Hi,
>> 
>> I agree with you, and it is a great idea to rely on Tika to parse the
>> files. But in this particular case, when all I want to do is encode the
>> content into base64, should I write a custom parser for Tika and rely on
>> the parse-tika plugin to do its magic?
>> 
>> Jorge
>> 
>> ----- Original Message -----
>> From: "Lewis John Mcgibbney" <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, June 27, 2012 16:55:12
>> Subject: Re: Problema with NullPointerException on custom Parser
>> 
>> Hi,
>> 
>> I think you are partly correct.
>> 
>> The core Nutch code itself doesn't do any parsing as such. All parsing
>> is delegated to external parsing libraries.
>> 
>> Basically, we need to define a parser to do the parsing. Using Tika as
>> a wrapper for mimeType detection and subsequent parsing saves us a bit
>> of overhead.
>> 
>> Lewis
>> 
>> On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez
>> <[email protected]> wrote:
>>> Hi Lewis, thank you for the reply. Is it mandatory to write a wrapper
>>> around Tika? I thought this was optional, since I don't really parse the
>>> content searching for anything; I only get the content, transform it
>>> into an Image object, resize it, and then encode it with base64 to store
>>> in the Solr backend.
>>> 
>>> So I thought that all this processing could be done in the getParse
>>> method.
>>> 
>>> Is my assumption correct, or is it mandatory to write my desired logic
>>> using Tika?
>>> 
>>> ----- Original Message -----
>>> From: "Lewis John Mcgibbney" <[email protected]>
>>> To: [email protected]
>>> Sent: Wednesday, June 27, 2012 16:33:01
>>> Subject: Re: Problema with NullPointerException on custom Parser
>>> 
>>> Hi Jorge,
>>> 
>>> It doesn't look like you're actually using Tika as a wrapper for your
>>> custom parser at all...
>>> 
>>> You would need to specify the correct Tika config by calling
>>> tikaConfig.getParser
>>> 
>>> hth
>>> 
>>> On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez
>>> <[email protected]> wrote:
>>>> Hi all:
>>>> 
>>>> I'm working on a custom parser plugin to generate thumbnails from
>>>> images fetched with Nutch 1.4. I'm doing this because the thumbnails
>>>> will be converted into a base64-encoded string and stored in a Solr
>>>> backend.
>>>> 
>>>> So I basically wrote a custom parser (to which I send all png images,
>>>> for example). I enabled the plugin (image-thumbnail) in nutch-site.xml
>>>> and set some custom properties to load the width and height of the
>>>> thumbnail. I also set the alias in parse-plugins.xml and, in the same
>>>> file, set the plugin to handle image/png files.
>>>> 
>>>> The plugin is being loaded, but every time I get a png image to parse I
>>>> get this:
>>>> 
>>>> Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException
>>>>       at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388)
>>>>       at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397)
>>>>       at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296)
>>>>       at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262)
>>>>       at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234)
>>>>       at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:119)
>>>>       at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
>>>>       at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:86)
>>>>       at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:42)
>>>>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>> 
>>>> The thing is that I have put some log messages inside the getParse()
>>>> method, but none of these messages are being logged in the hadoop.log
>>>> file, so as far as I can tell the method is not being executed.
>>>> 
>>>> Does anyone have any idea what I'm doing wrong?
>>>> 
>>>> P.S: I've attached the source of the ImageThumbnailParser.
>>>> 
>>>> Greetings!
>>>> 
>>>> 
>>>> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
>>>> INFORMATICAS...
>>>> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>>>> 
>>>> http://www.uci.cu
>>>> http://www.facebook.com/universidad.uci
>>>> http://www.flickr.com/photos/universidad_uci
>>> 
>>> 
>>> 
>>> --
>>> Lewis
>> 
>> 
>> 
>> --
>> Lewis
> 
> 
> 
> --
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
