Sounds good to me Jorge

We know that parse-tika works flawlessly about 99.9% of the time. I
very rarely have problems using it.
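
For the thumbnail step discussed further down this thread (take the fetched bytes, build an image, resize it, base64-encode it for Solr), here is a minimal sketch using only the JDK. It is not the attached ImageThumbnailParser and uses no Nutch or Tika API: the class name ThumbnailEncoder and both method names are made up for illustration, and java.util.Base64 assumes Java 8 or later.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Nutch or Tika.
final class ThumbnailEncoder {

    private ThumbnailEncoder() {}

    /** Decodes raw image bytes, scales them to width x height, and returns a base64-encoded PNG. */
    static String toBase64Thumbnail(byte[] imageBytes, int width, int height) {
        BufferedImage source;
        try {
            source = ImageIO.read(new ByteArrayInputStream(imageBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (source == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IllegalArgumentException("Unsupported or corrupt image data");
        }

        BufferedImage thumb = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = thumb.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Scales the source to exactly width x height (aspect ratio is not preserved).
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return Base64.getEncoder().encodeToString(encodePng(thumb));
    }

    /** Serializes an image as PNG bytes. */
    static byte[] encodePng(BufferedImage image) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Inside a Nutch parser, the input bytes would presumably come from content.getContent(), and the width and height from the custom properties in nutch-site.xml; the call would then look like ThumbnailEncoder.toBase64Thumbnail(bytes, width, height).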

hth

On Wed, Jun 27, 2012 at 10:07 PM, Jorge Luis Betancourt Gonzalez
<[email protected]> wrote:
> Hi,
>
> I agree with you, and it is a great idea to rely on Tika to parse the files, but
> in this particular case, when all I want to do is encode the content into
> base64, should I write a custom parser for Tika and rely on the parse-tika
> plugin to do its magic?
>
> Jorge
>
> ----- Original Message -----
> From: "Lewis John Mcgibbney" <[email protected]>
> To: [email protected]
> Sent: Wednesday, June 27, 2012 16:55:12
> Subject: Re: Problem with NullPointerException on custom Parser
>
> Hi,
>
> I think you are partly correct.
>
> The core Nutch code itself doesn't do any parsing as such. All parsing
> is delegated to external parsing libraries.
>
> Basically, we need to define a parser to do the parsing. Using Tika as
> a wrapper for mimeType detection and subsequent parsing saves us a bit
> of overhead.
>
> Lewis
>
> On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez
> <[email protected]> wrote:
>> Hi Lewis, thank you for the reply. Is it mandatory to write a wrapper around
>> Tika? I thought this was optional, since I don't really parse the content
>> searching for anything. I only get the content, transform it into an Image
>> object, resize it, and then encode it with base64 to store on the Solr backend.
>>
>> So I thought that all this processing could be done in the getParse method.
>>
>> Is my assumption correct, or is it mandatory to write my desired logic using
>> Tika?
>>
>> ----- Original Message -----
>> From: "Lewis John Mcgibbney" <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, June 27, 2012 16:33:01
>> Subject: Re: Problem with NullPointerException on custom Parser
>>
>> Hi Jorge,
>>
>> It doesn't look like you're actually using Tika as a wrapper for your
>> custom parser at all...
>>
>> You would need to specify the correct Tika config by calling
>> tikaConfig.getParser
>>
>> hth
>>
>> On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez
>> <[email protected]> wrote:
>>> Hi all:
>>>
>>> I'm working on a custom parser plugin to generate thumbnails from images
>>> fetched with Nutch 1.4. I'm doing this because the thumbnails will be
>>> converted into a base64 encoded string and stored on a Solr backend.
>>>
>>> So I basically wrote a custom parser (to which I send all png images, for
>>> example). I enabled the plugin (image-thumbnail) in nutch-site.xml and set
>>> some custom properties to load the width and height of the thumbnail. I also
>>> set the alias in parse-plugins.xml and set the plugin to handle
>>> image/png files, also in this file.
>>>
>>> The plugin is being loaded, but every time I get a png image to parse I get
>>> this:
>>>
>>> Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException
>>>        at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388)
>>>        at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397)
>>>        at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296)
>>>        at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262)
>>>        at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234)
>>>        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:119)
>>>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
>>>        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:86)
>>>        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:42)
>>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>
>>> The thing is that I have put some log messages inside the getParse() method,
>>> but none of these messages are being logged in the hadoop.log file, so as far
>>> as I can tell the method is not being executed.
>>>
>>> Does anyone have any idea what I'm doing wrong?
>>>
>>> P.S: I've attached the source of the ImageThumbnailParser.
>>>
>>> Greetings!
>>>
>>>
>>> 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
>>> INFORMATICAS...
>>> CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
>>>
>>> http://www.uci.cu
>>> http://www.facebook.com/universidad.uci
>>> http://www.flickr.com/photos/universidad_uci
>>
>>
>>
>> --
>> Lewis
>>
>
>
>
> --
> Lewis
>



-- 
Lewis
