Hi

> I have a plugin that generates a thumbnail from images and stores it in Solr.
> In a previous thread, Julien recommended that this plugin be rewritten
> as an HtmlParseFilter, so that Tika could extract the usual metadata
> from the image and my custom plugin would generate the thumbnail in
> addition to all the other metadata. So far so good, this works just fine.
>

Great


>
> But how can I configure Nutch so that my plugin only gets the image files?
> Right now the plugin tries to generate a thumbnail for every HTML
> page crawled by Nutch.
>
> I have this in my parse-plugins.xml:
>
>         <mimeType name="image/png">
>           <plugin id="parse-thumb" />
>         </mimeType>
>
>         <mimeType name="image/jpg">
>           <plugin id="parse-thumb" />
>         </mimeType>
>

Well, that won't prevent other MIME types from going through Tika and then
your parser.


>
> And in the plugin.xml inside my plugin's folder:
>
>       <implementation id="ImageThumbnailParser"
>                       class="org.apache.nutch.parse.thumbnail.ImageThumbnailParser"/>
>                       <parameter name="contentType"
>                                  value="image/png|image/jpeg|image/jpg|image/gif|image/ico|image/bmp"/>
>                       <parameter name="pathSuffix"  value=""/>
>
> What am I missing?
>

Simply add some code in your parser to get the MIME type of the current
document and skip it if it does not match what you want.
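
Something along these lines might do as a starting point. It is only a rough
sketch: ThumbnailMimeFilter and accepts() are made-up names, not part of
Nutch, and the accepted types are simply copied from the contentType
parameter in your plugin.xml. It is plain Java with no Nutch dependency, so
you can try it on its own.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): decides whether a document is an image we want. */
final class ThumbnailMimeFilter {

    // Assumption: same list as the contentType parameter quoted from plugin.xml.
    private static final Set<String> ACCEPTED = Set.of(
        "image/png", "image/jpeg", "image/jpg", "image/gif", "image/ico", "image/bmp");

    private ThumbnailMimeFilter() {}

    /** Returns true only for the accepted image types; null and anything else is skipped. */
    static boolean accepts(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8", then normalise whitespace and case.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
            .trim()
            .toLowerCase(Locale.ROOT);
        return ACCEPTED.contains(base);
    }
}
```

In your filter you would then return the incoming ParseResult untouched when
ThumbnailMimeFilter.accepts(content.getContentType()) is false, assuming your
Nutch version passes the Content object to the filter method and exposes
getContentType() on it.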

HTH

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
