Hi
> I've a plugin to generate a thumbnail from images and store this in Solr.
> From a previous thread Julien recommended that this plugin should be
> rewritten as a HtmlParseFilter, and with this Tika could extract the usual
> metadata from the image, and my custom plugin would generate the thumbnail
> in addition to all other metadata. So far so good, this works just fine.
>

Great

>
> But how can I configure Nutch so that my plugin only gets the image files?
> Right now the plugin tries to generate a thumbnail for every HTML
> page crawled by Nutch.
>
> I've this in my parse-plugins.xml
>
> <mimeType name="image/png">
> <plugin id="parse-thumb" />
> </mimeType>
>
> <mimeType name="image/jpg">
> <plugin id="parse-thumb" />
> </mimeType>
>

Well, that won't prevent other mimetypes from going through Tika and then your parser.

> And in the plugin.xml inside my plugin's folder:
>
> <implementation id="ImageThumbnailParser"
> class="org.apache.nutch.parse.thumbnail.ImageThumbnailParser"/>
> <parameter name="contentType"
> value="image/png|image/jpeg|image/jpg|image/gif|image/ico|image/bmp"/>
> <parameter name="pathSuffix" value=""/>
>
> What am I missing?
>

Simply add some code in your parser to get the mimetype of the current doc and skip it if it does not match what you want.

HTH

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
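
A minimal sketch of the mimetype check suggested above, in case it helps. The class name ImageMimeFilter and its method are hypothetical (they are not part of Nutch), and the list of types is simply copied from the contentType parameter in the quoted plugin.xml. The helper has no Nutch dependency so it can be tried on its own.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a document's
// content type is one the thumbnail plugin should handle.
final class ImageMimeFilter {

    // Assumption: same types as the contentType parameter in the quoted plugin.xml.
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "image/png", "image/jpeg", "image/jpg",
            "image/gif", "image/ico", "image/bmp"));

    private ImageMimeFilter() {
        // Static helper only; no instances.
    }

    // Returns true only for the supported image types; null and anything else is skipped.
    static boolean isSupportedImage(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop any parameters (for example "; charset=UTF-8") before comparing.
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return SUPPORTED.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

Inside the plugin's filter method the check would go first, along the lines of: if (!ImageMimeFilter.isSupportedImage(content.getContentType())) return parseResult; so that HTML pages are passed through untouched. This assumes the Content object handed to the filter reports the detected type through getContentType(); please verify against the Nutch version in use.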

