Hi all: I'm trying to write a plugin that detects the text surrounding images (img tags) in HTML pages. I wrote the plugin as an HTMLParseFilter implementation: when I get an HTML page, I walk the DocumentFragment, find the img tags, and then pick up the text in their closest neighbors using some basic heuristics. This works just fine. I have another plugin that generates a thumbnail for each image (I check the MIME type). The thing is that I want to keep the thumbnail and the surrounding text linked to the same document. Right now, for testing purposes, I have a test.html file with an embedded image and some text around it. Nutch first crawls the .html file, uses the custom image caption plugin to extract the text, and stores it as one document in Solr; then it fetches the image, uses the image thumbnailer plugin to generate the thumbnail, and stores it as another Solr document. Is there any way to merge these two fields into one Solr document? If that's not possible, any advice on how to structure the schema?
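
For reference, the traversal is roughly like the sketch below. It is a simplified, standalone illustration that uses only the JDK's org.w3c.dom classes, not the actual plugin code. The class name ImageCaptionSketch, the method names, and the "nearest non-empty sibling on each side" rule are assumptions made for the example, and the Nutch HTMLParseFilter wiring is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical, standalone sketch: not the real plugin and not a Nutch API.
public class ImageCaptionSketch {

    /** Maps each img "src" value to the text found in its nearest siblings. */
    public static Map<String, String> surroundingText(Node root) {
        Map<String, String> captions = new LinkedHashMap<>();
        walk(root, captions);
        return captions;
    }

    // Depth-first walk over the DOM tree, looking for img elements.
    private static void walk(Node node, Map<String, String> captions) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "img".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src");
            String before = nearestText(node, false);
            String after = nearestText(node, true);
            captions.put(src, (before + " " + after).trim());
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, captions);
        }
    }

    // Assumed heuristic: first sibling (text or element) on the given side
    // that contains non-blank text; comments and other node types are skipped.
    private static String nearestText(Node img, boolean forward) {
        Node n = forward ? img.getNextSibling() : img.getPreviousSibling();
        while (n != null) {
            short type = n.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.ELEMENT_NODE) {
                String text = n.getTextContent();
                if (text != null && !text.trim().isEmpty()) {
                    return text.trim().replaceAll("\\s+", " ");
                }
            }
            n = forward ? n.getNextSibling() : n.getPreviousSibling();
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        // Well-formed markup so the JDK XML parser can stand in for an HTML parser.
        String html = "<div><p>Before text</p><img src=\"a.png\"/><p>After text</p></div>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        System.out.println(surroundingText(doc));
    }
}
```

The map is keyed by the img src value so that the extracted text can later be matched against the image URL that the thumbnailer plugin sees.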
Greetings!

