Hi all:

I'm trying to write a plugin that detects the text surrounding images (img 
tags) in HTML pages. The plugin implements HTMLParseFilter: when I get an HTML 
page, I walk the DocumentFragment, find the img tags, and then extract the text 
from their closest neighbors using some basic heuristics. This works just fine.

I also have another plugin that generates a thumbnail for each image (I check 
the MIME type). The problem is that I want to keep the thumbnail and the 
surrounding text linked to the same document.

Right now, for testing purposes, I have a test.html file with an embedded image 
and some text around it. Nutch first crawls the .html file, uses the custom 
image caption plugin to extract the text, and stores it as one document in 
Solr. Then it fetches the image, uses the image thumbnailer plugin to generate 
the thumbnail, and stores it as another Solr document. Is there any way to 
merge these fields into one Solr document? If that's not possible, any advice 
on how to structure the schema?
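
To make the first part concrete, here is a rough sketch of the kind of 
traversal I mean. It is not the actual plugin code: it uses only the JDK's 
org.w3c.dom API on a parsed XHTML string instead of Nutch's HTMLParseFilter and 
DocumentFragment, and the class name ImgCaptionSketch, the method names, and 
the nearest-sibling heuristic are all made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ImgCaptionSketch {

    // Maps each img "src" to the text of its nearest non-empty
    // previous and next siblings (an illustrative heuristic only).
    static Map<String, String> captions(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(root, out);
        return out;
    }

    // Depth-first walk over the DOM tree, recording every img element found.
    private static void collect(Node node, Map<String, String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "img".equalsIgnoreCase(node.getNodeName())) {
            Node src = node.getAttributes().getNamedItem("src");
            if (src != null) {
                String before = nearestText(node, false);
                String after = nearestText(node, true);
                out.put(src.getNodeValue(), (before + " " + after).trim());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Scans siblings in one direction and returns the first non-blank text,
    // looking only at text nodes and elements (comments are skipped).
    private static String nearestText(Node img, boolean forward) {
        Node n = forward ? img.getNextSibling() : img.getPreviousSibling();
        while (n != null) {
            short type = n.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.ELEMENT_NODE) {
                String text = n.getTextContent().trim();
                if (!text.isEmpty()) {
                    return text;
                }
            }
            n = forward ? n.getNextSibling() : n.getPreviousSibling();
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for test.html, written as well-formed XHTML.
        String xhtml = "<html><body><p>A cat on a sofa.</p>"
                + "<img src=\"cat.png\"/><p>Taken in 2012.</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(captions(doc));
    }
}
```

Keying the result by the image's src is deliberate here: the image URL is the 
one value that both the caption step and the thumbnail step can see, so it 
looks like the natural candidate for linking the two.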

Greetings!