Hi all: I'm working on an image search engine using a combination of Nutch and Solr. With Nutch and Tika I extract some metadata from the images, so far so good. But I'm trying to improve the accuracy of the results using the text surrounding the images.
I know there are several papers published on this subject, using various techniques and algorithms. Basically I'm trying to use heuristic methods that don't require a lot of processing. In https://webarchive.jira.com/wiki/display/SOC06/Image+annotation+with+surrounding+text I found a few heuristics, which I'm implementing in a custom Nutch plugin:

- the text of the <tr> node above or below the one containing the image,
- the text of the <tr> node in which the image appears,
- the text of the paragraph in which the image appears,
- the textual content of the headings preceding the image.

But I think this is not enough. Can anyone offer some advice or suggest other heuristic methods for this quest? Thanks in advance, greetings!

PS: Sorry for my English, but it's not my native language :-S
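For what it's worth, here is a minimal standalone Python sketch (not actual Nutch plugin code, which would be Java against the HtmlParseFilter API) of two of the heuristics above: for each <img>, it records the alt text, the text of the enclosing paragraph, and the nearest preceding heading. The HTML snippet and field names are just made-up examples.

```python
from html.parser import HTMLParser

class ImageContextParser(HTMLParser):
    """For each <img>, collect: its alt text, the text of the enclosing
    <p> (if any), and the nearest preceding heading (h1-h6)."""

    def __init__(self):
        super().__init__()
        self.last_heading = ""     # text of the most recent closed heading
        self._heading_buf = None   # collects text while inside a heading
        self._para_buf = None      # collects text while inside a <p>
        self._pending_imgs = []    # images waiting for their paragraph text
        self.contexts = []         # one dict of context per image found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._heading_buf = []
        elif tag == "p":
            self._para_buf = []
            self._pending_imgs = []
        elif tag == "img":
            ctx = {"src": attrs.get("src", ""),
                   "alt": attrs.get("alt", ""),
                   "heading": self.last_heading,
                   "paragraph": ""}
            self.contexts.append(ctx)
            if self._para_buf is not None:   # image sits inside a paragraph
                self._pending_imgs.append(ctx)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6") and self._heading_buf is not None:
            self.last_heading = " ".join("".join(self._heading_buf).split())
            self._heading_buf = None
        elif tag == "p" and self._para_buf is not None:
            text = " ".join("".join(self._para_buf).split())
            for ctx in self._pending_imgs:   # backfill paragraph text
                ctx["paragraph"] = text
            self._para_buf = None
            self._pending_imgs = []

    def handle_data(self, data):
        if self._heading_buf is not None:
            self._heading_buf.append(data)
        if self._para_buf is not None:
            self._para_buf.append(data)

parser = ImageContextParser()
parser.feed("""
<h2>Lighthouses of Cuba</h2>
<p>The Morro lighthouse <img src="morro.jpg" alt="El Morro"> guards the bay.</p>
""")
print(parser.contexts)
```

The <tr>-based heuristics would follow the same pattern, buffering text per table row and remembering the previous row's text when an image is seen. In a real Nutch plugin you would walk the DOM that the HTML parser already produces instead of re-parsing.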

