Custom TextExtractor

Nick Allmaker Tue, 21 Aug 2007 07:48:48 -0700

In my repository, I'm storing a variety of files, including some images
of paper documents.  I'd like to be able to hook up an OCR engine to do
full-text search against these images (usually TIFFs), but I'm having
issues getting Jackrabbit to pick up my class.  To verify that the class
is picked up at all, I've written a stub version that returns fixed text,
with no OCR yet.  The stub is included at the bottom of this e-mail.


I've edited the workspace.xml to include my class in the
textFilterClasses parameter of the SearchIndex node, added my jar to the
classpath, deleted the index to force a re-index, and ran a very simple
test.  Yet, when I search for the test text, I get 0 results.
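
For reference, the SearchIndex entry in a workspace.xml generally has the
shape sketched below.  This is only an illustration, not a copy of my file:
the class and path values are assumed from a stock Jackrabbit
configuration, and only the textFilterClasses value refers to the class
at the bottom of this e-mail.

--------workspace.xml (illustrative excerpt)--------
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- assumed default index location -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- comma-separated list of extractor class names -->
  <param name="textFilterClasses"
         value="test.extractors.ImageTextExtractor"/>
</SearchIndex>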

Can someone please tell me what I'm doing wrong?

Thanks,

--Nick Allmaker

--------ImageTextExtractor.java--------
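package test.extractors;

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import org.apache.jackrabbit.extractor.AbstractTextExtractor;

// Test stub: returns fixed text for every supported image type (no OCR yet).
public class ImageTextExtractor extends AbstractTextExtractor
{

        public ImageTextExtractor()
        {
                // Content types this extractor claims to handle.
                super(new String[]{"image/tiff", "image/jpeg",
                                   "image/png", "image/gif"});
        }

        // close() can throw IOException, so the method must declare it.
        public Reader extractText(InputStream stream, String type,
                                  String encoding) throws IOException
        {
                // The stub ignores the image data, so just close the stream.
                stream.close();
                return new StringReader("This is a test extraction.");
        }

}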
