Tika with nutch

Haya AL-Tuwaijri Fri, 17 Feb 2012 21:50:19 -0800

Hi all,

I'm developing a Nutch plug-in that implements HtmlParseFilter, and I want to 
use the Tika toolkit to convert the web page to plain text for further 
processing.
I knew that Tika has been integrated with Nutch since version 1.1, so I didn't 
download anything and just started coding.


I found that BodyContentHandler might help, so I used this code:

//=======
// import packages:

import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

//=====


public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  Metadata metadata = new Metadata();
  BodyContentHandler texthandler = new BodyContentHandler();
  Parser parser = new AutoDetectParser();
  // Wrap the raw fetched bytes so Tika can detect the type and parse them.
  InputStream in = TikaInputStream.get(content.getContent());
  try {
    parser.parse(in, texthandler, metadata, new ParseContext());
  } catch (Exception e) {
    // parse() declares IOException, SAXException and TikaException.
    LOG.error("Tika parse failed", e);
  }
  LOG.info("Content: " + texthandler.toString());
  LOG.info("is Empty? " + texthandler.toString().isEmpty());
  return parseResult;
}

Now the content is always empty, and isEmpty() returns true every time!

I don't know why. I've searched a lot, but resources are scarce, so I'm asking 
this question here on the mailing list.
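
To make clear what output is expected from the handler, here is a minimal 
sketch that uses only the JDK's SAX parser. It is not Tika's implementation; 
the class name BodyTextCollector and the sample page are hypothetical, and it 
assumes well-formed XHTML input. It collects the character data found inside 
<body> and logs the same two values as the filter above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a body-text handler; not part of Tika or Nutch.
public class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Start collecting once the <body> element opens.
        if ("body".equalsIgnoreCase(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Stop collecting when </body> closes.
        if ("body".equalsIgnoreCase(qName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep only character data that appears inside <body>.
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses well-formed XHTML bytes and returns the text inside <body>.
    public static String extract(byte[] xhtml) throws Exception {
        BodyTextCollector handler = new BodyTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xhtml), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] page = ("<html><head><title>t</title></head>"
            + "<body><p>Hello Nutch</p></body></html>")
            .getBytes(StandardCharsets.UTF_8);
        String body = extract(page);
        System.out.println("Content: " + body);
        System.out.println("is Empty? " + body.isEmpty());
    }
}
```

For a non-empty page, this is the kind of result I expected from 
texthandler.toString(): the body text, with isEmpty() returning false.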

Thanks in advance, I appreciate it :)
