Hmm, this is really a Tika question, which probably explains why you have received so little response from the community, unfortunately.
So the problem is that isEmpty() always returns true, which indicates that _nothing_ is being produced as output from your parser. I would add a try/catch, as we do in TikaParser, to either feed content to the output stream or catch the case where there is no content to be fed.

Have a look at http://tika.apache.org/1.0/parser.html. There is content there on the BodyContentHandler as well as the various readers and writers you need to get your implementation up and running.

2012/2/19 HaYa aziz <[email protected]>

> I try to use writer also without any luck !
>
> StringWriter writer = new StringWriter();
> Metadata metadata = new Metadata();
> ContentHandler texthandler = new BodyContentHandler(writer);
> Parser parser = new AutoDetectParser();
> InputStream in = TikaInputStream.get(content.getContent());
> parser.parse(in, texthandler, metadata, new ParseContext());
> LOG.info("Content: " + writer.toString());
> LOG.info("is Empty? " + writer.toString().isEmpty());
>
> Where is the problem !!!!
>
> > To: [email protected]
> > Subject: Tika with nutch
> > Date: Sat, 18 Feb 2012 08:49:43 +0300
> >
> > Hi all ,,
> >
> > I'm developing a plug-in in Nutch that implement HtmlParserFilter, I want to use Tika tool kit to be able to convert the web page to plain text to be processed.
> > I knew that Tika is now integrated with Nutch since version 1.1, so I didn't download anything and start coding.
> >
> > found that BodyContentHandler may help so I use this code:
> >
> > //=======
> > //import packages:
> >
> > import org.apache.tika.sax.BodyContentHandler;
> > import org.apache.tika.metadata.Metadata;
> > import org.apache.tika.parser.ParseContext;
> > import org.apache.tika.parser.AutoDetectParser;
> > import org.apache.tika.parser.Parser;
> > import org.apache.tika.io.TikaInputStream;
> >
> > //=====
> >
> > public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
> > {
> >     Metadata metadata = new Metadata();
> >     BodyContentHandler texthandler = new BodyContentHandler();
> >     Parser parser = new AutoDetectParser();
> >     InputStream in = TikaInputStream.get(content.getContent());
> >     parser.parse(in, texthandler, metadata, new ParseContext());
> >     LOG.info("Content: " + texthandler.toString());
> >     LOG.info("is Empty? " + texthandler.toString().isEmpty());
> > }
> >
> > Now, The content is always empty, and isEmpty() gives me true all the time !
> >
> > I don't know why, I've searched a lot, resources are rare, so I asked this question here in the mailing list
> >
> > Thanks in advanced and I appreciated :)

--
*Lewis*
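
P.S. A minimal sketch of the try/catch idea, in case it helps. This is not the actual TikaParser code. The names GuardedExtract, Extractor, Status and Result are made up for illustration, and a stand-in extractor takes the place of the Tika call so the sketch runs with only the JDK. The Tika call shown in the comment assumes the Tika jars are visible to your plugin. The point is only to tell apart three cases that all look like "isEmpty() is true": no input bytes, a parse that threw, and a parse that ran but wrote nothing.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Sketch: run a parse step defensively and report why no text came back. */
final class GuardedExtract {

    /** Hypothetical hook; with Tika the body would be the parser.parse(...) call. */
    interface Extractor {
        void extract(InputStream in, Writer out) throws Exception;
    }

    enum Status { OK, EMPTY_INPUT, EMPTY_OUTPUT, FAILED }

    static final class Result {
        final Status status;
        final String text;
        final Exception error; // non-null only when status is FAILED

        Result(Status status, String text, Exception error) {
            this.status = status;
            this.text = text;
            this.error = error;
        }
    }

    static Result run(byte[] raw, Extractor extractor) {
        // Case 1: nothing was fetched, so there is nothing to parse.
        if (raw == null || raw.length == 0) {
            return new Result(Status.EMPTY_INPUT, "", null);
        }
        StringWriter writer = new StringWriter();
        try (InputStream in = new ByteArrayInputStream(raw)) {
            extractor.extract(in, writer);
        } catch (Exception e) {
            // Case 2: the parser threw; keep the exception so it can be logged.
            return new Result(Status.FAILED, writer.toString(), e);
        }
        // Case 3: the parser ran; check whether it actually wrote any text.
        String text = writer.toString();
        Status status = text.trim().isEmpty() ? Status.EMPTY_OUTPUT : Status.OK;
        return new Result(status, text, null);
    }

    public static void main(String[] args) {
        // With Tika available, the extractor would be something like:
        //   (in, out) -> new AutoDetectParser().parse(
        //       in, new BodyContentHandler(out), new Metadata(), new ParseContext());
        // The stand-in below just copies ASCII bytes through to the writer.
        Extractor standIn = (in, out) -> {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        };
        byte[] raw = "<p>hello</p>".getBytes(StandardCharsets.UTF_8);
        Result r = run(raw, standIn);
        System.out.println(r.status + ": " + r.text);
    }
}
```

In your filter() you would pass content.getContent() as raw and log result.status (plus result.error when it is FAILED) instead of only logging isEmpty(). That should at least show whether the bytes never arrive or the parse itself is going wrong.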

