I tried adding this line

LOG.info ("Status : "+ new ParseStatus().toString());

after

parser.parse(in, texthandler, metadata, new ParseContext());    



and got this result:

Status : notparsed(0,0)

So why couldn't it be parsed?

Haya.
From: [email protected]
To: [email protected]
Subject: Tika with Nutch
Date: Mon, 27 Feb 2012 08:35:31 +0300








Hi all,


I'm developing a Nutch plug-in that implements
HtmlParseFilter, and I want to use the Tika toolkit to convert the
web page to plain text for processing.

I know that Tika has been integrated with Nutch since version 1.1, so I didn't
download anything and just started coding.


I found that BodyContentHandler might help, so I used this code:


//=======
// import packages:

import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

//=====



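public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc)
{
    Metadata metadata = new Metadata();
    BodyContentHandler texthandler = new BodyContentHandler();
    Parser parser = new AutoDetectParser();

    // Wrap the raw fetched bytes so Tika can detect the type and parse them.
    InputStream in = TikaInputStream.get(content.getContent());
    try {
        parser.parse(in, texthandler, metadata, new ParseContext());
    } catch (Exception e) {
        // parse() can throw IOException, SAXException or TikaException.
        LOG.error("Tika parse failed", e);
    }

    LOG.info("Content: " + texthandler.toString());
    LOG.info("is Empty? " + texthandler.toString().isEmpty());

    // Pass the existing parse result through unchanged.
    return parseResult;
}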


Now the content is always empty: isEmpty() returns true every time,
and there are no errors or exceptions.
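
One check that might help narrow this down is to confirm that the bytes
handed to the parser are not already empty. The sketch below is only an
assumption about how that could be done: ContentProbe and describe() are
hypothetical names, not part of Nutch or Tika, and decoding as UTF-8 is a
guess that is only good enough for a quick look in the log.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical debugging helper; not part of Nutch or Tika.
final class ContentProbe {
    private ContentProbe() {}

    /**
     * Summarizes raw page bytes as their length plus a short text preview.
     * previewChars is assumed to be zero or positive.
     */
    static String describe(byte[] raw, int previewChars) {
        if (raw == null) {
            return "bytes=null";
        }
        if (raw.length == 0) {
            return "bytes=0";
        }
        // Assumption: UTF-8. Pages in other encodings may show replacement
        // characters, which is acceptable for a diagnostic log line.
        String text = new String(raw, StandardCharsets.UTF_8);
        String preview = text.length() <= previewChars
                ? text
                : text.substring(0, previewChars) + "...";
        return "bytes=" + raw.length + " preview=" + preview;
    }
}
```

In the filter above this could be logged before calling parse(), for
example LOG.info(ContentProbe.describe(content.getContent(), 80)). If it
reports real HTML, the input is fine and the problem is more likely in how
the parser is set up.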

I don't know why. I've searched a lot, but resources are scarce, so I'm asking
here on the mailing list.


Thanks in advance, I appreciate it :)


                                          
        
        
                                                                                
  
