Sorry for re-asking.
Context:
I have thousands of PDFs whose text is extracted using Tika and then 
indexed/analyzed in Lucene. And there seems to be "cryptic" text (binary data?) 
in some of the PDFs.

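// "is" is the InputStream of the PDF; "logger" and "returnValue" are declared elsewhere
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler( -1 ); // -1 disables the write limit
ParseContext context = new ParseContext();
context.set( Parser.class, parser ); // use the same parser for embedded documents

try 
{
        parser.parse( is, handler, metadata, context );
        returnValue = handler.toString();
}
catch ( final Throwable e )
{
        logger.error( "failed to extract text from input stream", e );
}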

As there are so many, I don't know which PDFs cause problems. BUT I know that I 
don't need this "garbage", i.e. I am only interested in the (mostly German) 
text, nothing else (no images etc.). 
How can I make sure only real text is extracted?
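
To illustrate what I mean by "garbage", here is a rough sketch of the kind of 
post-filter I could imagine applying to the extracted string before indexing. 
This is not a Tika or Lucene feature: the class name TextHeuristic, the 
punctuation list and the 0.85 threshold are my own assumptions and would need 
tuning. It would also miss garbage that happens to consist of valid letters.

```java
public final class TextHeuristic {

    // Assumed threshold (hypothetical): minimum share of plausible characters.
    private static final double MIN_TEXT_RATIO = 0.85;

    // Assumed set of punctuation that counts as "normal" text.
    private static final String PUNCTUATION = ".,;:!?()-\"'";

    private TextHeuristic() {}

    /** Share of code points that are letters, digits, whitespace or listed punctuation. */
    public static double textRatio(String s) {
        if (s == null || s.isEmpty()) {
            return 0.0;
        }
        int plausible = 0;
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            // Unicode-aware, so German umlauts and sharp s count as letters.
            if (Character.isLetterOrDigit(cp)
                    || Character.isWhitespace(cp)
                    || PUNCTUATION.indexOf(cp) >= 0) {
                plausible++;
            }
        }
        return (double) plausible / total;
    }

    /** True if the string mostly consists of plausible text characters. */
    public static boolean looksLikeText(String s) {
        return textRatio(s) >= MIN_TEXT_RATIO;
    }
}
```

The idea would be to call TextHeuristic.looksLikeText( returnValue ) after 
parsing and to skip (or at least log) documents that fail the check, so I can 
find out which PDFs are affected.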

Thx
Clemens

-----Original Message-----
From: Clemens Wyss DEV [mailto:[email protected]] 
Sent: Monday, 7 July 2014 08:18
To: [email protected]
Subject: Determine binary pdf?

What, if at all possible, is the preferred way to determine if a document 
(namely a pdf) is of "binary nature"?

I am extracting the text of many PDF user manuals for Lucene indexing, and some of 
them deliver "absurd binary terms", which I would like to omit.

Thx
Clemens
