Sorry for re-asking.
Context:
I have thousands of PDFs whose text is extracted using Tika and then
indexed/analyzed in Lucene. There seems to be "cryptic" text (binary data?)
in some of the PDFs. The extraction looks like this:
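import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// "is" is the PDF's InputStream and "returnValue" a String, both declared elsewhere;
// "parser" is assumed to be an AutoDetectParser
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler( -1 ); // -1 disables the write limit
ParseContext context = new ParseContext();
context.set( Parser.class, parser ); // reuse the same parser for embedded documents
try
{
    parser.parse( is, handler, metadata, context );
    returnValue = handler.toString();
}
catch ( final Throwable e )
{
    logger.error( "failed to extract text from input stream", e );
}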
As there are so many, I don't know which PDFs cause problems. BUT I do know that I
don't need this "garbage", i.e. I am only interested in the (mostly German)
text, nothing else (no images etc.).
How can I make sure only real text is extracted?
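One conceivable workaround, independent of Tika, would be to score the extracted string after the fact and skip documents that look like garbage. The following is only a rough sketch under assumptions: the class name TextQuality, the punctuation set and the 0.8 threshold are made up for illustration, and it will not catch garbage that happens to consist of valid letters.

```java
final class TextQuality {

    // Punctuation treated as "plausible"; an arbitrary choice for this sketch.
    private static final String PUNCTUATION = ".,;:!?()\"'-/";

    private TextQuality() {}

    /** Fraction of code points that are letters, digits, whitespace or common punctuation. */
    static double plausibleRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int plausible = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (Character.isLetterOrDigit(cp)
                    || Character.isWhitespace(cp)
                    || PUNCTUATION.indexOf(cp) >= 0) {
                plausible++;
            }
        }
        return (double) plausible / total;
    }

    /** True if at least the given fraction of the text looks like ordinary text. */
    static boolean looksLikeText(String text, double threshold) {
        return plausibleRatio(text) >= threshold;
    }
}
```

In the snippet above this could be applied as TextQuality.looksLikeText( handler.toString(), 0.8 ) before the text is handed to Lucene; the threshold would need tuning against real documents.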
Thx
Clemens
-----Original Message-----
From: Clemens Wyss DEV [mailto:[email protected]]
Sent: Monday, 7 July 2014 08:18
To: [email protected]
Subject: Determine binary pdf?
What, if at all possible, is the preferred way to determine whether a document
(namely a PDF) is of a "binary nature"?
I am extracting the text of many PDF user manuals for Lucene indexing, and some of
them yield "absurd binary terms", which I would like to omit.
Thx
Clemens