Hi Zab,

Have you tried TikaInputStream?

Here's a snippet using AutoDetectParser:

Initialize
-------------------
// reusable parse context and auto-detecting parser
this.context = new ParseContext();
this.parser = new AutoDetectParser();
// register the parser in the context so embedded documents get parsed too
this.context.set(Parser.class, parser);

Parse
---------
// buffer for the extracted text
StringWriter textBuffer = new StringWriter();
// wrap the raw input stream in a TikaInputStream
InputStream stream = TikaInputStream.get(inStream);
// BodyContentHandler writes the document's body text to the given Writer
ContentHandler handler = new BodyContentHandler(textBuffer);
// metadata object; Tika populates it during parsing
Metadata metadata = new Metadata();
try {
    // parse the document
    parser.parse(stream, handler, metadata, context);
    // return the extracted text
    return textBuffer.toString();
} catch (SAXException ex) {
    LOG.error(ex.getMessage(), ex);
    // rethrow as IOException
    throw new IOException(ex);
} catch (TikaException ex) {
    LOG.error(ex.getMessage(), ex);
    // rethrow as IOException
    throw new IOException(ex);
} finally {
    if (stream != null) {
        // close the stream
        stream.close();
    }
}
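One caveat for the large-PDF case you describe: the StringWriter above still buffers the entire extracted text in memory. Since parse() accepts any SAX ContentHandler, you can instead stream the text out as it is produced, keeping memory flat regardless of file size. Below is a minimal sketch using only JDK SAX classes; the class name and the Writer target are my own illustration, not Tika API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import org.xml.sax.helpers.DefaultHandler;

// Streams character data straight to a Writer (e.g. a FileWriter) as the
// parser produces it, instead of accumulating everything in memory.
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        try {
            // write each chunk immediately; nothing is buffered here
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would then pass it in place of the handler above, e.g. parser.parse(stream, new StreamingTextHandler(fileWriter), metadata, context), and flush/close the Writer when parsing returns.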

Regards,
Anuj

On Sun, Jul 1, 2012 at 2:21 PM, Zabrane Mickael <[email protected]> wrote:

> Hi guys,
>
> I've a couple of big PDF files (between 100-200Mb).
>
> Can someone show me a way to extract text from them chunk by chunk (i.e.
> without loading the whole file in RAM)?
>
> Is there a simple way to do it? Code to share?
>
> Thanks
> Zab
