Thanks Kumar for sharing this code.
I'll test it tomorrow.
Regards,
Zabrane
On Jul 1, 2012, at 11:05 AM, Anuj Kumar wrote:
> Hi Zab,
>
> Have you tried TikaInputStream?
>
> Here is a snippet with AutoDetectParser:
>
> Initialize
> -------------------
> this.context = new ParseContext();
> this.parser = new AutoDetectParser();
> this.context.set(Parser.class, parser);
>
> Parse
> ---------
> // create a string writer to collect the extracted text
> StringWriter textBuffer = new StringWriter();
> // wrap the input stream to the target file in a TikaInputStream
> InputStream stream = TikaInputStream.get(inStream);
> // create a content handler that writes plain text into the buffer
> ContentHandler handler = new TeeContentHandler(
>         getTextContentHandler(textBuffer));
> // create a metadata object to receive the document metadata
> Metadata metadata = new Metadata();
> try {
>     // parse the document
>     parser.parse(stream, handler, metadata, context);
>     // return the parsed text
>     return textBuffer.toString();
> } catch (SAXException ex) {
>     // log the exception and rethrow as IOException
>     LOG.error(ex.getMessage());
>     throw new IOException(ex);
> } catch (TikaException ex) {
>     // log the exception and rethrow as IOException
>     LOG.error(ex.getMessage());
>     throw new IOException(ex);
> } finally {
>     if (stream != null) {
>         // close the stream
>         stream.close();
>     }
> }
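The snippet above calls a getTextContentHandler helper that is not shown. As a minimal JDK-only sketch, assuming the helper simply forwards SAX character events into the supplied writer (in a real Tika project, returning new BodyContentHandler(writer) would be the usual choice), it might look like this:

```java
import java.io.Writer;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the getTextContentHandler helper referenced
// in the snippet; it writes every characters() SAX event to the writer.
public class TextHandlers {
    public static ContentHandler getTextContentHandler(Writer writer) {
        return new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                try {
                    // append the text content of the current SAX event
                    writer.write(ch, start, length);
                } catch (java.io.IOException e) {
                    throw new RuntimeException(e);
                }
            }
        };
    }
}
```

Wrapping it in TeeContentHandler as above also lets you fan the SAX events out to additional handlers (e.g. one per output format) in a single parse pass.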
>
> Regards,
> Anuj
>
> On Sun, Jul 1, 2012 at 2:21 PM, Zabrane Mickael <[email protected]> wrote:
> Hi guys,
>
> I have a couple of big PDF files (between 100 and 200 MB).
>
> Can someone show me a way to extract text from them chunk by chunk (i.e.
> without loading the whole file into RAM)?
>
> Is there a simple way to do it? Any code to share?
>
> Thanks
> Zab
>