Thanks Kumar for sharing this code.
I'll test it tomorrow.
Regards,
Zabrane
On Jul 1, 2012, at 11:05 AM, Anuj Kumar wrote:
> Hi Zab,
>
> Have you tried TikaInputStream?
>
> Here is a snippet with AutoDetectParser:
>
> Initialize
> -------------------
> this.context = new ParseContext();
> this.parser = new AutoDetectParser();
> this.context.set(Parser.class, parser);
>
> Parse
> ---------
> // create a string writer to collect the extracted text
> StringWriter textBuffer = new StringWriter();
> // wrap the input stream to the target file in a TikaInputStream
> InputStream stream = TikaInputStream.get(inStream);
> // create a content handler that writes plain text into the buffer
> ContentHandler handler = new TeeContentHandler(
>         getTextContentHandler(textBuffer));
> // create a metadata object to receive the document metadata
> Metadata metadata = new Metadata();
> try {
>     // parse the document
>     parser.parse(stream, handler, metadata, context);
>     // return the parsed text
>     return textBuffer.toString();
> } catch (SAXException ex) {
>     // log the exception and rethrow as IOException
>     LOG.error(ex.getMessage());
>     throw new IOException(ex);
> } catch (TikaException ex) {
>     // log the exception and rethrow as IOException
>     LOG.error(ex.getMessage());
>     throw new IOException(ex);
> } finally {
>     if (stream != null) {
>         // close the stream
>         stream.close();
>     }
> }
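The snippet above calls a getTextContentHandler helper that is not shown. As a minimal JDK-only sketch, assuming the helper simply forwards SAX character events into the supplied writer (in a real Tika project, returning new BodyContentHandler(writer) would be the usual choice), it might look like this:

```java
import java.io.Writer;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the getTextContentHandler helper referenced
// in the snippet; it writes every characters() SAX event to the writer.
public class TextHandlers {
    public static ContentHandler getTextContentHandler(Writer writer) {
        return new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                try {
                    // append the text content of the current SAX event
                    writer.write(ch, start, length);
                } catch (java.io.IOException e) {
                    throw new RuntimeException(e);
                }
            }
        };
    }
}
```

Wrapping it in TeeContentHandler as above also lets you fan the SAX events out to additional handlers (e.g. one per output format) in a single parse pass.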
>
> Regards,
> Anuj
>
> On Sun, Jul 1, 2012 at 2:21 PM, Zabrane Mickael <[email protected]> wrote:
> Hi guys,
>
> I have a couple of big PDF files (between 100 and 200 MB).
>
> Can someone show me a way to extract text from them chunk by chunk (i.e.
> without loading the whole file into RAM)?
>
> Is there a simple way to do it? Any code to share?
>
> Thanks
> Zab
>