PDF text extraction problems

Ehsan Thu, 03 Jun 2010 00:30:41 -0700

Hello,
I'm trying to parse a PDF file. I first tried this code:

          // the document to be parsed
          InputStream input = new FileInputStream(new File(resourceLocation));
          ContentHandler textHandler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          PDFParser parser = new PDFParser();
          ParseContext context = new ParseContext();
          parser.parse(input, textHandler, metadata, context);
          input.close();


Then I tried the Tika class:
       
        Tika tika = new Tika();
        InputStream input = new FileInputStream(new File(resourceLocation));
        Metadata metadata = new Metadata();
        String content = tika.parseToString(input, metadata);

Both snippets behave the same way: they extract some of the text in the PDF
file but leave out the rest. I tested this with a 1 MB file and a 100 KB file.
I looked around and found a thread in the Tika mailing list archives, "Tika
maxStringLength limit reached", which suggested raising maxStringLength like
this:
  tika.setMaxStringLength(10*1024*1024);

That made no difference. Am I doing something wrong? How can I parse the
entire file?

cheers
ehsan
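
A possible way to narrow this down, offered as a sketch rather than a confirmed
diagnosis: if the extracted text always stops at the same length, a character
limit is a more likely cause than the PDF itself. As far as I know, both the
Tika facade and the no-argument BodyContentHandler default to a cap of about
100,000 characters; treat that figure as an assumption and check it against the
Tika version in use. The names below (TruncationCheck, looksTruncated,
ASSUMED_DEFAULT_LIMIT) are made up for illustration and are not part of Tika.

```java
// Hypothetical helper, not part of Tika: flags extracted text whose length
// has reached a given character limit, which suggests it was cut off.
class TruncationCheck {

    // ASSUMPTION: the default character cap; verify for your Tika version.
    static final int ASSUMED_DEFAULT_LIMIT = 100000;

    // Returns true when the text is at least as long as the limit.
    // A negative limit is treated as "no limit", so nothing is flagged.
    static boolean looksTruncated(String extracted, int limit) {
        return limit >= 0 && extracted.length() >= limit;
    }

    public static void main(String[] args) {
        // Stand-ins for real extraction results.
        String shortText = "short text";
        String cappedText = new String(new char[ASSUMED_DEFAULT_LIMIT]);

        System.out.println(looksTruncated(shortText, ASSUMED_DEFAULT_LIMIT));
        System.out.println(looksTruncated(cappedText, ASSUMED_DEFAULT_LIMIT));
    }
}
```

With the code above, one would pass in the content string returned by
tika.parseToString(input, metadata), or textHandler.toString() from the first
snippet. If the check comes back true, note that setMaxStringLength only
affects the Tika facade and has to be called on the same Tika instance before
parseToString. For the first snippet, I believe BodyContentHandler also has a
constructor that takes a write limit, with -1 disabling it, but please confirm
that against the Javadoc.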
