Hello,
I'm trying to parse a PDF file. I first tried this code:
InputStream input = new FileInputStream(new File(resourceLocation)); // the document to be parsed
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
ParseContext context = new ParseContext();
parser.parse(input, textHandler, metadata, context);
input.close();
Then I tried the Tika class:
Tika tika = new Tika();
InputStream input = new FileInputStream(new File(resourceLocation));
Metadata metadata = new Metadata();
String content = tika.parseToString(input, metadata);
Both snippets behave exactly the same way: they read some of the text in the
PDF file but leave the rest out. I tested this with a 1 MB file and a 100 KB
file.
I looked around and found a message on the Tika mailing list, "Tika
maxStringLength limit reached", which suggested raising maxStringLength like
this:
tika.setMaxStringLength(10*1024*1024);
That made no difference. Am I doing something wrong? How can I parse the entire file?
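
To make the limit behaviour concrete, here is a minimal stdlib-only sketch of how a capped text buffer cuts off extracted text. LimitedTextCollector, pageOfLength and the page sizes are hypothetical and are not Tika's API. The 100,000-character default and the use of -1 to lift the cap are assumptions based on my reading of the Tika javadoc for BodyContentHandler(int) and Tika.setMaxStringLength(int), and should be checked against the version in use.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a length-limited text sink; NOT part of Tika. */
public class LimitedTextCollector {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxLength; // assumption: -1 means "no limit"
    private boolean truncated = false;

    public LimitedTextCollector(int maxLength) {
        this.maxLength = maxLength;
    }

    /** Appends text and drops whatever does not fit under the limit. */
    public void append(String text) {
        if (maxLength < 0) {
            buffer.append(text);
            return;
        }
        int room = Math.max(maxLength - buffer.length(), 0);
        if (text.length() <= room) {
            buffer.append(text);
        } else {
            buffer.append(text, 0, room); // keep only the part that fits
            truncated = true;
        }
    }

    public int length() {
        return buffer.length();
    }

    public boolean isTruncated() {
        return truncated;
    }

    /** Builds a dummy "page" of n characters to simulate extracted text. */
    public static String pageOfLength(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    public static void main(String[] args) {
        // Assumed default cap, a raised cap as in the post, and no cap at all.
        int[] limits = {100_000, 10 * 1024 * 1024, -1};
        for (int limit : limits) {
            LimitedTextCollector collector = new LimitedTextCollector(limit);
            for (int page = 0; page < 3; page++) {
                collector.append(pageOfLength(60_000)); // 180,000 chars in total
            }
            System.out.println("limit=" + limit
                    + " kept=" + collector.length()
                    + " truncated=" + collector.isTruncated());
        }
    }
}
```

If the real limits work as assumed here, the matching change would be new BodyContentHandler(-1) in the first snippet, or making sure tika.setMaxStringLength(...) is called before tika.parseToString(...) in the second.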
Cheers,
ehsan