On 04/01/13 20:00, Jack Park wrote:
A two-column scientific paper.
The PDF parser has a few options that can be set to control how some
aspects of the parsing are done. Sorting text by position is one of
them. It makes the parsing take a little longer, but will often
improve accuracy on complicated PDFs, layout-heavy PDFs, and PDFs
where the order of text in the file doesn't match the layout order.
You may wish to try playing with those options and see if it helps in your case.
Code used is this:
// Let Tika pick the parser from the detected file type
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
File f = new File("volume_73_part_3_p451-457.pdf");
TikaInputStream tis = TikaInputStream.get(f);
StringWriter writer = new StringWriter();
// Collect the extracted text into the StringWriter
WriteOutContentHandler handler = new WriteOutContentHandler(writer);
parser.parse(tis, handler, metadata, new ParseContext());
System.out.println(handler.toString());
If you know it's a problematic PDF, try creating the PDFParser directly
(not via autodetect), and set some of the options on it, e.g. setSortByPosition.
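
Something along these lines might be a starting point. It is only a rough sketch, assuming a Tika version where PDFParser exposes setSortByPosition(boolean) and with the Tika parser jars on the classpath. The class name SortedPdfText and the helpers newSortingParser and extract are made up for illustration; the PDF path is taken from the command line.

```java
import java.io.File;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.WriteOutContentHandler;

// Hypothetical helper class, not part of Tika.
public class SortedPdfText {

    // Build the PDF parser directly (no autodetection) and ask it to
    // order the extracted text by position on the page.
    static PDFParser newSortingParser() {
        PDFParser parser = new PDFParser();
        parser.setSortByPosition(true);
        return parser;
    }

    // Parse one PDF and return the extracted plain text.
    static String extract(File pdf) throws Exception {
        StringWriter writer = new StringWriter();
        WriteOutContentHandler handler = new WriteOutContentHandler(writer);
        try (InputStream in = TikaInputStream.get(pdf)) {
            newSortingParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumes the path to the PDF is passed as the first argument.
        System.out.println(extract(new File(args[0])));
    }
}
```

Running it once with the option on and once with it off on the same two-column paper should show whether position sorting changes the order of the extracted text for that file.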
Nick