Re: PDF parse failing to capture entire text

Nick Burch Thu, 10 Jan 2013 04:28:05 -0800

On 04/01/13 20:00, Jack Park wrote:

A two-column scientific paper.


The PDF parser has a few options that control how some aspects of the
parsing are done. Sorting text by position is one of them: it makes
parsing take a little longer, but will often improve accuracy on
complicated PDFs, layout-heavy PDFs, PDFs where the order of text in
the file doesn't match the layout order, etc. You may wish to try
playing with those options and see if it helps in your case.

Code used is this:

    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    File f = new File("volume_73_part_3_p451-457.pdf");
    TikaInputStream tis = TikaInputStream.get(f);
    StringWriter writer = new StringWriter();
    WriteOutContentHandler handler = new WriteOutContentHandler(writer);
    parser.parse(tis, handler, metadata, new ParseContext());
    System.out.println(handler.toString());


If you know it's a problematic PDF, try creating the PDFParser directly
(rather than going through AutoDetectParser), and set some of the
options on it, e.g. setSortByPosition.
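
As a rough sketch of that suggestion (not tested against the PDF in
question), something like the following should work. The class name
SortedPdfText and the extract helper are made up for illustration, and
it assumes the Tika core and PDF parser jars are on the classpath.
BodyContentHandler is swapped in for the WriteOutContentHandler used
above only for brevity; the -1 turns off its default write limit.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for illustration only.
public class SortedPdfText {

    // Extracts plain text from a PDF, optionally sorting text by position.
    static String extract(File pdf, boolean sortByPosition)
            throws IOException, SAXException, TikaException {
        // Create PDFParser directly (no AutoDetectParser) so its options can be set.
        PDFParser parser = new PDFParser();
        parser.setSortByPosition(sortByPosition);

        // -1 disables the handler's default limit on the number of characters written.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = TikaInputStream.get(pdf)) {
            parser.parse(in, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Falls back to the file name from the original report if no argument is given.
        File pdf = new File(args.length > 0 ? args[0] : "volume_73_part_3_p451-457.pdf");
        System.out.println(extract(pdf, true));
    }
}
```

Running it once with sortByPosition set to true and once with false,
then comparing the two outputs, is a quick way to see whether the
option makes any difference for a given file.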

Nick
