Thanks Nick. I shall look into your suggestions; they had not occurred to me before. What I actually did after starting this thread was to expand the unit test to cover many different kinds of documents. So far, the particular document mentioned here is the only one on which parsing stumbled. Other multi-column PDFs parsed well. PDFs which are scanned images rather than text-based return an empty string.
I'd like to think that if I play with tuning variables, I might get that document to parse well; OTOH, I might be forced to conclude that document is a bad one, perhaps find its content elsewhere, or copy and paste from it into a console and send that to Solr.

Many thanks
Jack

On Thu, Jan 10, 2013 at 4:27 AM, Nick Burch <[email protected]> wrote:
> On 04/01/13 20:00, Jack Park wrote:
>>
>> A two-column scientific paper.
>
>
> The PDF parser has a few options that can be set, to control how some
> aspects of the parsing are done. Sorting text by position is one of
> them, which makes the parsing take a little longer, but will often
> improve accuracy on complicated pdfs, pdfs which are layout heavy, pdfs
> where the order of text in the file doesn't match the layout order etc.
> You may wish to try playing with those, and see if it helps for your case
>
>
>> Code used is this:
>>
>> Parser parser = new AutoDetectParser();
>> Metadata metadata = new Metadata();
>> File f = new File("volume_73_part_3_p451-457.pdf");
>> TikaInputStream tis = TikaInputStream.get(f);
>> StringWriter writer = new StringWriter();
>> WriteOutContentHandler handler = new
>> WriteOutContentHandler(writer);
>> parser.parse(tis,handler,metadata,new
>> ParseContext());
>> System.out.println(handler.toString());
>
>
> If you know it's a problematic PDF, try creating the PDFParser directly
> (not autodetect), and set some of the options on it, eg setSortByPosition
>
> Nick
>
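To make Nick's point about "the order of text in the file doesn't match the layout order" concrete, here is a toy sketch in plain Java. It is not Tika or PDFBox code: the `Fragment` class, the coordinates, and both helper methods are hypothetical, invented only for illustration. It shows text fragments that were written to a file out of reading order being put back in order by sorting on page position. A naive sort like this would interleave side-by-side columns line by line, so a real extractor has to do more than this for a two-column paper.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** A run of text plus the page coordinates of its top-left corner (hypothetical model). */
    static final class Fragment {
        final double x;    // distance from the left edge of the page
        final double y;    // distance from the top edge of the page
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins the fragment texts with single spaces, in exactly the order given. */
    static String joinInGivenOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Returns a sorted copy: top-to-bottom first, then left-to-right within a line. */
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        return sorted;
    }

    public static void main(String[] args) {
        // Fragments in "file order", which differs from the order on the page.
        List<Fragment> fileOrder = new ArrayList<>();
        fileOrder.add(new Fragment(50, 120, "second"));
        fileOrder.add(new Fragment(100, 120, "line"));
        fileOrder.add(new Fragment(50, 80, "Title"));
        fileOrder.add(new Fragment(90, 100, "line"));
        fileOrder.add(new Fragment(50, 100, "first"));

        // prints: second line Title line first
        System.out.println(joinInGivenOrder(fileOrder));
        // prints: Title first line second line
        System.out.println(joinInGivenOrder(sortByPosition(fileOrder)));
    }
}
```

The extra sorting pass is also why the option costs some parsing time, as Nick mentions. In Tika itself the switch he names is `setSortByPosition` on a directly created `PDFParser`.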
