Re: PDF parse failing to capture entire text

Jack Park Thu, 10 Jan 2013 12:08:42 -0800

Thanks Nick. I shall look into your suggestions; they had never occurred
to me before. What I did after starting this thread was to expand the
unit test to cover many different kinds of documents. So far, the
document mentioned here is the only one on which parsing stumbled.
Other multi-column PDFs parsed well. PDFs that are scanned images
rather than text return an empty string.


I'd like to think that if I play with the tuning variables, I might get
that document to parse well; OTOH, I might be forced to conclude that
the document is a bad one and either find its content elsewhere or copy
and paste from it into a console and send that to Solr.

Many thanks
Jack

On Thu, Jan 10, 2013 at 4:27 AM, Nick Burch <[email protected]> wrote:
> On 04/01/13 20:00, Jack Park wrote:
>>
>> A two-column scientific paper.
>
>
> The PDF parser has a few options that can be set, to control how some
> aspects of the parsing are done. Sorting text by position is one of
> them, which makes the parsing take a little longer, but will often
> improve accuracy on complicated PDFs, PDFs which are layout heavy, PDFs
> where the order of text in the file doesn't match the layout order, etc.
> You may wish to try playing with those, and see if it helps for your case.
>
>
>> Code used is this:
>>
>>                       Parser parser = new AutoDetectParser();
>>                       Metadata metadata = new Metadata();
>>                       File f = new File("volume_73_part_3_p451-457.pdf");
>>                       TikaInputStream tis = TikaInputStream.get(f);
>>                       StringWriter writer = new StringWriter();
>>                       WriteOutContentHandler handler = new WriteOutContentHandler(writer);
>>                       parser.parse(tis, handler, metadata, new ParseContext());
>>                       System.out.println(handler.toString());
>
>
> If you know it's a problematic PDF, try creating the PDFParser directly
> (not autodetect), and set some of the options on it, e.g. setSortByPosition
>
> Nick
>
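
For reference, Nick's suggestion could be sketched roughly as below. This
is only a minimal sketch, not code from the thread: the class name
SortedPdfExtract and the helpers newSortingParser and extractText are
hypothetical, and it assumes a Tika release in which PDFParser exposes
setSortByPosition directly and TikaInputStream.get accepts a Path (in other
releases the option may have to be set through a PDFParserConfig instead).
The handler is the same WriteOutContentHandler used in the original code.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.WriteOutContentHandler;

public class SortedPdfExtract {

    // Hypothetical helper: creates the PDF parser directly (no autodetection)
    // and switches on sorting of text by position on the page.
    static PDFParser newSortingParser() {
        PDFParser parser = new PDFParser();
        parser.setSortByPosition(true);
        return parser;
    }

    // Hypothetical helper: parses one PDF and returns the extracted text.
    static String extractText(Path pdf) throws Exception {
        StringWriter writer = new StringWriter();
        WriteOutContentHandler handler = new WriteOutContentHandler(writer);
        // try-with-resources closes the stream even if parsing fails
        try (InputStream in = TikaInputStream.get(pdf)) {
            newSortingParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Falls back to the file name from the thread when no argument is given.
        String name = args.length > 0 ? args[0] : "volume_73_part_3_p451-457.pdf";
        System.out.println(extractText(Paths.get(name)));
    }
}
```

Running the unit test once with this parser and once with AutoDetectParser
on the same file should show whether position sorting makes any difference
for the two-column paper. Image-only PDFs would presumably still return an
empty string, since the option only changes the order of text that is
already present in the file.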
