Re: PDF parse failing to capture entire text

Jack Park Fri, 04 Jan 2013 14:59:03 -0800

Some further information.
First, this work uses Tika 1.2.

In further tests with different files, both single-column and
two-column, e.g. this one:
http://arxiv.org/ftp/arxiv/papers/0705/0705.1886.pdf


parsing went well.  It appears that a PDF which is a scanned document,
rather than a conversion from original text, returns nothing.

This further testing suggests that the issue might relate to the
file itself, perhaps something in how it is encoded.
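
To separate the "returns nothing" case from the truncated-line case when
testing more files, a small check like the one below could be run on
whatever string the parser hands back. This is only a sketch: the class
name ExtractionCheck and the method looksImageOnly are made up for
illustration and are not part of Tika or PDFBox, and "no letters or
digits at all" is an assumed heuristic for an image-only scan, not a
definitive test.

```java
public class ExtractionCheck {

    /**
     * Hypothetical helper (not a Tika API): returns true when the
     * extracted text is null or contains no letter or digit, which is
     * what one would expect from a PDF holding only scanned page images.
     */
    static boolean looksImageOnly(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        for (int i = 0; i < extractedText.length(); i++) {
            if (Character.isLetterOrDigit(extractedText.charAt(i))) {
                return false; // at least some real text came through
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Whitespace-only output, as a scanned file might produce.
        System.out.println(looksImageOnly("  \n\n "));
        // A truncated but non-empty line from the two-column paper.
        System.out.println(looksImageOnly("The effect of m"));
    }
}
```

In the code quoted below, the argument would be the result of
handler.toString(). A file that fails this check but still loses
characters, like the two-column paper, would point at a layout problem
rather than a missing text layer.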

Thanks in advance for comments.
Jack

On Fri, Jan 4, 2013 at 12:00 PM, Jack Park <[email protected]> wrote:
> A two-column scientific paper. One column reads:
>
> The effect of muscle a-tocopherol concentration
> (induced by dietary treatment) on TBARS at different
> storage times was evaluated (Figure 2). There was a
> linear effect (P < 0·001) of muscle a-tocopherol
> concentration on TBARS on day 0, but a linear plus
> quadratic effect on the following days (P < 0·001).
> Also in this case the linear plus quadratic effect
> indicated an exponential response, which was fitted
> in each case as follows:
>
>
> The parser (code below) returns this:
>
> The effect of m
> (induced by dietar
> storage times was
> linear effect (P <
> concentration on T
> quadratic effect o
> Also in this case
> indicated an expo
> in each case as foll
>
>
> On some lines of parsing, characters at the left are missing, as if
> the parser started after the beginning of the text, case in point:
>
> ted storage (L = linear effect, P < 0·001;
>  P< 0·001). The data were adjusted to a
> l equation (solid line) as indicated in
>
> is the fragment extracted from:
>
> Figure 2 Relationship between a-tocopherol concentration
> and lipid oxidation (assessed by the concentration of
> thiobarbituric acid reactive substances, TBARS, mg
> malonaldehyde per kg muscle) in longissimus lumborum
> muscle of Manchego lambs after 0 (u), 3 (n), 6 (s) and 9
> (l) days of refrigerated storage (L = linear effect, P< 0·001;
> Q = quadratic effect, P<0·001). The data were adjusted to a
> linear or exponential equation (solid line) as indicated in
> the text.
>
> The paper itself is found by following the link from here:
> http://openagricola.nal.usda.gov/Record/IND23271089
>
> (I will send the file offlist if needed; it's 64k)
>
> Code used is this:
>
>     Parser parser = new AutoDetectParser();
>     Metadata metadata = new Metadata();
>     File f = new File("volume_73_part_3_p451-457.pdf");
>     TikaInputStream tis = TikaInputStream.get(f);
>     StringWriter writer = new StringWriter();
>     WriteOutContentHandler handler = new WriteOutContentHandler(writer);
>     parser.parse(tis, handler, metadata, new ParseContext());
>     System.out.println(handler.toString());
>
> My questions are these:
>
> Can Tika (PdfBox) correctly parse multi-column content?
> What am I missing?
>
> Many thanks in advance.
> Jack
