Further information is available. First, this work uses Tika 1.2. In further tests with different files, both single-column and two-column, e.g. this one: http://arxiv.org/ftp/arxiv/papers/0705/0705.1886.pdf
parsing went well. It appears that a PDF which is a scanned document, rather than a conversion from original text, returns nothing. This further testing suggests that the issue might relate to the file itself, perhaps something in how it is encoded.

Thanks in advance for comments.
Jack

On Fri, Jan 4, 2013 at 12:00 PM, Jack Park <[email protected]> wrote:
> A two-column scientific paper. One column reads:
>
> The effect of muscle a-tocopherol concentration
> (induced by dietary treatment) on TBARS at different
> storage times was evaluated (Figure 2). There was a
> linear effect (P < 0·001) of muscle a-tocopherol
> concentration on TBARS on day 0, but a linear plus
> quadratic effect on the following days (P < 0·001).
> Also in this case the linear plus quadratic effect
> indicated an exponential response, which was fitted
> in each case as follows:
>
>
> The parser (code below) returns this:
>
> The effect of m
> (induced by dietar
> storage times was
> linear effect (P <
> concentration on T
> quadratic effect o
> Also in this case
> indicated an expo
> in each case as foll
>
>
> On some lines of parsing, characters at the left are missing, as if
> the parser started after the beginning of the text, case in point:
>
> ted storage (L = linear effect, P < 0·001;
> P< 0·001). The data were adjusted to a
> l equation (solid line) as indicated in
>
> is the fragment extracted from:
>
> Figure 2 Relationship between a-tocopherol concentration
> and lipid oxidation (assessed by the concentration of
> thiobarbituric acid reactive substances, TBARS, mg
> malonaldehyde per kg muscle) in longissimus lumborum
> muscle of Manchego lambs after 0 (u), 3 (n), 6 (s) and 9
> (l) days of refrigerated storage (L = linear effect, P< 0·001;
> Q = quadratic effect, P<0·001). The data were adjusted to a
> linear or exponential equation (solid line) as indicated in
> the text.
>
> The paper itself is found by following the link from here:
> http://openagricola.nal.usda.gov/Record/IND23271089
>
> (I will send the file offlist if needed; it's 64k)
>
> Code used is this:
>
> Parser parser = new AutoDetectParser();
> Metadata metadata = new Metadata();
> File f = new File("volume_73_part_3_p451-457.pdf");
> TikaInputStream tis = TikaInputStream.get(f);
> StringWriter writer = new StringWriter();
> WriteOutContentHandler handler = new WriteOutContentHandler(writer);
> parser.parse(tis,handler,metadata,new ParseContext());
> System.out.println(handler.toString());
>
> My questions are these:
>
> Can Tika (PdfBox) correctly parse multi-column content?
> What am I missing?
>
> Many thanks in advance.
> Jack
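
In case it helps anyone reproduce the pattern: the quoted fragments show two kinds of damage, lines cut off on the right ("The effect of m") and lines cut off on the left ("l equation (solid line) as indicated in"). Below is a minimal sketch, in plain Java with no Tika or PDFBox calls, of how an extracted line could be compared against the expected line to tell the two apart. The class, enum and method names (TruncationCheck, Kind, squeeze, classify) are made up for illustration and are not part of any library. Whitespace is ignored on the assumption that spacing differences such as "P<" versus "P <" are not significant.

```java
// Hypothetical diagnostic helper; not part of Tika or PDFBox.
final class TruncationCheck {

    // How an extracted line relates to the line expected from the PDF.
    enum Kind { EXACT, RIGHT_CUT, LEFT_CUT, BOTH_CUT, NO_MATCH }

    // Remove all whitespace so that spacing differences are ignored.
    static String squeeze(String s) {
        return s.replaceAll("\\s+", "");
    }

    static Kind classify(String expected, String extracted) {
        String full = squeeze(expected);
        String part = squeeze(extracted);
        if (part.isEmpty()) return Kind.NO_MATCH;             // nothing was extracted
        if (full.equals(part)) return Kind.EXACT;             // whole line came through
        if (full.startsWith(part)) return Kind.RIGHT_CUT;     // end of the line is missing
        if (full.endsWith(part)) return Kind.LEFT_CUT;        // start of the line is missing
        if (full.contains(part)) return Kind.BOTH_CUT;        // both ends are missing
        return Kind.NO_MATCH;
    }

    public static void main(String[] args) {
        // Line pairs taken from the message quoted above.
        System.out.println(classify(
                "The effect of muscle a-tocopherol concentration",
                "The effect of m"));
        System.out.println(classify(
                "linear or exponential equation (solid line) as indicated in",
                "l equation (solid line) as indicated in"));
    }
}
```

Running this over the expected and extracted lines of a page would show whether the cuts fall consistently on one side, which might help narrow down whether the column boundaries are being misjudged.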
