Re: Apache Tika issue

Tilman Hausherr Sun, 27 Dec 2020 05:47:30 -0800

Hi,

If it works with PDFBox but not with Tika, then it is related to a change in PDFBox, probably this one:

https://issues.apache.org/jira/browse/PDFBOX-5002


You could try a Tika 1.26 snapshot:

https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.26-SNAPSHOT/

Tilman


On 27.12.2020 at 14:36, sofien benharchache wrote:

Hi,
Thanks for the help and for answering so quickly! Much appreciated. It works now with PDFBox. Changing the config file was not sufficient. Still, I wanted to use Apache Tika for the parsing because I sometimes deal with other formats. Do you have any further ideas for getting similar results with Apache Tika?
The lines are indeed very close to each other.

Thanks!
On 27 Dec 2020 at 05:35, Tilman Hausherr <[email protected]> wrote:
Check whether the flag has any effect on other PDFs. If not, then there is a mistake in how the option is set.
Here's a config.xml; the option is set differently from how you did it:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
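
A minimal sketch of how this config could be used from the tika-python client mentioned in the original question. It writes the XML above to a file, reads the PDF parser parameters back as a sanity check, and passes the path through `config_path`, the same keyword used in the original call. The file name `tika-config.xml` and the helpers `write_config`, `pdf_parser_params` and `extract_text` are hypothetical names chosen for illustration, and whether the Tika server actually honours the file depends on the tika-python and Tika versions in use.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Same content as the config.xml shown above.
TIKA_CONFIG = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def write_config(directory):
    """Write the config into `directory` and return the file path."""
    ET.fromstring(TIKA_CONFIG)  # fail early if the XML is malformed
    path = Path(directory) / "tika-config.xml"  # hypothetical file name
    path.write_text(TIKA_CONFIG, encoding="utf-8")
    return path


def pdf_parser_params(config_path):
    """Return the <param> entries of the config as a name -> value dict."""
    root = ET.parse(config_path).getroot()
    return {p.get("name"): p.text for p in root.iter("param")}


def extract_text(pdf_path, config_path):
    """Parse a PDF through Tika; needs the tika package and a Java runtime."""
    from tika import parser  # imported lazily so the helpers above work without it

    parsed = parser.from_file(str(pdf_path), config_path=str(config_path))
    return parsed.get("content")


if __name__ == "__main__":
    cfg = write_config(tempfile.mkdtemp())
    print(pdf_parser_params(cfg))
```

Printing the parameters before calling `extract_text` makes it easier to tell a config that was never written correctly from one that the server ignores.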

Second thing: try with PDFBox directly, download pdfbox-app from
https://pdfbox.apache.org/download.html

and then run
java -jar pdfbox-app-2.0.22.jar ExtractText -sort XXXX.pdf
Third possibility: the lines are very close to each other. Is your PDF like that?
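
To check whether sorting makes a difference for a given file, the PDFBox command above can be run twice from Python, once with `-sort` and once without, and the two outputs compared. This is only a sketch: the helper names `build_command` and `extract_both` and the output file names are hypothetical, and it assumes `java` is on the PATH and that the jar has been downloaded locally.

```python
import subprocess
from pathlib import Path


def build_command(jar, pdf, out_txt, sort):
    """Build the ExtractText command line, optionally with -sort."""
    cmd = ["java", "-jar", str(jar), "ExtractText"]
    if sort:
        cmd.append("-sort")
    # ExtractText takes the input PDF followed by an optional output text file.
    cmd += [str(pdf), str(out_txt)]
    return cmd


def extract_both(jar, pdf):
    """Run ExtractText with and without -sort; return both texts keyed by the flag."""
    pdf = Path(pdf)
    outputs = {}
    for sort in (False, True):
        suffix = "-sorted.txt" if sort else "-unsorted.txt"  # hypothetical names
        out_txt = pdf.with_name(pdf.stem + suffix)
        subprocess.run(build_command(jar, pdf, out_txt, sort), check=True)
        outputs[sort] = out_txt.read_text(encoding="utf-8", errors="replace")
    return outputs


if __name__ == "__main__":
    print(" ".join(build_command("pdfbox-app-2.0.22.jar", "XXXX.pdf", "XXXX.txt", True)))
```

If the two outputs are identical, the sort flag is not what changes the ordering for that file.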
Tilman


On 26.12.2020 at 23:18, Tim Allison wrote:
On Sat, Dec 26, 2020 at 12:54 PM sofien benharchache <[email protected]> wrote:
    Hello,

    I am using Apache Tika with Python to extract text from PDF. I
    have a problem in extracting the content of PDF files. The order
    of the text is sometimes messed up.

    I have some PDF files containing free-form text. Some lines are
    in the form of two columns. One column represents a year and the
    other represents a description associated with the year.

    Let’s say :
    dateA   description A
    dateB   description B

    For example, here is an extract of one file :

    I can’t provide the whole file, as the data is not meant to be
    shared.

    I expect Apache Tika to extract content in the form :
    dateA descriptionA dateB descriptionB.

    But the output is the following :
    dateA dateB descriptionA descriptionB

    I included this property in my configuration file :
    <property name="sortByPosition" value="true"/>

    then this code
    parsed = parser.from_file('/path/to/file',
    config_path='/my/path/tika.config')

    But it doesn’t change the output.

    Do you have any idea how to resolve this issue?

    Thanks,
