Check whether the flag has any effect on other PDFs. If not, then the
option is not being set correctly.
Here's a config.xml; note that the option is set differently from the
way you did it:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
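Since you are calling Tika from Python, here is a minimal sketch of how the config above could be written to disk, sanity-checked, and handed to tika-python. The helper names (pdf_parser_params, write_config, extract_text) and the file name are made up for illustration. The check only inspects the XML; it does not prove that the Tika server actually applies the setting.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"

# Same config as above, kept as a string so it can be validated before use.
TIKA_CONFIG = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def pdf_parser_params(config_text):
    """Return the <param> entries declared for PDFParser (hypothetical helper)."""
    root = ET.fromstring(config_text)  # raises ParseError on malformed XML
    node = root.find(f"./parsers/parser[@class='{PDF_PARSER}']")
    if node is None:
        return {}
    return {p.get("name"): p.text for p in node.iter("param")}


def write_config(path):
    """Write the config to `path` after checking that sortByPosition is set."""
    if pdf_parser_params(TIKA_CONFIG).get("sortByPosition") != "true":
        raise ValueError("sortByPosition is not enabled for PDFParser")
    Path(path).write_text(TIKA_CONFIG, encoding="utf-8")
    return str(path)


def extract_text(pdf_path, config_path):
    """Parse a PDF through tika-python using the given config file."""
    from tika import parser  # imported lazily so the checks above run without tika
    parsed = parser.from_file(pdf_path, config_path=config_path)
    return parsed["content"]
```

Usage would be something like extract_text('/path/to/file', write_config('tika-config.xml')). Note that pdf_parser_params returns an empty dict for a file that only contains a top-level property element, which is a quick way to spot that the option is not where this config puts it.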
Second thing: try with PDFBox directly. Download pdfbox-app from
https://pdfbox.apache.org/download.html
and then run
java -jar pdfbox-app-2.0.22.jar ExtractText -sort XXXX.pdf
Third possibility: the lines are very close to each other. Is your PDF
like that?
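To illustrate why closely spaced lines could matter, here is a toy model of position-based sorting. It is not PDFBox's actual code. It only assumes that two text fragments count as being on the same line when their vertical extents overlap, and that fragments on one line are then ordered by x. The Fragment class, the coordinates, and the function names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float       # left edge of the fragment
    top: float     # top of the glyph box, y growing down the page
    height: float  # height of the glyph box


def same_line(a, b):
    """True if the vertical extents of the two fragments overlap."""
    return a.top < b.top + b.height and b.top < a.top + a.height


def sort_by_position(fragments):
    """Group fragments into lines by vertical overlap, then order each line by x."""
    lines = []
    for frag in sorted(fragments, key=lambda f: f.top):
        if lines and same_line(lines[-1][-1], frag):
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [f.text for line in lines for f in sorted(line, key=lambda f: f.x)]


def two_rows(line_gap):
    """Two rows of (date, description), `line_gap` apart, glyph height 10."""
    return [
        Fragment("dateA", 50, 100, 10),
        Fragment("descriptionA", 150, 100, 10),
        Fragment("dateB", 50, 100 + line_gap, 10),
        Fragment("descriptionB", 150, 100 + line_gap, 10),
    ]


well_spaced = sort_by_position(two_rows(line_gap=20))
too_close = sort_by_position(two_rows(line_gap=5))
```

In this model, well_spaced comes out row by row (dateA, descriptionA, dateB, descriptionB), while too_close merges both rows into one line and comes out column by column (dateA, dateB, descriptionA, descriptionB), which is the order reported in the original message.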
Tilman
On 26.12.2020 at 23:18, Tim Allison wrote:
On Sat, Dec 26, 2020 at 12:54 PM sofien benharchache
<[email protected]> wrote:
Hello,
I am using Apache Tika with Python to extract text from PDF. I
have a problem in extracting the content of PDF files. The order
of the text is sometimes messed up.
I have some PDF files containing free-form text. Some lines are in
the form of two columns. One column represents a year and the
other represents a description associated with the year.
Let’s say:
dateA description A
dateB description B
For example, here is an extract of one file:
I can’t provide the whole file, as the data is not meant to be shared.
I expect Apache Tika to extract content in the form:
dateA descriptionA dateB descriptionB.
But the output is the following:
dateA dateB descriptionA descriptionB
I included this property in my configuration file:
<property name="sortByPosition" value="true"/>
and then ran this code:
from tika import parser
parsed = parser.from_file('/path/to/file',
                          config_path='/my/path/tika.config')
But it doesn’t change the output.
Do you have any idea how to resolve this issue?
Thanks,