On Sat, Dec 26, 2020 at 12:54 PM sofien benharchache <
[email protected]> wrote:
> Hello,
>
> I am using Apache Tika with Python to extract text from PDF files, and
> the extracted text sometimes comes out in the wrong order.
>
> I have some PDF files containing free-form text. Some lines are laid out
> in two columns: one column holds a year and the other holds a
> description associated with that year.
>
> For example:
> dateA descriptionA
> dateB descriptionB
>
> For example, here is an extract of one file :
>
> I can’t provide the whole file, as the data is not meant to be shared.
>
> I expect Apache Tika to extract the content in this order:
> dateA descriptionA dateB descriptionB
>
> But the output is the following:
> dateA dateB descriptionA descriptionB
>
> I included this property in my configuration file:
> <property name="sortByPosition" value="true"/>
>
> and then ran this code:
> parsed = parser.from_file('/path/to/file',
>                           config_path='/my/path/tika.config')
>
> But it doesn’t change the output.
>
> Do you have any idea how to resolve this issue?
>
> Thanks,
>
>
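
One thing that may be worth checking is the shape of the config file. As far
as I know, Tika expects sortByPosition as a <param> nested under the PDF
parser entry rather than as a top-level <property>. Below is a minimal
sketch under that assumption. The helper names build_tika_config and
extract_sorted are hypothetical, made up for this illustration, and I have
not run this against your files. Note also that the tika Python client
appears to pass config_path to the server only when it starts the server
itself, so a server that is already running may ignore it.

```python
import xml.etree.ElementTree as ET

PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"
DEFAULT_PARSER = "org.apache.tika.parser.DefaultParser"


def build_tika_config(sort_by_position=True):
    """Return Tika config XML that sets sortByPosition on the PDF parser."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")

    # Keep the default parsers for other formats, but exclude the PDF parser
    # so that the explicitly configured one below is used instead.
    default = ET.SubElement(parsers, "parser", {"class": DEFAULT_PARSER})
    ET.SubElement(default, "parser-exclude", {"class": PDF_PARSER})

    # PDF parser entry carrying the sortByPosition parameter.
    pdf = ET.SubElement(parsers, "parser", {"class": PDF_PARSER})
    params = ET.SubElement(pdf, "params")
    param = ET.SubElement(
        params, "param", {"name": "sortByPosition", "type": "bool"}
    )
    param.text = "true" if sort_by_position else "false"

    body = ET.tostring(properties, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def extract_sorted(pdf_path, config_path):
    """Write the config to config_path, then parse pdf_path with it.

    Hypothetical helper: needs the tika package and a Java runtime.
    """
    from tika import parser  # imported lazily; not needed to build the XML

    with open(config_path, "w", encoding="utf-8") as fh:
        fh.write(build_tika_config())
    parsed = parser.from_file(pdf_path, config_path=config_path)
    return parsed["content"]
```

Usage would be something like extract_sorted('/path/to/file',
'/my/path/tika.config'). Even with the parameter applied, sorting by
position is only a heuristic, so it may or may not produce the
"dateA descriptionA dateB descriptionB" order for a given layout.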