[jira] [Resolved] (TIKA-1863) --text-main content missing in output file

Tim Allison (JIRA) Mon, 22 Feb 2016 06:00:32 -0800

     [ https://issues.apache.org/jira/browse/TIKA-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Allison resolved TIKA-1863.
-------------------------------
    Resolution: Won't Fix

{{--text-main}} uses the {{BoilerpipeContentHandler}}, which tries to determine 
what the "main content" of a document is. It is designed mainly to remove 
advertising, links, and other noise from HTML documents.

I confirmed that Boilerpipe categorizes some of the first paragraphs and the 
last paragraph (from note 15 through "624 KPK") as "not content."
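
To illustrate how a main-content filter can drop text at the start and end of 
a document, here is a toy sketch. It is not Boilerpipe's actual algorithm, 
which I have not reproduced here; it only assumes, for illustration, that 
blocks with few words are treated as noise. The class {{ToyContentFilter}}, 
its methods, the threshold, and the sample text are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: this is NOT Boilerpipe's algorithm.
// Assumption: a block with fewer than minWords words counts as "not content".
class ToyContentFilter {

    // Hypothetical helper: counts whitespace-separated words in a block.
    static int wordCount(String block) {
        String trimmed = block.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Keeps only the blocks that have at least minWords words.
    static List<String> keepMainContent(List<String> blocks, int minWords) {
        List<String> kept = new ArrayList<>();
        for (String block : blocks) {
            if (wordCount(block) >= minWords) {
                kept.add(block);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample blocks: a short heading, a body paragraph,
        // and a short trailing footnote.
        List<String> blocks = Arrays.asList(
            "Title page",
            "This long body paragraph has plenty of words and so it is kept.",
            "15 Short footnote");
        for (String block : keepMainContent(blocks, 5)) {
            System.out.println(block);
        }
    }
}
```

With a threshold of 5 words, only the body paragraph survives: the short 
heading and the short footnote are both discarded, even though a reader would 
consider them part of the document.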

At the general Tika level, we don't control what Boilerpipe does, and I'm not 
aware of a method to alter its algorithm for determining content vs. not 
content.

In short, I don't think we can fix this.  We can recommend using the other 
extraction options mentioned in an earlier comment.


> --text-main content missing in output file
> ------------------------------------------
>
>                 Key: TIKA-1863
>                 URL: https://issues.apache.org/jira/browse/TIKA-1863
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.12
>         Environment: Windows 10 64
>            Reporter: Marcin Gil
>
> When converting both PDF and DOC files to text with the following command:
> java -jar tika.jar --text-main --encoding=UTF-8 input.pdf > output.txt
> the output file is missing a varying number of the first and last lines of 
> the input file. 
> Example file:
> https://dl.dropboxusercontent.com/u/11435743/tika-issue-1.pdf
> Text starting from "15 Akt oskarżenia" is missing (at the bottom of the file).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
