[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330882#comment-17330882 ] David Pilato commented on TIKA-3364: Oh my god! I'm feeling stupid. Anyway, I was not able to choose

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330851#comment-17330851 ] Tim Allison commented on TIKA-3364: try {{pdfParser.setExtractBookmarksText(false);}} > PDF Content is

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330827#comment-17330827 ] Nick Burch commented on TIKA-3364: I'm not sure if we already have outlines/bookmarks elsewhere in other

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330824#comment-17330824 ] David Pilato commented on TIKA-3364: So I tried this: {code:java} PDFParser pdfParser =

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330810#comment-17330810 ] Tim Allison commented on TIKA-3364: We should probably add extra markup in the xhtml to identify the

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330809#comment-17330809 ] Tim Allison commented on TIKA-3364: You can see the text under the {{Outlines}} node. > PDF Content is

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330805#comment-17330805 ] Tim Allison commented on TIKA-3364: {noformat} Dummy PDF file {noformat} > PDF Content is

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330799#comment-17330799 ] Tim Allison commented on TIKA-3364: The PDF contains bookmark text, which is what is triggering the .