[jira] [Created] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain
Tim Allison created TIKA-1124:

Summary: Nested documents not extracted if a PDF file is in the chain
Key: TIKA-1124
URL: https://issues.apache.org/jira/browse/TIKA-1124
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Minor

Tika 1.3 is not able to get attachments from the attached PDF. The trunk is able to get attachments from the PDF. However, if that PDF is then embedded in another document, the docs embedded in the PDF are not extracted.

I'm not sure of a solution, but I found two things that might help with the diagnosis:
1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler, everything works (in trunk).
2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the problem is also solved.
The cause may be in the MatchingContentHandler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1124:

Attachment: pdf_attachment_issues.zip

outer.docx contains the attached.pdf, which itself contains an attachment. Toy examples of avoiding the use of MatchingContentHandler also attached.
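To make the idea of a body-only filter that does not rely on a path matcher concrete, here is a minimal sketch. It is not the toy SimpleBodyMatchingContentHandler from the attachment and it does not use any Tika class; the name BodyOnlyText and the depth-counter approach are assumptions made for illustration, built only on the SAX API that ships with the JDK. It collects character data whenever at least one body element is open, so text from a nested body is kept as well.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration: keeps text that occurs inside any open body element. */
final class BodyOnlyText extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // number of currently open body elements

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Parses a well-formed XHTML string and returns the text found inside body elements. */
    static String extract(String xhtml) {
        BodyOnlyText handler = new BodyOnlyText();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String nested = "<html><head><title>t</title></head>"
                + "<body>outer <html><body>inner</body></html> tail</body></html>";
        System.out.println(extract(nested));
    }
}
```

Whether a counter like this would behave correctly inside Tika's handler chain is untested here; it only shows the shape of a filter that tracks nesting explicitly.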
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676495#comment-13676495 ]

Tim Allison commented on TIKA-1130:

I've submitted a patch to POI for this (https://issues.apache.org/bugzilla/show_bug.cgi?id=54849). I haven't gotten any feedback after my initial trivial fix. The issue is that SDT/content controls can stand alone as the equivalent of a paragraph or table. POI isn't currently picking those up.

.docx text extract leaves out some portions of text

Key: TIKA-1130
URL: https://issues.apache.org/jira/browse/TIKA-1130
Project: Tika
Issue Type: Bug
Affects Versions: 1.2, 1.3
Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
Attachments: Resume 6.4.13.docx

When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted. I have attached a .docx file that can be tested against. The 'gray' portions of text are what are not extracted, while the darker colored text extracts fine. Looking at the document.xml portion of the .docx zip file shows the text is all there.
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677957#comment-13677957 ]

Tim Allison commented on TIKA-1130:

I'll try to submit the Tika portion of the POI-54849 patch by early next week in case anyone wants to apply both patches at home.
[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682628#comment-13682628 ]

Tim Allison commented on TIKA-1132:

Tika gui took longer than I was willing to wait, too. tika.parseToString() returned a value in about 30 seconds. As you both suggested, the fraction formatter was likely the culprit. I just submitted a patch to POI 54686.

Parsing some XLS documents hangs entire JVM, requires kill -9

Key: TIKA-1132
URL: https://issues.apache.org/jira/browse/TIKA-1132
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2, 1.3
Environment:
  Linux Suse: java version 1.7.0, Java(TM) SE Runtime Environment (build 1.7.0-b147), Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
  OSX 10.8.3: java version 1.7.0_06, Java(TM) SE Runtime Environment (build 1.7.0_06-b24), Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
Reporter: Ryan Krueger
Fix For: 1.1
Attachments: mod3.xlsx, mod.xls

Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM; a kill -9 is required.

We're running within an email server application, parsing documents to extract the text of all attachments. When we hit a message with the affected attachment, the entire JVM hangs, and we mark the message so that text extraction is skipped for it on the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application.

We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening, but not all.

In addition to experiencing the problem on our Linux servers, I have tested on OSX and experienced the same problems. Whether I run the Tika UI and select the affected file or run the CLI, the problem is the same.

Tested with:
java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every time. I have attached a document that triggers the error. I have tested with 1.2 and 1.3 with the same result. Running 1.1, the text is accurately extracted.
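As a rough illustration of the watchdog idea described in the report, the sketch below gives an extraction call a time budget inside the JVM. TimeboxedExtraction is a hypothetical helper, not a Tika API, and the task passed in stands for whatever call does the parsing. Note the limitation: cancelling the future only interrupts the worker, so a CPU-bound loop that ignores interrupts keeps running. For the hang described above, where even a regular kill does not stop the JVM, running the parser in a separate process is probably the more robust choice.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tika): bounds how long the caller waits for a result. */
final class TimeboxedExtraction {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns Optional.empty() if the task times out, fails, or the caller is interrupted.
     */
    static Optional<String> extract(Callable<String> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "extraction-worker");
            worker.setDaemon(true); // a stuck worker should not keep the JVM alive
            return worker;
        });
        Future<String> future = pool.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // only an interrupt; a busy loop may ignore it
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(() -> "extracted text", 1_000));
        System.out.println(extract(() -> {
            Thread.sleep(5_000); // stands in for a parse that takes too long
            return "late";
        }, 100));
    }
}
```

The caller gets control back after the timeout and can mark the message as skipped, but the worker thread may still be consuming CPU in the background.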
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692110#comment-13692110 ]

Tim Allison commented on TIKA-1130:

Nick, I think I have to make modifications to Tika to execute the new SDT components. Should my patch be to Tika trunk?

Components: parser
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692517#comment-13692517 ]

Tim Allison commented on TIKA-1130:

The Maven proxy setting in my settings.xml file is working for grabbing dependencies, but the proxy info isn't being transferred to testUrlOnly's url.openStream() in MimeDetectionTest. The proxy props appear correctly in the surefire-report for MimeDetectionTest, but the proxy settings are null when I insert this into testUrlOnly:

    System.out.println("HOST: " + System.getProperty("http.proxyHost"));
    System.out.println("PORT: " + System.getProperty("http.proxyPort"));

Will likely find the answer as soon as I post this...
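A small diagnostic along the same lines can make it easier to see what the test JVM actually received. The sketch below is an assumption-laden illustration, not code from MimeDetectionTest: ProxyFromProperties is a hypothetical helper that reads the standard http.proxyHost and http.proxyPort properties from a Properties object and turns them into a java.net.Proxy, falling back to a direct connection when no host is set.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Properties;

/** Hypothetical diagnostic helper: derives a Proxy from the standard JVM proxy properties. */
final class ProxyFromProperties {

    /** Returns Proxy.NO_PROXY when http.proxyHost is missing or empty. */
    static Proxy resolve(Properties props) {
        String host = props.getProperty("http.proxyHost");
        if (host == null || host.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        // 80 is the documented default for http.proxyPort
        int port = Integer.parseInt(props.getProperty("http.proxyPort", "80"));
        // createUnresolved avoids a DNS lookup while merely inspecting the settings
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        System.out.println("HOST: " + System.getProperty("http.proxyHost"));
        System.out.println("PORT: " + System.getProperty("http.proxyPort"));
        System.out.println("Resolved: " + resolve(System.getProperties()));
    }
}
```

If the properties turn out to be null inside the test, one option is to pass the resolved proxy explicitly with url.openConnection(proxy).getInputStream() instead of relying on url.openStream() picking up JVM-wide settings.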
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693068#comment-13693068 ]

Tim Allison commented on TIKA-973:

Will submit patch and tests by end of the week.

PDF form data isn't included in extracted content.

Key: TIKA-973
URL: https://issues.apache.org/jira/browse/TIKA-973
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor

When extracting content from PDFs, PDF form data isn't extracted. The following code extracts this data via PDFBox, but it seems like something Tika should be doing.

    PDDocumentCatalog docCatalog = load.getDocumentCatalog();
    if (docCatalog != null) {
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm != null) {
            @SuppressWarnings("unchecked")
            List<PDField> fields = acroForm.getFields();
            if (fields != null && fields.size() > 0) {
                documentContent.append(" ");
                for (PDField field : fields) {
                    if (field.getValue() != null) {
                        documentContent.append(field.getValue());
                        documentContent.append(" ");
                    }
                }
            }
        }
    }
[jira] [Updated] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1130:

Attachment: TIKA-1130.patch

Ray's initial test restored after POI-55142 was committed. Thank you, Nick!

Attachments: Resume 6.4.13.docx, TIKA-1130.patch, TIKA-1130.patch
[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:

Attachment: TIKA-973-patch.tar.gz

Patch attached. Dumps contents of pdf forms at end of document. AcroForm field name metadata is in attribute values. Basic format is <ol>. Let me know how this looks.

Thank you, Ben Litchfield, for org.apache.pdfbox.examples.fdf.PrintFields
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774 ]

Tim Allison commented on TIKA-973:

Agree on both. Also would appreciate feedback on what the output should be. The current code extracts this unseemly xhtml:

    <div class="acroform">
    <ol>
    <li partialName="form1[0]" fullName="form1[0]"/>
    <ol>
    <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
    <li partialName="MiddleInitial[0]" fullName="form1[0].#subform[6].MiddleInitial[0]" altName="Enter Middle Initial (MI)">X</li>
    <li partialName="FamilyName[0]" fullName="form1[0].#subform[6].FamilyName[0]" altName="Section 1. Employee Information and Attestation. Family Name (Last Name)">Doe</li>
    <li partialName="GivenName[0]" fullName="form1[0].#subform[6].GivenName[0]" altName="Given Name (First Name)">John</li>
    <li partialName="OtherNamesUsed[0]" fullName="form1[0].#subform[6].OtherNamesUsed[0]" altName="Maiden Name">Mr. Doe</li>
    <li partialName="StreetNumberName[0]" fullName="form1[0].#subform[6].StreetNumberName[0]" altName=" Street Number and Name">123 Main St.</li>
    ...

Another idea I had was to include the partialName in the contents and not fill out the attrs:

    <li>StreetNumberName[0]: 123 Main St</li>

More unit tests on way...
[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:

Attachment: i-9_screenshot.png

Screenshot attached. Thanks again to http://benlitchfield.sys-con.com/node/48543?page=0,1 for the code example and example doc. The middle ground that you recommend makes sense.
[jira] [Created] (TIKA-1139) Modify Tika-1129 to test against a local file
Tim Allison created TIKA-1139:

Summary: Modify Tika-1129 to test against a local file
Key: TIKA-1139
URL: https://issues.apache.org/jira/browse/TIKA-1139
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Trivial
Fix For: 1.5

Would prefer to avoid requiring a network call in test unless necessary. The website that was causing the initial issue (Tika-367) has modified their content and it now causes no problems for Tika 0.5 (the version against which the original issue was raised). I simplified Tika-367's evilhtml.html (took out all content and truncated most of it). The modified test file causes the original problem in Tika 0.5, but it causes no problems in Tika 0.6 or trunk.
[jira] [Updated] (TIKA-1139) Modify Tika-1129 to test against a local file
[ https://issues.apache.org/jira/browse/TIKA-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1139:

Attachment: TIKA-1139.patch.tar.gz

Patch attached.
[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:

Attachment: TIKA-973.patch.tar.gz

Middle-road change made. The alternate name is an attribute and partial name is added to content followed by a ":". I also added a few more tests.

Attachments: i-9_screenshot.png, TIKA-973-patch.tar.gz, TIKA-973.patch.tar.gz
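The middle-road format can be pictured with a small, self-contained sketch. It is not the code in the attached patch and uses no PDFBox or Tika classes: the Field class, the render method, and the choice to nest each child list inside its parent item are assumptions made for illustration. Each field becomes a list item whose altName, if any, is an attribute and whose content is the partial name followed by a colon and the value.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of rendering a form-field tree as nested ordered lists. */
final class AcroFormSketch {

    /** Stand-in for a form field; a real extractor would read these from the PDF library. */
    static final class Field {
        final String partialName;
        final String altName; // may be null
        final String value;   // may be null for container fields
        final List<Field> kids;

        Field(String partialName, String altName, String value, Field... kids) {
            this.partialName = partialName;
            this.altName = altName;
            this.value = value;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders the fields as <ol>, recursing into child fields. */
    static String render(List<Field> fields) {
        StringBuilder out = new StringBuilder("<ol>");
        for (Field f : fields) {
            out.append("<li");
            if (f.altName != null) {
                out.append(" altName=\"").append(escape(f.altName)).append('"');
            }
            out.append('>').append(escape(f.partialName)).append(':');
            if (f.value != null) {
                out.append(' ').append(escape(f.value));
            }
            if (!f.kids.isEmpty()) {
                out.append(render(f.kids));
            }
            out.append("</li>");
        }
        return out.append("</ol>").toString();
    }

    public static void main(String[] args) {
        Field street = new Field("StreetNumberName[0]", " Street Number and Name", "123 Main St.");
        Field form = new Field("form1[0]", null, null, street);
        System.out.println(render(Arrays.asList(form)));
    }
}
```

Keeping the partial name in the text content means a plain-text consumer still sees which value belongs to which field, while the longer alternate name stays out of the way in an attribute.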
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698009#comment-13698009 ]

Tim Allison commented on TIKA-1130:

That was fast. Thank you!

Fix For: 1.5
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704699#comment-13704699 ]

Tim Allison commented on TIKA-1130:

Haven't had a chance to build from trunk today, but the latest attachment seems to work on my local build of Tika. Which portions are missing?

Attachments: OwenResume.docx, Resume 6.4.13.docx, tee internal resme.docx, TIKA-1130.patch, TIKA-1130.patch
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704711#comment-13704711 ]

Tim Allison commented on TIKA-1130:

Tested with freshly built trunk, and the text looks good to me. Let me know if you don't find the same. There is a case that this bug fix didn't cover: if a content control takes up an entire table cell (that is, it is the equivalent of a table cell), then that content is not currently being pulled by POI. That is on my todo list.
[jira] [Created] (TIKA-1150) Extract text from textbox in XLSX
Tim Allison created TIKA-1150:

Summary: Extract text from textbox in XLSX
Key: TIKA-1150
URL: https://issues.apache.org/jira/browse/TIKA-1150
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor

Underlying POI library doesn't appear to support easy extraction of text from text boxes in XLSX files. Personal preference would be to wait for modifications in POI and then make a few small changes to Tika to run XSSFTextBox code.
[jira] [Updated] (TIKA-1150) Extract text from textbox in XLSX
[ https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1150:

Attachment: testEXCEL_textbox.xlsx

Simple file that shows issue.
[jira] [Commented] (TIKA-1150) Extract text from textbox in XLSX
[ https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716429#comment-13716429 ]

Tim Allison commented on TIKA-1150:

Duplicate of http://issues.apache.org/jira/browse/TIKA-1100. Closing this one.
[jira] [Closed] (TIKA-1150) Extract text from textbox in XLSX
[ https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison closed TIKA-1150.

Resolution: Duplicate
[jira] [Commented] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716432#comment-13716432 ]

Tim Allison commented on TIKA-1100:

Waiting for improvements in POI-55292. Will make Tika-side upgrades when the next version of POI is released.
Reference: http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

Key: TIKA-1100
URL: https://issues.apache.org/jira/browse/TIKA-1100
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Environment: Windows7 64bit
Reporter: Kazuaki Matsuba

When I launch the Tika GUI from the command line and drag and drop an .xlsx file that has a textbox, no text in the textbox is extracted. When I drag and drop an .xls file, the text in the textbox is extracted.
[jira] [Updated] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1100: -- Attachment: testEXCEL_textbox.xlsx Simple example file attached for now. Will fill out with test cases when POI is ready. cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm) - Key: TIKA-1100 URL: https://issues.apache.org/jira/browse/TIKA-1100 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Windows7 64bit Reporter: Kazuaki Matsuba Attachments: testEXCEL_textbox.xlsx When I launch Tika gui from command-line and drag and drop .xlsx file that have textbox, no text in the textbox are extracted. When drag and drop .xls file, text in the textbox are extracted. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-792: - Attachment: test10.docx Example document that triggers no such method exceptions for: CTMarkupRangeImpl, CTMarkupImpl and CTBookmarkRangeImpl NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document Key: TIKA-792 URL: https://issues.apache.org/jira/browse/TIKA-792 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x Reporter: Torsten Krah Fix For: 1.2 Attachments: test10.docx Parsing some OOXML documents, this stacktrace is logged many times:
java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean)
    at java.lang.Class.getConstructor0(Class.java:2723)
    at java.lang.Class.getDeclaredConstructor(Class.java:2002)
    at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1749)
    at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1886)
    at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1875)
    at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893)
    at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657)
    at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2654)
    at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2647)
    at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995)
    at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904)
    at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:83)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:145)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:115)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
Looking at the poi code java is right here, there is no constructor with a SchemaType and a boolean, only with SchemaType. My guess is this one was missed during upgrade to poi beta4, but only a guess, anyway needs a fix :-).
[jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729781#comment-13729781 ] Tim Allison commented on TIKA-1124: --- If anyone has a chance to look into this, I'd appreciate it. I suspect something is going awry with the recursion in the triggering documents + xpath query in MatchingContentHandler. Thank you! Nested documents not extracted if a PDF file is in the chain Key: TIKA-1124 URL: https://issues.apache.org/jira/browse/TIKA-1124 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.3 Reporter: Tim Allison Priority: Minor Attachments: pdf_attachment_issues.zip Tika 1.3 is not able to get attachments from the attached PDF. The trunk is able to get attachments from the PDF. However, if that PDF is then embedded in another document, the docs embedded in the PDF are not extracted. I'm not sure of a solution, but I found two things that might help with the diagnosis: 1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler, everything works (in trunk). 2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the problem is also solved. The cause may be in the MatchingContentHandler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729903#comment-13729903 ] Tim Allison commented on TIKA-1124: --- OK, I think I figured this out. AbstractOOXML includes the contents of embedded documents before calling handler.endDocument(). PDFParser, however, calls handler.endDocument() and then tries to append content from embedded documents. I think this means that the parent handler sees an end of body and therefore does not process the contents of the embedded document. Trivial fix: move handler.endDocument() out of PDF2XHTML and call it after processing the embedded documents in PDFParser. Unless I hear otherwise, I'll commit this over the next few days. Nested documents not extracted if a PDF file is in the chain Key: TIKA-1124 URL: https://issues.apache.org/jira/browse/TIKA-1124 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.3 Reporter: Tim Allison Priority: Minor Attachments: pdf_attachment_issues.zip Tika 1.3 is not able to get attachments from the attached PDF. The trunk is able to get attachments from the PDF. However, if that PDF is then embedded in another document, the docs embedded in the PDF are not extracted. I'm not sure of a solution, but I found two things that might help with the diagnosis: 1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler, everything works (in trunk). 2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the problem is also solved. The cause may be in the MatchingContentHandler.
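The ordering problem described in this comment can be illustrated with a toy SAX handler. This is only a sketch, not Tika code: `BodyTextCollector` and `EndDocumentOrderDemo` are hypothetical names, and the collector is a simplified stand-in for a handler that keeps only what it sees inside the body element. It uses nothing beyond the `org.xml.sax` classes shipped with the JDK.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a body-matching handler: keeps only text seen inside <body>.
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("body".equals(local)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("body".equals(local)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Anything that arrives after the body has been closed is silently dropped.
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }
}

// Hypothetical driver that emits the same events in the two orders discussed above.
class EndDocumentOrderDemo {
    private static void emit(BodyTextCollector h, String s) {
        h.characters(s.toCharArray(), 0, s.length());
    }

    static String parse(boolean embeddedBeforeClose) {
        BodyTextCollector h = new BodyTextCollector();
        try {
            h.startDocument();
            h.startElement("", "body", "body", new AttributesImpl());
            emit(h, "pdf text;");
            if (embeddedBeforeClose) {
                emit(h, "embedded text;");   // embedded content written while body is open
            }
            h.endElement("", "body", "body");
            h.endDocument();
            if (!embeddedBeforeClose) {
                emit(h, "embedded text;");   // embedded content written after the close
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return h.text();
    }
}
```

In this toy model, `EndDocumentOrderDemo.parse(true)` keeps the embedded text, while `EndDocumentOrderDemo.parse(false)` loses it because it arrives after the body was closed. That mirrors the reasoning in the comment, under the assumption that the parent handler only passes through body content.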
[jira] [Updated] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1124: -- Attachment: TIKA-1124.patch Chose to move embedded file code into PDF2XHTML. This allows the proper closing of /body with the PDF2XHTML's XHTMLContentHandler. Will strip Windows noise before committing, but I wanted to submit this draft in case anyone wants to review it. Nested documents not extracted if a PDF file is in the chain Key: TIKA-1124 URL: https://issues.apache.org/jira/browse/TIKA-1124 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.3 Reporter: Tim Allison Priority: Minor Attachments: pdf_attachment_issues.zip, TIKA-1124.patch Tika 1.3 is not able to get attachments from the attached PDF. The trunk is able to get attachments from the PDF. However, if that PDF is then embedded in another document, the docs embedded in the PDF are not extracted. I'm not sure of a solution, but I found two things that might help with the diagnosis: 1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler, everything works (in trunk). 2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the problem is also solved. The cause may be in the MatchingContentHandler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1124. - Resolution: Fixed Fix Version/s: 1.5 Added tests (thanks to Nick's advice to use model of POIContainerExtractionTest). Committed r1511901 and r1511908. Nested documents not extracted if a PDF file is in the chain Key: TIKA-1124 URL: https://issues.apache.org/jira/browse/TIKA-1124 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.3 Reporter: Tim Allison Priority: Minor Fix For: 1.5 Attachments: pdf_attachment_issues.zip, TIKA-1124.patch Tika 1.3 is not able to get attachments from the attached PDF. The trunk is able to get attachments from the PDF. However, if that PDF is then embedded in another document, the docs embedded in the PDF are not extracted. I'm not sure of a solution, but I found two things that might help with the diagnosis: 1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler, everything works (in trunk). 2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the problem is also solved. The cause may be in the MatchingContentHandler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733804#comment-13733804 ] Tim Allison commented on TIKA-792: -- Committed in POI. Once POI3.9beta2 is released, I'll increment POI's version in Tika's build file and confirm that this is taken care of. There may be other sources of this than the one that my test document triggered. NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document Key: TIKA-792 URL: https://issues.apache.org/jira/browse/TIKA-792 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x Reporter: Torsten Krah Fix For: 1.2 Attachments: test10.docx Parsing some OOXML documents, this stacktrace is logged many times: java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) at java.lang.Class.getConstructor0(Class.java:2723) at java.lang.Class.getDeclaredConstructor(Class.java:2002) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1749) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1886) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1875) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021) at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893) at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657) at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2654) at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2647) at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995) at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904) at org.apache.poi.xwpf.usermodel.XWPFParagraph.init(XWPFParagraph.java:83) at 
org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:145) at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159) at org.apache.poi.xwpf.usermodel.XWPFDocument.init(XWPFDocument.java:115) at org.apache.poi.xwpf.extractor.XWPFWordExtractor.init(XWPFWordExtractor.java:53) at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) Looking at the poi code java is right here, there is no constructor with a SchemaType and a boolean, only with SchemaType. My guess is this one was missed during upgrade to poi beta4, but only a guess, anyway needs a fix :-). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version
[ https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736875#comment-13736875 ] Tim Allison commented on TIKA-1153: --- Fellow Tika committers, I made this change locally and all tests passed. Is there something else I should do before committing this change? Upgrade pdfbox to latest 1.8.2 version -- Key: TIKA-1153 URL: https://issues.apache.org/jira/browse/TIKA-1153 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: Windows/Linux Reporter: Hong-Thai Nguyen Priority: Critical Fix For: 1.5 Current version is 1.8.1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1001: -- Attachment: TIKA-1001v1.tar.gz This is a draft that simplifies the extraction of the charset attribute within a meta tag (old HTML and new HTML5) and should make the charset extraction more robust to noisy meta headers. The strategy is: 1) find the meta tags; 2) find charset=x within the meta tag; 3) return the first valid charset. Is the proposed strategy too broad? Will there be false positives? Will commit in a few days if there is no feedback. Thank you! P.S. Ignore the patch.xml file, of course. :) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset - Key: TIKA-1001 URL: https://issues.apache.org/jira/browse/TIKA-1001 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: david lemon Attachments: badarabic.html, TIKA-1001v1.tar.gz The attached document extracts correctly in Tika 1.1 but incorrectly in Tika 1.2. The difference appears to be that Tika 1.1 honors the HTTP meta content-type tag, which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8. Tika 1.2 appears to ignore the charset specified in the meta tag. Some noodling seems to indicate that the problem is the charset. It doesn't matter what mode Tika is used in (server, app mode, etc.); even if content-type is specified with a charset, the output is still garbage.
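The three-step strategy in this comment could be sketched roughly as follows. This is not the attached patch: `MetaCharsetSniffer`, `firstValidCharset`, and both regular expressions are assumptions made for illustration, and "valid" is taken here to mean "a charset name the running JVM supports".

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of: 1) find meta tags, 2) find charset=x, 3) return first valid charset.
class MetaCharsetSniffer {
    // Step 1: any <meta ...> tag, case-insensitive.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Step 2: charset=x anywhere inside the tag, with an optional opening quote.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the canonical name of the first supported charset, or null if none is found. */
    static String firstValidCharset(String htmlPrefix) {
        Matcher meta = META_TAG.matcher(htmlPrefix);
        while (meta.find()) {
            Matcher cs = CHARSET.matcher(meta.group());
            while (cs.find()) {
                String name = cs.group(1);
                try {
                    // Step 3: accept only names this JVM recognises.
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name).name();
                    }
                } catch (IllegalArgumentException e) {
                    // Syntactically illegal charset name: skip it and keep looking.
                }
            }
        }
        return null;
    }
}
```

Because step 2 looks for `charset=` anywhere inside the tag, the same code covers both the older `http-equiv` form and the HTML5 `<meta charset="...">` form. That breadth is also where the false positives asked about in the comment could come from, for example a `charset=` string inside an unrelated attribute value.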
[jira] [Commented] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740580#comment-13740580 ] Tim Allison commented on TIKA-1001: --- Fixed as of r1514126. Thank you for submitting this issue with test file! tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset - Key: TIKA-1001 URL: https://issues.apache.org/jira/browse/TIKA-1001 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: david lemon Attachments: badarabic.html, TIKA-1001v1.tar.gz attached document extracts correctly in Tika 1.1 attached document extracts incorrectly in tika 1.2. The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8. tika 1.2 appears to ignore the charset specified in the meta tag. Some noodling seems to indicate that the problem is the charset. it doesn't matter what mode tika is used in (server, app mode, etc. even if content-type is specified with a charset, the output is still garbage). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version
[ https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13741785#comment-13741785 ] Tim Allison commented on TIKA-1153: --- Committed as of r1514551. Upgrade pdfbox to latest 1.8.2 version -- Key: TIKA-1153 URL: https://issues.apache.org/jira/browse/TIKA-1153 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: Windows/Linux Reporter: Hong-Thai Nguyen Priority: Critical Fix For: 1.5 Current version is 1.8.1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1001. - tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset - Key: TIKA-1001 URL: https://issues.apache.org/jira/browse/TIKA-1001 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: david lemon Attachments: badarabic.html, TIKA-1001v1.tar.gz attached document extracts correctly in Tika 1.1 attached document extracts incorrectly in tika 1.2. The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8. tika 1.2 appears to ignore the charset specified in the meta tag. Some noodling seems to indicate that the problem is the charset. it doesn't matter what mode tika is used in (server, app mode, etc. even if content-type is specified with a charset, the output is still garbage). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version
[ https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1153. - Resolution: Fixed Upgrade pdfbox to latest 1.8.2 version -- Key: TIKA-1153 URL: https://issues.apache.org/jira/browse/TIKA-1153 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: Windows/Linux Reporter: Hong-Thai Nguyen Priority: Critical Fix For: 1.5 Current version is 1.8.1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1001. --- Resolution: Fixed tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset - Key: TIKA-1001 URL: https://issues.apache.org/jira/browse/TIKA-1001 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: david lemon Attachments: badarabic.html, TIKA-1001v1.tar.gz attached document extracts correctly in Tika 1.1 attached document extracts incorrectly in tika 1.2. The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8. tika 1.2 appears to ignore the charset specified in the meta tag. Some noodling seems to indicate that the problem is the charset. it doesn't matter what mode tika is used in (server, app mode, etc. even if content-type is specified with a charset, the output is still garbage). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742129#comment-13742129 ] Tim Allison commented on TIKA-1162: --- Would you be willing to attach a document/test case that triggers this issue? content-type/charset problem with RFC822Parser -- Key: TIKA-1162 URL: https://issues.apache.org/jira/browse/TIKA-1162 Project: Tika Issue Type: Bug Components: parser Reporter: Maciej Lizewski RFC822Parser (MIME mail) uses MailContentHandler, which internally uses AutoDetectParser to handle each MIME part. The problem is that MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and CONTENT_ENCODING metadata properly, and passes this metadata to the AutoDetectParser::parse method. But that method ignores those headers and overwrites them: MediaType type = this.getDetector().detect(tis, metadata); metadata.set(Metadata.CONTENT_TYPE, type.toString()); This leads to some additional recursion loops (the Detector returns the message/rfc822 MIME type instead of the proper MIME type for the current MIME part), and finally it somehow skips out of the loop but without proper content-type and content-encoding headers... My proposal is to add a check in AutoDetectParser::parse for whether the metadata already contains CONTENT_TYPE and, in that case, not override it. If this is not valid behavior in general, then RFC822Parser should use a custom parser in MailContentHandler that respects the passed content-type...
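The reporter's proposal (keep a content type that is already set, and only fall back to detection when none is present) can be sketched without any Tika classes. Everything here is hypothetical: `ContentTypeGuard`, `resolveType`, the plain `Map` standing in for the metadata object, and the `Supplier` standing in for the detector call.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of "do not override a content type the caller already supplied".
class ContentTypeGuard {
    static final String CONTENT_TYPE = "Content-Type";

    /**
     * Returns the type to parse with. A declared, non-blank content type wins;
     * otherwise the detector is consulted and its answer is recorded in the metadata.
     */
    static String resolveType(Map<String, String> metadata, Supplier<String> detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared;              // respect what the MIME part header said
        }
        String detected = detector.get(); // detection only runs when nothing was declared
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }
}
```

Whether trusting the declared type is appropriate in general is exactly the open question raised in the issue; a declared type can be wrong, so a real change might treat it as a hint to the detector rather than as the final answer.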
[jira] [Commented] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742266#comment-13742266 ] Tim Allison commented on TIKA-1001: --- David, Thank you for submitting this. I fixed the issue triggered by your file and a few other variants that occurred to me. I wouldn't be surprised if we'll need to make more modifications. Please submit any other issues you find. Thank you, again. tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset - Key: TIKA-1001 URL: https://issues.apache.org/jira/browse/TIKA-1001 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: david lemon Attachments: badarabic.html, TIKA-1001v1.tar.gz attached document extracts correctly in Tika 1.1 attached document extracts incorrectly in tika 1.2. The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8. tika 1.2 appears to ignore the charset specified in the meta tag. Some noodling seems to indicate that the problem is the charset. it doesn't matter what mode tika is used in (server, app mode, etc. even if content-type is specified with a charset, the output is still garbage). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Reopened] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1132: --- Assignee: Tim Allison Will add test case in Tika. Parsing some XLS documents hangs entire JVM, requires kill -9 - Key: TIKA-1132 URL: https://issues.apache.org/jira/browse/TIKA-1132 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2, 1.3 Environment: Linux Suse: java version 1.7.0 Java(TM) SE Runtime Environment (build 1.7.0-b147) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) OSX 10.8.3: java version 1.7.0_06 Java(TM) SE Runtime Environment (build 1.7.0_06-b24) Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode) Reporter: Ryan Krueger Assignee: Tim Allison Fix For: 1.5 Attachments: mod3.xlsx, mod.xls Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required. We're running within an email server application parsing documents to extract text of all attachments. When we hit a message with the affected attachment the entire JVM hangs and we mark the message to skip extracting the text from the affected message the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application. We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening but not all. In addition to experiencing the problem on our Linux servers I have tested on OSX and experienced the same problems. I ran the Tika UI and select the affected file or run the CLI. The problem is the same. Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls When running on multi-CPU machines there are two threads running at 100% every time. I have attached a document that triggers the error. I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is accurately extracted. 
[jira] [Created] (TIKA-1173) Upgrade to POI-3.10-beta2
Tim Allison created TIKA-1173: - Summary: Upgrade to POI-3.10-beta2 Key: TIKA-1173 URL: https://issues.apache.org/jira/browse/TIKA-1173 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1173) Upgrade to POI-3.10-beta2
[ https://issues.apache.org/jira/browse/TIKA-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1173. --- Resolution: Fixed Upgrade to POI-3.10-beta2 - Key: TIKA-1173 URL: https://issues.apache.org/jira/browse/TIKA-1173 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772116#comment-13772116 ] Tim Allison edited comment on TIKA-1132 at 9/20/13 5:35 PM: Any recommendations for a test? The underlying problem was that POI was doing on the order of 10^18 division calculations...so not infinite, but exceedingly slow. Would a jUnit timeout of, say, 10 seconds be reasonable? was (Author: talli...@mitre.org): Will add test case in Tika. Parsing some XLS documents hangs entire JVM, requires kill -9 - Key: TIKA-1132 URL: https://issues.apache.org/jira/browse/TIKA-1132 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2, 1.3 Environment: Linux Suse: java version 1.7.0 Java(TM) SE Runtime Environment (build 1.7.0-b147) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) OSX 10.8.3: java version 1.7.0_06 Java(TM) SE Runtime Environment (build 1.7.0_06-b24) Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode) Reporter: Ryan Krueger Assignee: Tim Allison Fix For: 1.5 Attachments: mod3.xlsx, mod.xls Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required. We're running within an email server application parsing documents to extract text of all attachments. When we hit a message with the affected attachment the entire JVM hangs and we mark the message to skip extracting the text from the affected message the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application. We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening but not all. In addition to experiencing the problem on our Linux servers I have tested on OSX and experienced the same problems. I ran the Tika UI and select the affected file or run the CLI. The problem is the same. 
Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls When running on multi-CPU machines there are two threads running at 100% every time. I have attached a document that triggers the error. I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is accurately extracted.
[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772116#comment-13772116 ] Tim Allison edited comment on TIKA-1132 at 9/20/13 5:36 PM: Any recommendations for a test? The underlying problem was that POI was doing on the order of 10^24 division calculations...so not infinite, but exceedingly slow. Would a jUnit timeout of, say, 10 seconds be reasonable? was (Author: talli...@mitre.org): Any recommendations for a test? The underlying problem was that POI was doing on the order of 10^18 division calculations...so not infinite, but exceedingly slow. Would a jUnit timeout of, say, 10 seconds be reasonable? Parsing some XLS documents hangs entire JVM, requires kill -9 - Key: TIKA-1132 URL: https://issues.apache.org/jira/browse/TIKA-1132 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2, 1.3 Environment: Linux Suse: java version 1.7.0 Java(TM) SE Runtime Environment (build 1.7.0-b147) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) OSX 10.8.3: java version 1.7.0_06 Java(TM) SE Runtime Environment (build 1.7.0_06-b24) Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode) Reporter: Ryan Krueger Assignee: Tim Allison Fix For: 1.5 Attachments: mod3.xlsx, mod.xls Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required. We're running within an email server application parsing documents to extract text of all attachments. When we hit a message with the affected attachment the entire JVM hangs and we mark the message to skip extracting the text from the affected message the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application. We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening but not all. 
In addition to experiencing the problem on our Linux servers I have tested on OSX and experienced the same problems. I ran the Tika UI and select the affected file or run the CLI. The problem is the same. Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls When running on multi-CPU machines there are two threads running at 100% every time. I have attached a document that triggers the error. I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is accurately extracted.
[jira] [Commented] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773212#comment-13773212 ] Tim Allison commented on TIKA-792: -- This is now fixed by TIKA-1173. Can anyone recommend a more obvious test of the solution to this than kicking off a process to extract text from the document and capturing stderr? It would be nice to have something that we can generalize to other documents that trigger this issue because of a different set of missing beans. NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document Key: TIKA-792 URL: https://issues.apache.org/jira/browse/TIKA-792 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x Reporter: Torsten Krah Fix For: 1.2 Attachments: test10.docx Parsing some OOXML documents, this stacktrace is logged many times: java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) at java.lang.Class.getConstructor0(Class.java:2723) at java.lang.Class.getDeclaredConstructor(Class.java:2002) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1749) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1886) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1875) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021) at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893) at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657) at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2654) at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2647) at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995) at
org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904) at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:83) at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:145) at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159) at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:115) at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53) at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) Looking at the POI code, Java is right here: there is no constructor taking a SchemaType and a boolean, only one taking a SchemaType. My guess is that this one was missed during the upgrade to POI beta4, but that is only a guess; either way it needs a fix :-).
[jira] [Commented] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778801#comment-13778801 ] Tim Allison commented on TIKA-1100: --- Updated XSSFExcelExtractorDecorator and added test as of r1526489. cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm) - Key: TIKA-1100 URL: https://issues.apache.org/jira/browse/TIKA-1100 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Windows7 64bit Reporter: Kazuaki Matsuba Attachments: testEXCEL_textbox.xlsx When I launch the Tika GUI from the command line and drag and drop an .xlsx file that has a text box, no text in the text box is extracted. When I drag and drop an .xls file, the text in the text box is extracted.
[jira] [Resolved] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1100. --- Resolution: Fixed Fix Version/s: 1.5 r1526498 cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm) - Key: TIKA-1100 URL: https://issues.apache.org/jira/browse/TIKA-1100 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Windows7 64bit Reporter: Kazuaki Matsuba Fix For: 1.5 Attachments: testEXCEL_textbox.xlsx When I launch the Tika GUI from the command line and drag and drop an .xlsx file that has a text box, no text in the text box is extracted. When I drag and drop an .xls file, the text in the text box is extracted.
[jira] [Reopened] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-792: -- Added a test that catches stderr in r1526570. Reopening just to record this. NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document Key: TIKA-792 URL: https://issues.apache.org/jira/browse/TIKA-792 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x Reporter: Torsten Krah Fix For: 1.2 Attachments: test10.docx Parsing some OOXML documents, this stacktrace is logged many times: java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) at java.lang.Class.getConstructor0(Class.java:2723) at java.lang.Class.getDeclaredConstructor(Class.java:2002) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1749) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1886) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1875) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021) at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893) at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657) at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2654) at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2647) at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995) at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904) at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:83) at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:145) at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159) at
org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:115) at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53) at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) Looking at the POI code, Java is right here: there is no constructor taking a SchemaType and a boolean, only one taking a SchemaType. My guess is that this one was missed during the upgrade to POI beta4, but that is only a guess; either way it needs a fix :-).
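A test that catches stderr can be approximated in-process, without launching a separate JVM, by temporarily swapping System.err. The following is only a minimal sketch under that assumption; StderrCapture and capture are hypothetical names, and this is not necessarily how the test in r1526570 is written. It only sees output written through System.err while the action runs, so a logger that cached the original stream earlier would be missed.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of Tika: captures whatever an action writes to System.err.
final class StderrCapture {

    static String capture(Runnable action) {
        PrintStream original = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream replacement = new PrintStream(buffer, true, "UTF-8");
            System.setErr(replacement);
            try {
                action.run();
            } finally {
                // Always restore the real stream, even if the action throws.
                System.setErr(original);
                replacement.flush();
            }
            return buffer.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for "parse the document"; a real test would call the parser here.
        String err = capture(() -> System.err.println("java.lang.NoSuchMethodException: example"));
        // A regression test would assert that this substring is absent after the fix.
        System.out.println("stack trace logged: " + err.contains("NoSuchMethodException"));
    }
}
```

Because the check is a plain substring match on the captured text, the same helper could be pointed at other documents that trigger the problem through a different set of missing beans.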
[jira] [Resolved] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-792. -- Resolution: Fixed NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document Key: TIKA-792 URL: https://issues.apache.org/jira/browse/TIKA-792 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x Reporter: Torsten Krah Fix For: 1.2 Attachments: test10.docx Parsing some OOXML documents, this stacktrace is logged many times: java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) at java.lang.Class.getConstructor0(Class.java:2723) at java.lang.Class.getDeclaredConstructor(Class.java:2002) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1749) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1886) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1875) at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021) at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893) at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657) at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2654) at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2647) at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995) at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904) at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:83) at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:145) at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159) at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:115) at
org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53) at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) Looking at the POI code, Java is right here: there is no constructor taking a SchemaType and a boolean, only one taking a SchemaType. My guess is that this one was missed during the upgrade to POI beta4, but that is only a guess; either way it needs a fix :-).
[jira] [Resolved] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1132. --- Resolution: Fixed Resolved with upgrade to poi-3.10-beta2. Could use help getting jUnit's timeout to work. Currently no unit tests for this. Parsing some XLS documents hangs entire JVM, requires kill -9 - Key: TIKA-1132 URL: https://issues.apache.org/jira/browse/TIKA-1132 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2, 1.3 Environment: Linux Suse: java version 1.7.0 Java(TM) SE Runtime Environment (build 1.7.0-b147) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) OSX 10.8.3: java version 1.7.0_06 Java(TM) SE Runtime Environment (build 1.7.0_06-b24) Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode) Reporter: Ryan Krueger Assignee: Tim Allison Fix For: 1.5 Attachments: mod3.xlsx, mod.xls Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required. We're running within an email server application parsing documents to extract text of all attachments. When we hit a message with the affected attachment the entire JVM hangs and we mark the message to skip extracting the text from the affected message the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application. We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening but not all. In addition to experiencing the problem on our Linux servers I have tested on OSX and experienced the same problems. I ran the Tika UI and select the affected file or run the CLI. The problem is the same. Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls When running on multi-CPU machines there are two threads running at 100% every time. I have attached a document that triggers the error. 
I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is accurately extracted.
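On the open question of a timeout-based test: one option that does not depend on jUnit's own timeout support is to run the parse on a worker thread and give up waiting after a deadline. The following is only a sketch using java.util.concurrent; TimeBoxedTask and finishesWithin are hypothetical names, and the lambdas in main stand in for a real parse call. Cancellation relies on interruption, so a tight CPU-bound loop like the one described in this issue would keep running in the background; the daemon flag only ensures that it does not keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or POI.
final class TimeBoxedTask {

    /** Returns true if the task completed within timeoutMillis, false if the wait timed out. */
    static boolean finishesWithin(Callable<?> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "time-boxed-task");
            worker.setDaemon(true); // a stuck task must not block JVM shutdown
            return worker;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; only helps if the task is interruptible
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Fast stand-in for a parse that completes normally.
        System.out.println(finishesWithin(() -> "parsed", 10_000));
        // Slow stand-in for a parse that effectively hangs.
        System.out.println(finishesWithin(() -> { Thread.sleep(60_000); return null; }, 200));
    }
}
```

A unit test could then assert that parsing mod.xls finishes within the 10 seconds suggested in the earlier comment, at the cost of a leaked busy thread whenever the assertion fails.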
[jira] [Resolved] (TIKA-1076) Upgrade to Apache POI 3.9
[ https://issues.apache.org/jira/browse/TIKA-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1076. --- Resolution: Fixed Added some code similar to the fix to POI-54722 to HSLFExtractor. Uncommented old test. Text is now extracted from tables in HSLF. Upgrade to Apache POI 3.9 - Key: TIKA-1076 URL: https://issues.apache.org/jira/browse/TIKA-1076 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Nick Burch Fix For: 1.5 We should upgrade to Apache POI 3.9, which is the latest version -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-817) (PPT/PPTX) Missing date/time in text content.
[ https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-817. -- Resolution: Fixed As mentioned above, this was fixed a while ago. I added test documents from POI-52367 and POI-52368, and I created simple tests to confirm behavior described in POI issues. (PPT/PPTX) Missing date/time in text content. - Key: TIKA-817 URL: https://issues.apache.org/jira/browse/TIKA-817 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.0 Environment: Win7-64 + java version 1.6.0_26 Reporter: Albert L. Fix For: 1.5 Missing date/time text in text content for PPT and PPTX files. The date and time are missing from the text content. This occurs when one chooses the following with MS-PowerPoint 2010: 1) Insert 2) Date Time 3) Update automatically 4) save to PPT or PPTX -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922 ] Tim Allison commented on TIKA-1162: --- Dear Colleague, I'm on paternity leave. Will be back part time on October 14. Best, Tim content-type/charset problem with RFC822Parser -- Key: TIKA-1162 URL: https://issues.apache.org/jira/browse/TIKA-1162 Project: Tika Issue Type: Bug Components: parser Reporter: Maciej Lizewski RFC822Parser (MIME mail) uses MailContentHandler, which internally uses AutoDetectParser to handle each MIME part. MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and CONTENT_ENCODING metadata properly, and passes this metadata to the AutoDetectParser::parse method. But that method ignores those headers and overwrites them: MediaType type = this.getDetector().detect(tis, metadata); metadata.set(Metadata.CONTENT_TYPE, type.toString()); This leads to some additional recursion loops (the Detector returns the message/rfc822 MIME type instead of the proper MIME type for the current part), and eventually it somehow exits the loop, but without proper content-type and content-encoding headers... My proposal is to add a check in AutoDetectParser::parse for whether the metadata already contains CONTENT_TYPE and, in that case, not override it. If this is not valid behavior in general, then RFC822Parser should use a custom parser in MailContentHandler that respects the passed content-type... -- This message was sent by Atlassian JIRA (v6.1#6144)
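The proposal in the issue description above (keep a content type the caller already supplied, and detect only when none is present) can be sketched as follows. This is an illustration of the idea only, not Tika code: a plain Map stands in for Tika's Metadata, a Supplier stands in for the Detector, and ContentTypeResolver and resolve are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration; a Map stands in for Tika's Metadata object.
final class ContentTypeResolver {

    static final String CONTENT_TYPE = "Content-Type";

    /** Returns the declared type if one is present; otherwise detects and records it. */
    static String resolve(Map<String, String> metadata, Supplier<String> detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared; // trust the header the caller set, e.g. from the MIME part
        }
        String detected = detector.get();
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }

    public static void main(String[] args) {
        Map<String, String> part = new HashMap<>();
        part.put(CONTENT_TYPE, "text/plain; charset=ISO-8859-2");
        // The stand-in detector would report the enclosing message type; the declared type is kept.
        System.out.println(resolve(part, () -> "message/rfc822"));
    }
}
```

Whether a declared type should always win is exactly the open design question in the description; a mislabeled part would then be parsed with the wrong parser.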
[jira] [Commented] (TIKA-817) (PPT/PPTX) Missing date/time in text content.
[ https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811201#comment-13811201 ] Tim Allison commented on TIKA-817: -- Thank you! (PPT/PPTX) Missing date/time in text content. - Key: TIKA-817 URL: https://issues.apache.org/jira/browse/TIKA-817 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.0 Environment: Win7-64 + java version 1.6.0_26 Reporter: Albert L. Fix For: 1.5 Missing date/time text in text content for PPT and PPTX files. The date and time are missing from the text content. This occurs when one chooses the following with MS-PowerPoint 2010: 1) Insert 2) Date Time 3) Update automatically 4) save to PPT or PPTX
[jira] [Resolved] (TIKA-1200) Upgrade pdfbox 1.8.3
[ https://issues.apache.org/jira/browse/TIKA-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1200. --- Resolution: Fixed Fixed in r1547037. Waiting for Jenkins to pick up change to confirm. Thank you! Upgrade pdfbox 1.8.3 Key: TIKA-1200 URL: https://issues.apache.org/jira/browse/TIKA-1200 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: all Reporter: Hong-Thai Nguyen Priority: Critical Fix For: 1.5 pdfbox just released new 1.8.3 version http://www.apache.org/dist/pdfbox/1.8.3/RELEASE-NOTES.txt -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1201: - Assignee: Tim Allison Add possibility for switching to pdfbox NonSequentialPDFParser -- Key: TIKA-1201 URL: https://issues.apache.org/jira/browse/TIKA-1201 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: all Reporter: Hong-Thai Nguyen Assignee: Tim Allison Priority: Critical As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be integrated by default in pdfbox 2.0. ref.: https://issues.apache.org/jira/browse/PDFBOX-1104 http://pdfbox.apache.org/ideas.html We should provide an extended parser, or parameterize the current PDFParser, to call: {code} PDDocument.loadNonSeq(file, scratchFile); {code}
[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1201: -- Attachment: TIKA-1201.patch Trivial patch Add possibility for switching to pdfbox NonSequentialPDFParser -- Key: TIKA-1201 URL: https://issues.apache.org/jira/browse/TIKA-1201 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: all Reporter: Hong-Thai Nguyen Assignee: Tim Allison Priority: Critical Attachments: TIKA-1201.patch As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be integrated by default in pdfbox 2.0. ref.: https://issues.apache.org/jira/browse/PDFBOX-1104 http://pdfbox.apache.org/ideas.html We should provide an extended parser, or parameterize the current PDFParser, to call: {code} PDDocument.loadNonSeq(file, scratchFile); {code}
[jira] [Resolved] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1201. --- Resolution: Fixed Fix Version/s: 1.5 Basic parameter-based capability added in r1547250. User beware that there may be differences in metadata processing between the NonSequentialPDFParser and the traditional parser. Will open issue to track failure to extract metadata from testAnnotations.pdf. Add possibility for switching to pdfbox NonSequentialPDFParser -- Key: TIKA-1201 URL: https://issues.apache.org/jira/browse/TIKA-1201 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: all Reporter: Hong-Thai Nguyen Assignee: Tim Allison Priority: Critical Fix For: 1.5 Attachments: TIKA-1201.patch As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be integrated by default in pdfbox 2.0. ref.: https://issues.apache.org/jira/browse/PDFBOX-1104 http://pdfbox.apache.org/ideas.html We should provide an extended parser, or parameterize the current PDFParser, to call: {code} PDDocument.loadNonSeq(file, scratchFile); {code}
[jira] [Created] (TIKA-1202) Refactor PDFParser to enable easier parameter setting
Tim Allison created TIKA-1202: - Summary: Refactor PDFParser to enable easier parameter setting Key: TIKA-1202 URL: https://issues.apache.org/jira/browse/TIKA-1202 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial It would be handy to be able to set PDFParser parameters (extractAnnotationText, etc) in a config file and via ParseContext. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (TIKA-1202) Refactor PDFParser to enable easier parameter setting
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1202: -- Attachment: TIKA-1202.patch Would appreciate community feedback on this before I commit it (December 6?). Is it ok to deprecate the setters and getters for the PDFParser parameters? Is the use of a simple properties file and integration via ParseContext consistent with design principles of Tika? Thank you! Refactor PDFParser to enable easier parameter setting - Key: TIKA-1202 URL: https://issues.apache.org/jira/browse/TIKA-1202 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Attachments: TIKA-1202.patch It would be handy to be able to set PDFParser parameters (extractAnnotationText, etc) in a config file and via ParseContext. -- This message was sent by Atlassian JIRA (v6.1#6144)
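To make the properties-file idea concrete, here is a rough sketch of a configuration object populated from java.util.Properties, with built-in defaults for any key that is absent. It is not the attached patch: PdfConfigSketch, the default values, and the useNonSequentialParser key are assumptions made for illustration; only extractAnnotationText is a parameter name taken from the issue description.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch, not the class from TIKA-1202.patch.
final class PdfConfigSketch {

    private boolean extractAnnotationText = true;   // default value is an assumption
    private boolean useNonSequentialParser = false; // key name and default are assumptions

    static PdfConfigSketch fromProperties(Properties props) {
        PdfConfigSketch config = new PdfConfigSketch();
        config.extractAnnotationText =
                readBoolean(props, "extractAnnotationText", config.extractAnnotationText);
        config.useNonSequentialParser =
                readBoolean(props, "useNonSequentialParser", config.useNonSequentialParser);
        return config;
    }

    // A missing key keeps the built-in default; anything other than "true" parses as false.
    private static boolean readBoolean(Properties props, String key, boolean defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }

    boolean isExtractAnnotationText() {
        return extractAnnotationText;
    }

    boolean isUseNonSequentialParser() {
        return useNonSequentialParser;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Stands in for reading a properties file from disk or the classpath.
        props.load(new StringReader("extractAnnotationText=false\n"));
        PdfConfigSketch config = fromProperties(props);
        System.out.println(config.isExtractAnnotationText());
        System.out.println(config.isUseNonSequentialParser());
    }
}
```

An object like this could also be handed to the parser per call, which is roughly the role the ParseContext integration mentioned above would play; per-call settings would then override file defaults without deprecated setters on the parser itself.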
[jira] [Created] (TIKA-1203) Some metadata not extracted from PDF files when NonSequentialPDFParser is used
Tim Allison created TIKA-1203: - Summary: Some metadata not extracted from PDF files when NonSequentialPDFParser is used Key: TIKA-1203 URL: https://issues.apache.org/jira/browse/TIKA-1203 Project: Tika Issue Type: Bug Components: parser Reporter: Tim Allison Priority: Minor While working on TIKA-1201, I noticed that metadata was not being extracted from the testAnnotations.pdf file when the NonSequentialPDFParser was being used. I opened PDFBOX-1792. This TIKA issue is a placeholder. When PDFBOX-1792 is fixed, we can stop skipping testAnnotations.pdf in PDFParserTest. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Comment Edited] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837169#comment-13837169 ] Tim Allison edited comment on TIKA-1201 at 12/3/13 4:25 PM: Basic parameter-based capability added in r1547250. User beware that there may be differences in metadata processing between the NonSequentialPDFParser and the traditional parser. (See TIKA-1203 for failure of NonSequentialPDFParser to extract metadata from testAnnotations.pdf). was (Author: talli...@mitre.org): Basic parameter-based capability added in r1547250. User beware that there may be differences in metadata processing between the NonSequentialPDFParser and the traditional parser. Will open issue to track failure to extract metadata from testAnnotations.pdf. Add possibility for switching to pdfbox NonSequentialPDFParser -- Key: TIKA-1201 URL: https://issues.apache.org/jira/browse/TIKA-1201 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Environment: all Reporter: Hong-Thai Nguyen Assignee: Tim Allison Priority: Critical Fix For: 1.5 Attachments: TIKA-1201.patch As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be integrated by default in pdfbox 2.0. ref.: https://issues.apache.org/jira/browse/PDFBOX-1104 http://pdfbox.apache.org/ideas.html We should provide an extended parser, or parameterize the current PDFParser, to call: {code} PDDocument.loadNonSeq(file, scratchFile); {code}
[jira] [Commented] (TIKA-1199) Tika extracts weird signs instead of text
[ https://issues.apache.org/jira/browse/TIKA-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838856#comment-13838856 ] Tim Allison commented on TIKA-1199: --- Doh! Duplicated Marc's PDFBOX-1783. Sorry about that. Tika extracts weird signs instead of text - Key: TIKA-1199 URL: https://issues.apache.org/jira/browse/TIKA-1199 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: MacOSX, Linux Reporter: Marc Teutelink Attachments: gaat fout.pdf, plain_text_tika_output_from_gaat_fout_pdf.txt, structured_text_tika_output_from_gaat_fout_pdf.xml Tika extracts completely bogus text from the attached document. I have attached the PDF in question, along with the plain and structured text output from Tika.
[jira] [Resolved] (TIKA-1202) Refactor PDFParser to enable easier parameter setting
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1202. --- Resolution: Fixed Fix Version/s: 1.5 Committed in r1548700. Thank you, Mike and Hong-Thai for feedback. More parameters on the way... Refactor PDFParser to enable easier parameter setting - Key: TIKA-1202 URL: https://issues.apache.org/jira/browse/TIKA-1202 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.5 Attachments: TIKA-1202.patch It would be handy to be able to set PDFParser parameters (extractAnnotationText, etc) in a config file and via ParseContext. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Reopened] (TIKA-1202) Refactor PDFParser to enable easier parameter setting
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1202: --- Small bug in using default vs config. Refactor PDFParser to enable easier parameter setting - Key: TIKA-1202 URL: https://issues.apache.org/jira/browse/TIKA-1202 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.5 Attachments: TIKA-1202.patch It would be handy to be able to set PDFParser parameters (extractAnnotationText, etc) in a config file and via ParseContext. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Resolved] (TIKA-1202) Refactor PDFParser to enable easier parameter setting
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1202. --- Resolution: Fixed r1549646 Refactor PDFParser to enable easier parameter setting - Key: TIKA-1202 URL: https://issues.apache.org/jira/browse/TIKA-1202 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.5 Attachments: TIKA-1202.patch It would be handy to be able to set PDFParser parameters (extractAnnotationText, etc) in a config file and via ParseContext. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
Tim Allison created TIKA-1205: - Summary: Allow PDFParser to fallback to other parser if there is an exception Key: TIKA-1205 URL: https://issues.apache.org/jira/browse/TIKA-1205 Project: Tika Issue Type: Improvement Components: parser Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.5 With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser instead of the traditional parser for parsing PDF files. Following the description in PDFBOX-1199, it would be useful to allow fallback to the classic parser if NonSequentialPDFParser encounters an IOException. For the sake of symmetry, I propose a boolean useParserFallbackOnException parameter. If this parameter is true, and if Tika's PDFParser is using the classic parser, Tika will fall back to the NonSequentialPDFParser if there is an IOException; if this parameter is true and if Tika's PDFParser is using the NonSequentialPDFParser it will fall back to the classic parser if there is an IOException. Many thanks to Hong-Thai for championing the addition of the added NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for PDFBox's NonSequentialPDFParser (PDFBOX-1199)! -- This message was sent by Atlassian JIRA (v6.1.4#6159)
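The symmetric fallback proposed in TIKA-1205 can be sketched roughly as follows. This is only an illustration of the control flow, not the Tika or PDFBox implementation: the Loader interface and the loadWithFallback method are hypothetical stand-ins for "load a PDF with one parser"; only the parameter name useParserFallbackOnException comes from the issue text.

```java
import java.io.IOException;

class ParserFallbackSketch {

    /** Hypothetical stand-in for "load a PDF with one parser"; not a PDFBox or Tika type. */
    interface Loader {
        String load(byte[] pdf) throws IOException;
    }

    /**
     * Tries the primary loader first. If it throws an IOException and the
     * fallback flag is set, tries the secondary loader. The same method covers
     * both directions, because the caller decides which loader is primary.
     */
    static String loadWithFallback(byte[] pdf, Loader primary, Loader secondary,
                                   boolean useParserFallbackOnException) throws IOException {
        try {
            return primary.load(pdf);
        } catch (IOException first) {
            if (!useParserFallbackOnException) {
                throw first; // fallback disabled: report the original failure
            }
            try {
                return secondary.load(pdf);
            } catch (IOException second) {
                second.addSuppressed(first); // keep the first failure for diagnosis
                throw second;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed behavior for the demo: one loader always fails, the other succeeds.
        Loader failing = pdf -> { throw new IOException("first parser failed"); };
        Loader working = pdf -> "text from the other parser";
        System.out.println(loadWithFallback(new byte[0], failing, working, true));
    }
}
```

Attaching the first exception as a suppressed exception is a design choice of this sketch, so that a failure of both parsers still shows why the preferred one failed.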
[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845429#comment-13845429 ] Tim Allison commented on TIKA-1205: --- Thank you for your feedback! TIKA-456 is the existing issue for general timeout capability. I agree that it would be great to add. TIKA-1205 is a very narrowly defined improvement for PDFParser. Allow PDFParser to fallback to other parser if there is an exception Key: TIKA-1205 URL: https://issues.apache.org/jira/browse/TIKA-1205 Project: Tika Issue Type: Improvement Components: parser Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.5 With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser instead of the traditional parser for parsing PDF files. Following the description in PDFBOX-1199, it would be useful to allow fallback to the classic parser if NonSequentialPDFParser throws an IOException. For the sake of symmetry, I propose a boolean useParserFallbackOnException parameter. If this parameter is true, and if Tika's PDFParser is using the classic parser, Tika will fall back to the NonSequentialPDFParser if there is an IOException; if this parameter is true and if Tika's PDFParser is using the NonSequentialPDFParser, it will fall back to the classic parser if there is an IOException. Many thanks to Hong-Thai for championing the addition of the added NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for PDFBox's NonSequentialPDFParser (PDFBOX-1199)! -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Reopened] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-973: -- Assignee: Tim Allison In hindsight, I would prefer to use test documents that are unequivocally consistent with the Apache License. I've removed docs from trunk and commented out test cases (r1550725). If anyone would like to contribute an example doc that is unequivocally consistent with Apache License 2.0, I'll modify the test case for that doc. I'll be on the lookout for test docs and will leave this open until test cases are turned back on. The functionality within Tika is still available, of course. PDF form data isn't included in extracted content. -- Key: TIKA-973 URL: https://issues.apache.org/jira/browse/TIKA-973 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.2 Reporter: Michael Graessle Assignee: Tim Allison Priority: Minor Fix For: 1.5 Attachments: TIKA-973-patch.tar.gz, TIKA-973.patch.tar.gz, i-9_screenshot.png When extracting content from PDFs, PDF form data isn't extracted. The following code extracts this data via PDFBox, but it seems like something Tika should be doing. PDDocumentCatalog docCatalog = load.getDocumentCatalog(); if (docCatalog != null) { PDAcroForm acroForm = docCatalog.getAcroForm(); if (acroForm != null) { @SuppressWarnings("unchecked") List<PDField> fields = acroForm.getFields(); if (fields != null && fields.size() > 0) { documentContent.append(" "); for (PDField field : fields) { if (field.getValue() != null) { documentContent.append(field.getValue()); documentContent.append(" "); } } } } } -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852938#comment-13852938 ] Tim Allison commented on TIKA-1212: --- On the first issue: do you mean that you'd like to have a parameter that would unzip the abc.zip file but not unzip the pqr.zip file? Or do you want to be able to select embedded document types that you don't want to recurse through? Recursive Extraction of Archive File Key: TIKA-1212 URL: https://issues.apache.org/jira/browse/TIKA-1212 Project: Tika Issue Type: Bug Reporter: Vikram Priority: Critical Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example Requirement: - abc.zip --- a.doc --- b.xls --- pqr.zip - m.ppt There are two issues with Tika: 1. How can extraction of embedded docs be blocked separately, optionally? 2. When I extract recursively, the file name (or resourceKeyName) is not reported properly. For example, a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It should have the value abc.zip/pqr.zip/m.ppt. Even for the embedded doc, only a random name is reported, without the proper file path. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1212: -- Attachment: abc.zip Does this test file meet your description? Recursive Extraction of Archive File Key: TIKA-1212 URL: https://issues.apache.org/jira/browse/TIKA-1212 Project: Tika Issue Type: Bug Reporter: Vikram Priority: Critical Attachments: abc.zip Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example Requirement: - abc.zip --- a.doc --- b.xls --- pqr.zip - m.ppt There are two issues with Tika: 1. How can extraction of embedded docs be blocked separately, optionally? 2. When I extract recursively, the file name (or resourceKeyName) is not reported properly. For example, a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It should have the value abc.zip/pqr.zip/m.ppt. Even for the embedded doc, only a random name is reported, without the proper file path. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
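The path naming requested in TIKA-1212 (abc.zip/pqr.zip/m.ppt rather than pqr/m.ppt) amounts to carrying the chain of container names down through the recursion. A minimal sketch of that idea, assuming a hypothetical in-memory Entry type in place of real archive entries and without using any Tika API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class EmbeddedPathSketch {

    /** Hypothetical container entry (name plus embedded children); not a Tika type. */
    static final class Entry {
        final String name;
        final List<Entry> children = new ArrayList<>();

        Entry(String name, Entry... kids) {
            this.name = name;
            for (Entry kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Returns the full container path of every leaf document under root. */
    static List<String> fullPaths(Entry root) {
        List<String> out = new ArrayList<>();
        walk(root, new ArrayDeque<>(), out);
        return out;
    }

    private static void walk(Entry entry, Deque<String> ancestors, List<String> out) {
        ancestors.addLast(entry.name);          // push this level onto the path
        if (entry.children.isEmpty()) {
            out.add(String.join("/", ancestors)); // leaf: emit the whole chain
        } else {
            for (Entry child : entry.children) {
                walk(child, ancestors, out);
            }
        }
        ancestors.removeLast();                 // pop before returning to the parent
    }

    public static void main(String[] args) {
        // The layout described in the issue.
        Entry abc = new Entry("abc.zip",
                new Entry("a.doc"),
                new Entry("b.xls"),
                new Entry("pqr.zip", new Entry("m.ppt")));
        fullPaths(abc).forEach(System.out::println);
    }
}
```

In this sketch an empty container is treated like a leaf; a real extractor would distinguish the two by type rather than by child count.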
[jira] [Updated] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1205: -- Due Date: 17/Jan/14 (was: 20/Dec/13) Allow PDFParser to fallback to other parser if there is an exception Key: TIKA-1205 URL: https://issues.apache.org/jira/browse/TIKA-1205 Project: Tika Issue Type: Improvement Components: parser Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.5 With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser instead of the traditional parser for parsing PDF files. Following the description in PDFBOX-1199, it would be useful to allow fallback to the classic parser if NonSequentialPDFParser throws an IOException. For the sake of symmetry, I propose a boolean useParserFallbackOnException parameter. If this parameter is true, and if Tika's PDFParser is using the classic parser, Tika will fall back to the NonSequentialPDFParser if there is an IOException; if this parameter is true and if Tika's PDFParser is using the NonSequentialPDFParser, it will fall back to the classic parser if there is an IOException. Many thanks to Hong-Thai for championing the addition of the added NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for PDFBox's NonSequentialPDFParser (PDFBOX-1199)! -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13864393#comment-13864393 ] Tim Allison commented on TIKA-1216: --- Give this a shot: https://builds.apache.org/job/Tika-trunk/org.apache.tika$tika-app/lastSuccessfulBuild/artifact/org.apache.tika/tika-app/1.5-20131229.024202-48/tika-app-1.5-20131229.024202-48.jar parse method of Mp3Parser doesn't work for few mp3 files Key: TIKA-1216 URL: https://issues.apache.org/jira/browse/TIKA-1216 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows 7 ultimate 32-bit OS, Java 1.7 Reporter: Sumeet Gorab Priority: Blocker Labels: patch Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3 Try to parse an MP3 file: the parse method of the Mp3Parser class is not able to parse that file. The parse method is not able to complete its execution; there is some issue in that method. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1216. --- Resolution: Fixed Fix Version/s: 1.5 Following reporter's comment, this looks to be fixed in 1.5-SNAPSHOT. If it turns out to be a duplicate of TIKA-1215, I'll switch resolution to duplicate. Thank you for reporting this! parse method of Mp3Parser doesn't work for few mp3 files Key: TIKA-1216 URL: https://issues.apache.org/jira/browse/TIKA-1216 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows 7 ultimate 32-bit OS, Java 1.7 Reporter: Sumeet Gorab Priority: Blocker Labels: patch Fix For: 1.5 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3 Try to parse an MP3 file: the parse method of the Mp3Parser class is not able to parse that file. The parse method is not able to complete its execution; there is some issue in that method. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13866916#comment-13866916 ] Tim Allison commented on TIKA-1216: --- Agreed. I didn't think this was a duplicate. It is fixed, though, in trunk? If so, let's close this issue. parse method of Mp3Parser doesn't work for few mp3 files Key: TIKA-1216 URL: https://issues.apache.org/jira/browse/TIKA-1216 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows 7 ultimate 32-bit OS, Java 1.7 Reporter: Sumeet Gorab Priority: Blocker Labels: patch Fix For: 1.5 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3 Try to parse an MP3 file: the parse method of the Mp3Parser class is not able to parse that file. The parse method is not able to complete its execution; there is some issue in that method. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13869528#comment-13869528 ] Tim Allison commented on TIKA-1215: --- [~thaichat04] thank you for sending a clean patch. This area of the code base is not exceedingly familiar to me, but if I understand Tika's history and your code correctly, your if statement wasn't necessary in 1.4, and (based on a very quick look) it looks like nothing else in the relevant lines of the MP3 parser changed between 1.4 and trunk. Are you able to determine what changed btwn 1.4 and trunk that led to this regression? Thank you! Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4 -- Key: TIKA-1215 URL: https://issues.apache.org/jira/browse/TIKA-1215 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Critical Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, tika-1215-without-wildcard.patch With attached file, 1.5 raises this exception on parsing. This file has no problem on 1.4 {code} ... 
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62) at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68) at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284) at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323) at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107) at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221) ... 15 more {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1226: - Assignee: Tim Allison PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField Key: TIKA-1226 URL: https://issues.apache.org/jira/browse/TIKA-1226 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Eric Knauel Assignee: Tim Allison I have a PDF document that contains a filled in form. Among the various fields of type text and radio button there are multiple fields for digital signatures. When I load this document into tika-app I get the following exception: {noformat} Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead. at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131) at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467) at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425) at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411) at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 43 more {noformat} The problem seems to be that PDF2XHTML seems to expect that it can call getValue() on all PDField objects. According to the PDFBox 1.8.3 java doc this is not true for the sub class PDSignatureField: http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html The java doc says that getSignature() should be called instead. 
Assuming that the information inside the signature is not relevant for the extraction process and can be discarded the following patch helps: {noformat} Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java IDEA additional info: Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP +UTF-8 === --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision 1560617) +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision ) @@ -40,6 +40,7 @@ import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode; import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm; import org.apache.pdfbox.pdmodel.interactive.form.PDField; +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField; import org.apache.pdfbox.util.PDFTextStripper; import org.apache.pdfbox.util.TextPosition; import org.apache.tika.exception.TikaException; @@ -464,7 +465,9 @@ } String value = ; try { + if (!(field instanceof PDSignatureField)) { - value = field.getValue(); + value = field.getValue(); + } } catch (IOException e) { //swallow } {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13880130#comment-13880130 ] Tim Allison commented on TIKA-1226: --- Eric, Thank you for reporting this. I'll make the fix shortly. Are you able to share your document as a test case? Thank you, again. PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField Key: TIKA-1226 URL: https://issues.apache.org/jira/browse/TIKA-1226 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Eric Knauel Assignee: Tim Allison I have a PDF document that contains a filled in form. Among the various fields of type text and radio button there are multiple fields for digital signatures. When I load this document into tika-app I get the following exception: {noformat} Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead. at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131) at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467) at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425) at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411) at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 43 more {noformat} The problem seems to be that PDF2XHTML seems to expect that it can call getValue() on all PDField objects. 
According to the PDFBox 1.8.3 java doc this is not true for the sub class PDSignatureField: http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html The java doc says that getSignature() should be called instead. Assuming that the information inside the signature is not relevant for the extraction process and can be discarded the following patch helps: {noformat} Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java IDEA additional info: Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP +UTF-8 === --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision 1560617) +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision ) @@ -40,6 +40,7 @@ import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode; import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm; import org.apache.pdfbox.pdmodel.interactive.form.PDField; +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField; import org.apache.pdfbox.util.PDFTextStripper; import org.apache.pdfbox.util.TextPosition; import org.apache.tika.exception.TikaException; @@ -464,7 +465,9 @@ } String value = ; try { + if (!(field instanceof PDSignatureField)) { - value = field.getValue(); + value = field.getValue(); + } } catch (IOException e) { //swallow } {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13880273#comment-13880273 ] Tim Allison commented on TIKA-1226: --- How about we grab the name? {noformat} if (field instanceof PDSignatureField){ PDSignature sig = ((PDSignatureField)field).getSignature(); if (sig != null){ value = sig.getName(); } } else { value = field.getValue(); } {noformat} Should we also grab the contactinfo, location, the date or the reason? PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField Key: TIKA-1226 URL: https://issues.apache.org/jira/browse/TIKA-1226 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Eric Knauel Assignee: Tim Allison I have a PDF document that contains a filled in form. Among the various fields of type text and radio button there are multiple fields for digital signatures. When I load this document into tika-app I get the following exception: {noformat} Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead. at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131) at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467) at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425) at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411) at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 43 more {noformat} The problem seems to be that PDF2XHTML seems to expect that it can call getValue() on all PDField objects. 
According to the PDFBox 1.8.3 java doc this is not true for the sub class PDSignatureField: http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html The java doc says that getSignature() should be called instead. Assuming that the information inside the signature is not relevant for the extraction process and can be discarded the following patch helps: {noformat} Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java IDEA additional info: Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP +UTF-8 === --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision 1560617) +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision ) @@ -40,6 +40,7 @@ import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode; import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm; import org.apache.pdfbox.pdmodel.interactive.form.PDField; +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField; import org.apache.pdfbox.util.PDFTextStripper; import org.apache.pdfbox.util.TextPosition; import org.apache.tika.exception.TikaException; @@ -464,7 +465,9 @@ } String value = ; try { + if (!(field instanceof PDSignatureField)) { - value = field.getValue(); + value = field.getValue(); + } } catch (IOException e) { //swallow } {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13881383#comment-13881383 ] Tim Allison commented on TIKA-1226: --- Thank you for the test file. I'll use that in the formal test. I used another doc for dev that I unfortunately can't share. Does this format look good? My dev doc only had name and date, but the other info would also show up if it existed... {noformat} <div class="acroform"> <ol> <li altName="name">Name: my name</li> <li> <ol type="signaturedata"> <li signdata="date">2014-01-17T11:57:26-0500</li> <li signdata="name">my name</li> </ol> </li> </ol> </div> {noformat} PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField Key: TIKA-1226 URL: https://issues.apache.org/jira/browse/TIKA-1226 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Eric Knauel Assignee: Tim Allison Attachments: pdf-form-with-signature-field-empty.pdf I have a PDF document that contains a filled in form. Among the various fields of type text and radio button there are multiple fields for digital signatures. When I load this document into tika-app I get the following exception: {noformat} Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead. at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131) at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467) at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425) at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411) at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 
43 more {noformat} The problem seems to be that PDF2XHTML seems to expect that it can call getValue() on all PDField objects. According to the PDFBox 1.8.3 java doc this is not true for the sub class PDSignatureField: http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html The java doc says that getSignature() should be called instead. Assuming that the information inside the signature is not relevant for the extraction process and can be discarded the following patch helps: {noformat} Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java IDEA additional info: Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP +UTF-8 === --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision 1560617) +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision ) @@ -40,6 +40,7 @@ import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode; import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm; import org.apache.pdfbox.pdmodel.interactive.form.PDField; +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField; import org.apache.pdfbox.util.PDFTextStripper; import org.apache.pdfbox.util.TextPosition; import org.apache.tika.exception.TikaException; @@ -464,7 +465,9 @@ } String value = ; try { + if (!(field instanceof PDSignatureField)) { - value = field.getValue(); + value = field.getValue(); + } } catch (IOException e) { //swallow } {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13881383#comment-13881383 ] Tim Allison edited comment on TIKA-1226 at 1/24/14 8:22 PM: Thank you for the test file. I'll use that in the formal test. I used another doc for dev that I unfortunately can't share. Does this format look good? My dev doc only had name and date, but the other info would also show up if it existed... {noformat} <div class="acroform"> <ol> <li altName="name">Name: my name</li> <li> <ol type="signaturedata"> <li signdata="date">2014-01-17T11:57:26-0500</li> <li signdata="name">my name</li> </ol> </li> </ol> </div> {noformat} was (Author: talli...@mitre.org): Thank you for the test file. I'll use that in the formal test. I used another doc for dev that I unfortunately can't share. Does this format look good? My dev doc only had name and date, but the other info would also show up if it existed... {noformat} <div class="acroform"> <ol> <li altName="name">Name: my name</li> <li> <ol type="signaturedata"> <li signdata="date">2014-01-17T11:57:26-0500</li> <li signdata="name">my name</li> </ol> </li> </ol> </div> {noformat} PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField Key: TIKA-1226 URL: https://issues.apache.org/jira/browse/TIKA-1226 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Eric Knauel Assignee: Tim Allison Attachments: pdf-form-with-signature-field-empty.pdf I have a PDF document that contains a filled in form. Among the various fields of type text and radio button there are multiple fields for digital signatures. When I load this document into tika-app I get the following exception: {noformat} Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead. 
at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131) at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467) at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425) at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411) at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 43 more {noformat} The problem seems to be that PDF2XHTML seems to expect that it can call getValue() on all PDField objects. According to the PDFBox 1.8.3 java doc this is not true for the sub class PDSignatureField: http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html The java doc says that getSignature() should be called instead. 
Assuming that the information inside the signature is not relevant for the extraction process and can be discarded, the following patch helps:
{noformat}
Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
--- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision 1560617)
+++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision )
@@ -40,6 +40,7 @@
 import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
 import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
 import org.apache.pdfbox.pdmodel.interactive.form.PDField;
+import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
 import org.apache.pdfbox.util.PDFTextStripper;
 import org.apache.pdfbox.util.TextPosition;
 import org.apache.tika.exception.TikaException;
@@ -464,7 +465,9 @@
         }
         String value = "";
         try {
+            if (!(field instanceof PDSignatureField)) {
-            value = field.getValue();
+                value = field.getValue();
+            }
         } catch (IOException e) {
             //swallow
         }
{noformat}
-- This message was sent
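The guard in the patch above can be tried in isolation. What follows is only a minimal sketch: Field and SignatureField are hypothetical stand-ins for PDFBox's PDField and PDSignatureField (they are not the real classes), and fieldString is an invented helper name that mirrors the patched try block.

```java
import java.io.IOException;

public class SignatureFieldGuard {
    // Hypothetical stand-in for PDFBox's PDField; not the real class.
    static class Field {
        private final String value;

        Field(String value) {
            this.value = value;
        }

        String getValue() throws IOException {
            return value;
        }
    }

    // Hypothetical stand-in for PDSignatureField: getValue() is unsupported
    // and fails with an unchecked exception, as in the reported stack trace.
    static class SignatureField extends Field {
        SignatureField() {
            super(null);
        }

        @Override
        String getValue() {
            throw new RuntimeException(
                    "Can't get signature as String, use getSignature() instead.");
        }
    }

    // Mirrors the patched logic: never call getValue() on a signature field.
    static String fieldString(Field field) {
        String value = "";
        try {
            if (!(field instanceof SignatureField)) {
                value = field.getValue();
            }
        } catch (IOException e) {
            // swallow, as the patched code does
        }
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        System.out.println("[" + fieldString(new Field("my name")) + "]");
        System.out.println("[" + fieldString(new SignatureField()) + "]");
    }
}
```

The instanceof check is needed because the failure is an unchecked RuntimeException, which the existing catch for IOException would not intercept.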
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ] Tim Allison commented on TIKA-1228: I won't have time to fix this for a week or so, but it looks like the client (Tika) needs to look through the kids of embeddedFiles recursively (well, in this file, just one level down) to get the non-null embeddedFileNames. Something like this does pull out the .doc file:
{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids) {
    // names stored on the child node rather than on the root
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{noformat}
where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
    ...
}
{noformat}
We can fix this at the Tika level in the short term. I'm not sure if this is the expected behavior in PDFBox. At the least we might want to request that this line in the Javadoc for PDDocumentNameDictionary: "(The value in this name tree will be PDComplexFileSpecification objects.)" be changed to "(The value in this name tree or its children will be PDComplexFileSpecification objects.)"
Embedded files not extracted properly from PDF -- Key: TIKA-1228 URL: https://issues.apache.org/jira/browse/TIKA-1228 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: CentOS 6.5 VM Reporter: Jason Sherman Labels: easyfix Attachments: pdf_with_doc_and_text_attached.pdf IAW pdfbox example here: http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java the PDF parser does not check for additional entries under Kids node when Names node does not exist. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
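The difference between checking only the root and also descending into the kids can be sketched without PDFBox. This is an illustration under stated assumptions, not the Tika fix: Node is a hypothetical stand-in for a name tree node (not PDNameTreeNode), the map values are plain strings instead of file specifications, and collect is an invented helper that walks the whole tree rather than one level.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeWalk {
    // Hypothetical stand-in for a PDF name tree node; not PDFBox's PDNameTreeNode.
    static class Node {
        final Map<String, String> names; // null when the node has no Names entry
        final List<Node> kids;           // null when the node has no Kids entry

        Node(Map<String, String> names, List<Node> kids) {
            this.names = names;
            this.kids = kids;
        }
    }

    // Collects the names found on this node and on every descendant,
    // skipping nodes whose Names or Kids entry is missing.
    static void collect(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.names != null) {
            out.putAll(node.names);
        }
        if (node.kids != null) {
            for (Node kid : node.kids) {
                collect(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> leafNames = new LinkedHashMap<>();
        leafNames.put("attached.doc", "file spec placeholder");
        Node leaf = new Node(leafNames, null);
        // The root has Kids but no Names, as in the reported file.
        Node root = new Node(null, Collections.singletonList(leaf));

        Map<String, String> found = new LinkedHashMap<>();
        collect(root, found);
        System.out.println(found.keySet());
    }
}
```

A one-level variant would simply loop over root.kids without the recursive call; which of the two PDFBox callers are expected to use is the open question raised in the thread.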
[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ] Tim Allison edited comment on TIKA-1228 at 2/3/14 6:09 PM: I won't have time to fix this for a week or so, but it looks like the client (Tika) needs to look through the kids of embeddedFiles recursively (well, in this file, just one level down) to get the non-null embeddedFileNames. Something like this does pull out the .doc file:
{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids) {
    // names stored on the child node rather than on the root
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{noformat}
where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
    ...
}
{noformat}
We can fix this at the Tika level in the short term. I'm not sure if this is the expected behavior in PDFBox. At the least we might want to request that this line in the Javadoc for PDDocumentNameDictionary: "(The value in this name tree will be PDComplexFileSpecification objects.)" be changed to "(The value in this name tree or its children will be PDComplexFileSpecification objects.)"
[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ] Tim Allison edited comment on TIKA-1228 at 2/3/14 6:11 PM: I won't have time to fix this for a week or so, but I'll take this unless another committer has time sooner.
[jira] [Resolved] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1228. Resolution: Fixed Fix Version/s: 1.5 Fixed in r1564042. Thank you, [~agi20dla], for reporting this and diagnosing the cause and solution for this bug! I'm resolving this for now. I'm waiting to hear back from users@pdfbox to see if we should search recursively for non-null attachment data. The example that you provided does show only checking the children. I'll reopen this issue if we need to switch to full recursion. Thank you, again.
[jira] [Issue Comment Deleted] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1228: Comment: was deleted (was: I won't have time to fix this for a week or so, but I'll take this unless another committer has time sooner.)
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890605#comment-13890605 ] Tim Allison commented on TIKA-1228: Not sure I understand. Is this the snippet that you refer to in PDNameTreeNode:
{noformat}
public Map<String, COSObjectable> getNames() throws IOException {
    COSArray namesArray = (COSArray)node.getDictionaryObject( COSName.NAMES );
{noformat}
The above throws a class cast exception, but the code that you show doesn't? Are you getting a class cast exception on the document that you submitted with this issue or is it a different document? Thank you, again.
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890610#comment-13890610 ] Tim Allison commented on TIKA-1228: Y. That's the point of open source. :) Enjoy! Now that I'm looking at this issue again, I dragged out some of my pre-Tika code for pdf attachments using a different pdf library. It looks like the pdf files I was coding against could have the file name in a parent node and the actual bytes in a child or more distant descendant node. Will see if I can dig up the triggering files and see if Tika needs any more mods on PDF attachment extraction.
{noformat}
private MyPDFAttachment lookForByteStream(COSDictionary dict, MyPDFAttachment attach, int recursiveDepth){
    COSName fCOSName = COSName.create("F");
    COSName efCOSName = COSName.create("EF");
    COSObject fObj = dict.get(fCOSName);
    COSObject efObj = dict.get(efCOSName);
    if (null != fObj){
        if (fObj.getClass() == COSString.class){
            // F holds the file name at this level
            attach.setName(fObj.stringValue());
        } else if (fObj.getClass() == COSStream.class){
            // F holds the actual bytes: done
            attach.setBytes(((COSStream)fObj).getDecodedBytes());
            return attach;
        }
    }
    if (null != efObj && efObj.getClass() == COSDictionary.class){
        // descend into the EF dictionary to look for the byte stream
        int tmpI = recursiveDepth;
        tmpI++;
        return lookForByteStream((COSDictionary)efObj, attach, tmpI);
    }
    return null;
}
{noformat}
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890613#comment-13890613 ] Tim Allison commented on TIKA-1228: Ok, to confirm, the PDNameTreeNode class cast exception is a non-issue? Thanks again.
[jira] [Created] (TIKA-1230) Update PDFBox to v1.8.4
Tim Allison created TIKA-1230: Summary: Update PDFBox to v1.8.4 Key: TIKA-1230 URL: https://issues.apache.org/jira/browse/TIKA-1230 Project: Tika Issue Type: Improvement Affects Versions: 1.5 Reporter: Tim Allison Priority: Trivial Fix For: 1.5
[jira] [Resolved] (TIKA-1230) Update PDFBox to v1.8.4
[ https://issues.apache.org/jira/browse/TIKA-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1230. Resolution: Fixed r1564335
[jira] [Created] (TIKA-1231) Safely handle null embedded files in PDFs
Tim Allison created TIKA-1231: Summary: Safely handle null embedded files in PDFs Key: TIKA-1231 URL: https://issues.apache.org/jira/browse/TIKA-1231 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.5 I filed a potential fix, unit test and test doc for this in PDFBOX-1884. We'll need to add one test for null in the Tika PDFParser to handle this change once it is fixed in PDFBox.
[jira] [Assigned] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1232: Assignee: Tim Allison Add PDF version to PDFParser output -- Key: TIKA-1232 URL: https://issues.apache.org/jira/browse/TIKA-1232 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Environment: JDK6 Reporter: William Palmer Assignee: Tim Allison Priority: Minor Attachments: pdfversion.patch I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892146#comment-13892146 ] Tim Allison commented on TIKA-1232: How about "Application-Version", to follow the deprecated example in org.apache.tika.metadata.MSOffice? Tika community, is there a more appropriate label for this? I didn't find anything relevant in TikaCoreProperties. Thank you.
[jira] [Created] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates
Tim Allison created TIKA-1233: Summary: PDFBox can throw StringIndexOutOfBoundsException on some dates Key: TIKA-1233 URL: https://issues.apache.org/jira/browse/TIKA-1233 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Tim Allison Priority: Trivial Fix For: 1.6 PDFBox's date parser can throw a StringIndexOutOfBoundsException if a date string for parsing is empty or contains only spaces. A few of my test pdfs have this feature. I've raised PDFBOX-1883. Until that is resolved, we can add an extra catch to prevent this from causing problems in Tika:
{noformat}
@@ -171,6 +171,9 @@
             addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
         try {
             Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
             addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
{noformat}
I'd commit now, but I don't want to interfere with cutting of 1.5. Let me know if I should commit, or please do it for me if appropriate.
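The failure mode described above, a parser that indexes into an empty or all-space date string, and the extra catch that contains it can be reproduced in a few lines. This is a sketch only: parseYear is a hypothetical toy parser, not PDFBox's date converter, and safeYear is an invented wrapper that plays the role of the added catch clause.

```java
import java.util.Optional;

public class BlankDateGuard {
    // Hypothetical toy parser that reads the first four characters as a year
    // without checking the length first; not PDFBox's actual code.
    static int parseYear(String date) {
        String trimmed = date.trim();
        // substring throws StringIndexOutOfBoundsException when fewer than
        // four characters remain, e.g. for "" or "   "
        return Integer.parseInt(trimmed.substring(0, 4));
    }

    // Wrapper corresponding to the extra catch: treat an unparseable date as
    // absent instead of letting the unchecked exception abort the parse.
    static Optional<Integer> safeYear(String date) {
        try {
            return Optional.of(parseYear(date));
        } catch (StringIndexOutOfBoundsException | NumberFormatException e) {
            // Invalid date format, just ignore
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(safeYear("2014-01-17T11:57:26-0500"));
        System.out.println(safeYear("   "));
    }
}
```

Catching at the call site keeps the workaround local and easy to delete once the parser itself rejects blank input.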
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893380#comment-13893380 ] Tim Allison commented on TIKA-1232: Interesting. Thank you, [~johanvanderknijff] and [~anjackson]. I personally like Extended-Content-Type, but following (http://wiki.apache.org/tika/MetadataRoadmap), is there someone more familiar with Dublin Core and/or XMP who could recommend appropriate tags? Many apologies if either one of those recommends Extended-Content-Type :).
[jira] [Comment Edited] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893380#comment-13893380 ] Tim Allison edited comment on TIKA-1232 at 2/6/14 2:31 PM: Interesting. Thank you, [~johanvanderknijff] and [~anjackson]. I personally like Extended-Content-Type, but following (http://wiki.apache.org/tika/MetadataRoadmap), is there someone more familiar (than I am) with Dublin Core and/or XMP who could recommend appropriate tags? Many apologies if either one of those recommends Extended-Content-Type :).
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893426#comment-13893426 ] Tim Allison commented on TIKA-1232: [~anjackson], y, I'd like to add your code if others agree that it would be useful. No need for a formal patch. I'll take your github code nearly directly. Two items: 1) Would you be interested in contributing your extension-level extraction code to PDFBox if it doesn't currently exist there (I haven't checked, but I assume you wouldn't reinvent the wheel)? I think that would be more at home within PDFBox. 2) How much testing have you done for potential exceptions thrown by PDFBox on pdfs in the wild when grabbing this new metadata (cf. null pointer checks around date parsing in current metadata code and TIKA-1226, TIKA-1232, TIKA-1233)? Thank you, again.
[jira] [Updated] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1233: Description: PDFBox's date parser can throw a StringIndexOutOfBoundsException if a date string for parsing is empty or contains only spaces. A few of my test pdfs have this feature. Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from causing problems in Tika:
{noformat}
@@ -171,6 +171,9 @@
             addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
         try {
             Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
             addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
{noformat}
PDFBox can throw StringIndexOutOfBoundsException on some dates -- Key: TIKA-1233 URL: https://issues.apache.org/jira/browse/TIKA-1233 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Tim Allison Priority: Trivial Labels: easyfix Fix For: 1.6