[jira] [Created] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-05-23 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1124: - Summary: Nested documents not extracted if a PDF file is in the chain Key: TIKA-1124 URL: https://issues.apache.org/jira/browse/TIKA-1124 Project: Tika Issue

[jira] [Updated] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-05-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1124: -- Attachment: pdf_attachment_issues.zip outer.docx contains the attached.pdf, which itself contains an

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676495#comment-13676495 ] Tim Allison commented on TIKA-1130: --- I've submitted a patch to POI for this

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677957#comment-13677957 ] Tim Allison commented on TIKA-1130: --- I'll try to submit the Tika portion of the POI-54849

[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682628#comment-13682628 ] Tim Allison commented on TIKA-1132: --- Tika gui took longer than I was willing to wait,

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692110#comment-13692110 ] Tim Allison commented on TIKA-1130: --- Nick, I think I have to make modifications to

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692517#comment-13692517 ] Tim Allison commented on TIKA-1130: --- Maven proxy setting in my settings.xml file is

[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693068#comment-13693068 ] Tim Allison commented on TIKA-973: -- Will submit patch and tests by end of the week.

[jira] [Updated] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1130: -- Attachment: TIKA-1130.patch Ray's initial test restored after POI-55142 was committed. Thank you,

[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-973: - Attachment: TIKA-973-patch.tar.gz Patch attached. Dumps contents of pdf forms at end of document.

[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774 ] Tim Allison commented on TIKA-973: -- Agree on both. Also would appreciate feedback on what

[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-973: - Attachment: i-9_screenshot.png Screenshot attached. Thanks again to:

[jira] [Created] (TIKA-1139) Modify Tika-1129 to test against a local file

2013-07-02 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1139: - Summary: Modify Tika-1129 to test against a local file Key: TIKA-1139 URL: https://issues.apache.org/jira/browse/TIKA-1139 Project: Tika Issue Type: Improvement

[jira] [Updated] (TIKA-1139) Modify Tika-1129 to test against a local file

2013-07-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1139: -- Attachment: TIKA-1139.patch.tar.gz Patch attached. Modify Tika-1129 to test against a

[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-07-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-973: - Attachment: TIKA-973.patch.tar.gz Middle-road change made. The alternate name is an attribute and partial

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-07-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698009#comment-13698009 ] Tim Allison commented on TIKA-1130: --- That was fast. Thank you! .docx

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-07-10 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704699#comment-13704699 ] Tim Allison commented on TIKA-1130: --- Haven't had a chance to build from trunk today, but

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-07-10 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704711#comment-13704711 ] Tim Allison commented on TIKA-1130: --- Tested with freshly built trunk, and the text looks

[jira] [Created] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1150: - Summary: Extract text from textbox in XLSX Key: TIKA-1150 URL: https://issues.apache.org/jira/browse/TIKA-1150 Project: Tika Issue Type: New Feature

[jira] [Updated] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1150: -- Attachment: testEXCEL_textbox.xlsx Simple file that shows issue. Extract text from

[jira] [Commented] (TIKA-1150) Extract text from textbox in XLSX

2013-07-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716429#comment-13716429 ] Tim Allison commented on TIKA-1150: --- Duplicate of

[jira] [Closed] (TIKA-1150) Extract text from textbox in XLSX

2013-07-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1150. - Resolution: Duplicate Extract text from textbox in XLSX -

[jira] [Commented] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-07-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716432#comment-13716432 ] Tim Allison commented on TIKA-1100: --- Waiting for improvements in POI-55292. Will make

[jira] [Updated] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-07-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1100: -- Attachment: testEXCEL_textbox.xlsx Simple example file attached for now. Will fill out with test cases

[jira] [Updated] (TIKA-792) NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-792: - Attachment: test10.docx Example document that triggers no such method exceptions for: CTMarkupRangeImpl,

[jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729781#comment-13729781 ] Tim Allison commented on TIKA-1124: --- If anyone has a chance to look into this, I'd

[jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729903#comment-13729903 ] Tim Allison commented on TIKA-1124: --- Ok, I think I figured this out... AbstractOOXML

[jira] [Updated] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1124: -- Attachment: TIKA-1124.patch Chose to move embedded file code into PDF2XHTML. This allows the proper

[jira] [Closed] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-08 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1124. - Resolution: Fixed Fix Version/s: 1.5 Added tests (thanks to Nick's advice to use model of

[jira] [Commented] (TIKA-792) NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-08-08 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733804#comment-13733804 ] Tim Allison commented on TIKA-792: -- Committed in POI. Once POI3.9beta2 is released, I'll

[jira] [Commented] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version

2013-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736875#comment-13736875 ] Tim Allison commented on TIKA-1153: --- Fellow Tika committers, I made this change locally

[jira] [Updated] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1001: -- Attachment: TIKA-1001v1.tar.gz This is a draft that simplifies the extraction of the charset attribute

[jira] [Commented] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-14 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740580#comment-13740580 ] Tim Allison commented on TIKA-1001: --- Fixed as of r1514126. Thank you for submitting this

[jira] [Commented] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version

2013-08-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741785#comment-13741785 ] Tim Allison commented on TIKA-1153: --- Committed as of r1514551. Upgrade

[jira] [Closed] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-16 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1001. - tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

[jira] [Closed] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version

2013-08-16 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1153. - Resolution: Fixed Upgrade pdfbox to latest 1.8.2 version --

[jira] [Resolved] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-16 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1001. --- Resolution: Fixed tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6

[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-08-16 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742129#comment-13742129 ] Tim Allison commented on TIKA-1162: --- Would you be willing to attach a document/test case

[jira] [Commented] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-16 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742266#comment-13742266 ] Tim Allison commented on TIKA-1001: --- David, Thank you for submitting this. I fixed the

[jira] [Reopened] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-19 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1132: --- Assignee: Tim Allison Will add test case in Tika. Parsing some XLS documents

[jira] [Created] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1173: - Summary: Upgrade to POI-3.10-beta2 Key: TIKA-1173 URL: https://issues.apache.org/jira/browse/TIKA-1173 Project: Tika Issue Type: Improvement Reporter:

[jira] [Resolved] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1173. --- Resolution: Fixed Upgrade to POI-3.10-beta2 - Key:

[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772116#comment-13772116 ] Tim Allison edited comment on TIKA-1132 at 9/20/13 5:35 PM: Any

[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772116#comment-13772116 ] Tim Allison edited comment on TIKA-1132 at 9/20/13 5:36 PM: Any

[jira] [Commented] (TIKA-792) NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-09-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773212#comment-13773212 ] Tim Allison commented on TIKA-792: -- This is now fixed by TIKA-1173. Can anyone recommend

[jira] [Commented] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-09-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778801#comment-13778801 ] Tim Allison commented on TIKA-1100: --- Updated XSSFExcelExtractorDecorator and added test

[jira] [Resolved] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-09-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1100. --- Resolution: Fixed Fix Version/s: 1.5 r1526498 cannot extract text in

[jira] [Reopened] (TIKA-792) NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-09-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-792: -- added test that catches stderr. r1526570. reopening just to record this.

[jira] [Resolved] (TIKA-792) NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-09-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-792. -- Resolution: Fixed NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType,

[jira] [Resolved] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1132. --- Resolution: Fixed Resolved with upgrade to poi-3.10-beta2. Could use help getting jUnit's timeout to

[jira] [Resolved] (TIKA-1076) Upgrade to Apache POI 3.9

2013-09-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1076. --- Resolution: Fixed Added some code similar to the fix to POI-54722 to HSLFExtractor. Uncommented old

[jira] [Resolved] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2013-09-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-817. -- Resolution: Fixed As mentioned above, this was fixed a while ago. I added test documents from

[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-09-30 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922 ] Tim Allison commented on TIKA-1162: --- Dear Colleague, I'm on paternity leave. Will be

[jira] [Commented] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2013-11-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811201#comment-13811201 ] Tim Allison commented on TIKA-817: -- Thank you! (PPT/PPTX) Missing date/time in text

[jira] [Resolved] (TIKA-1200) Upgrade pdfbox 1.8.3

2013-12-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1200. --- Resolution: Fixed Fixed in r1547037. Waiting for Jenkins to pick up change to confirm. Thank you!

[jira] [Assigned] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1201: - Assignee: Tim Allison Add possibility for switching to pdfbox NonSequentialPDFParser

[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1201: -- Attachment: TIKA-1201.patch Trivial patch Add possibility for switching to pdfbox

[jira] [Resolved] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1201. --- Resolution: Fixed Fix Version/s: 1.5 Basic parameter-based capability added in r1547250. User

[jira] [Created] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-02 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1202: - Summary: Refactor PDFParser to enable easier parameter setting Key: TIKA-1202 URL: https://issues.apache.org/jira/browse/TIKA-1202 Project: Tika Issue Type:

[jira] [Updated] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1202: -- Attachment: TIKA-1202.patch Would appreciate community feedback on this before I commit it (December

[jira] [Created] (TIKA-1203) Some metadata not extracted from PDF files when NonSequentialPDFParser is used

2013-12-03 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1203: - Summary: Some metadata not extracted from PDF files when NonSequentialPDFParser is used Key: TIKA-1203 URL: https://issues.apache.org/jira/browse/TIKA-1203 Project: Tika

[jira] [Comment Edited] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837169#comment-13837169 ] Tim Allison edited comment on TIKA-1201 at 12/3/13 4:25 PM:

[jira] [Commented] (TIKA-1199) Tika extracts weird signs instead of text

2013-12-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838856#comment-13838856 ] Tim Allison commented on TIKA-1199: --- Doh! Duplicated Marc's PDFBOX-1783. Sorry about

[jira] [Resolved] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1202. --- Resolution: Fixed Fix Version/s: 1.5 Committed in r1548700. Thank you, Mike and Hong-Thai for

[jira] [Reopened] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1202: --- Small bug in using default vs config. Refactor PDFParser to enable easier parameter setting

[jira] [Resolved] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1202. --- Resolution: Fixed r1549646 Refactor PDFParser to enable easier parameter setting

[jira] [Created] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1205: - Summary: Allow PDFParser to fallback to other parser if there is an exception Key: TIKA-1205 URL: https://issues.apache.org/jira/browse/TIKA-1205 Project: Tika

[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845429#comment-13845429 ] Tim Allison commented on TIKA-1205: --- Thank you for your feedback! TIKA-456 is the

[jira] [Reopened] (TIKA-973) PDF form data isn't included in extracted content.

2013-12-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-973: -- Assignee: Tim Allison In hindsight, would prefer to use test documents that are unequivocally

[jira] [Commented] (TIKA-1212) Recursive Extraction of Archive File

2013-12-19 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852938#comment-13852938 ] Tim Allison commented on TIKA-1212: --- On first issue: do you mean that you'd like to have

[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2013-12-19 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1212: -- Attachment: abc.zip Does this test file meet your description? Recursive Extraction of Archive File

[jira] [Updated] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1205: -- Due Date: 17/Jan/14 (was: 20/Dec/13) Allow PDFParser to fallback to other parser if there is an

[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864393#comment-13864393 ] Tim Allison commented on TIKA-1216: --- Give this a shot:

[jira] [Resolved] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1216. --- Resolution: Fixed Fix Version/s: 1.5 Following reporter's comment, this looks to be fixed in

[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866916#comment-13866916 ] Tim Allison commented on TIKA-1216: --- Agreed. I didn't think this was a duplicate. It is

[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869528#comment-13869528 ] Tim Allison commented on TIKA-1215: --- [~thaichat04] thank you for sending a clean patch.

[jira] [Assigned] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1226: - Assignee: Tim Allison PDFTextStripper fails while getting data of PDF form fields of type

[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880130#comment-13880130 ] Tim Allison commented on TIKA-1226: --- Eric, Thank you for reporting this. I'll make the

[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880273#comment-13880273 ] Tim Allison commented on TIKA-1226: --- How about we grab the name? {noformat}

[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881383#comment-13881383 ] Tim Allison commented on TIKA-1226: --- Thank you for the test file. I'll use that in the

[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881383#comment-13881383 ] Tim Allison edited comment on TIKA-1226 at 1/24/14 8:22 PM:

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ] Tim Allison commented on TIKA-1228: --- I won't have time to fix this for a week or so, but

[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ] Tim Allison edited comment on TIKA-1228 at 2/3/14 6:09 PM: --- I

[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ] Tim Allison edited comment on TIKA-1228 at 2/3/14 6:11 PM: --- I

[jira] [Resolved] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1228. --- Resolution: Fixed Fix Version/s: 1.5 Fixed in r1564042. Thank you, [~agi20dla], for reporting

[jira] [Issue Comment Deleted] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1228: -- Comment: was deleted (was: I won't have time to fix this for a week or so, but, I'll take this unless

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890605#comment-13890605 ] Tim Allison commented on TIKA-1228: --- Not sure I understand. Is this the snippet that you

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890610#comment-13890610 ] Tim Allison commented on TIKA-1228: --- Y. That's the point of open source. :) Enjoy! Now

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890613#comment-13890613 ] Tim Allison commented on TIKA-1228: --- Ok, to confirm, the PDNameTreeNode class cast

[jira] [Created] (TIKA-1230) Update PDFBox to v1.8.4

2014-02-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1230: - Summary: Update PDFBox to v1.8.4 Key: TIKA-1230 URL: https://issues.apache.org/jira/browse/TIKA-1230 Project: Tika Issue Type: Improvement Affects Versions:

[jira] [Resolved] (TIKA-1230) Update PDFBox to v1.8.4

2014-02-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1230. --- Resolution: Fixed r1564335 Update PDFBox to v1.8.4 --- Key:

[jira] [Created] (TIKA-1231) Safely handle null embedded files in PDFs

2014-02-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1231: - Summary: Safely handle null embedded files in PDFs Key: TIKA-1231 URL: https://issues.apache.org/jira/browse/TIKA-1231 Project: Tika Issue Type: Bug

[jira] [Assigned] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1232: - Assignee: Tim Allison Add PDF version to PDFParser output ---

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892146#comment-13892146 ] Tim Allison commented on TIKA-1232: --- How about Application-Version to follow the

[jira] [Created] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-02-05 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1233: - Summary: PDFBox can throw StringIndexOutOfBoundsException on some dates Key: TIKA-1233 URL: https://issues.apache.org/jira/browse/TIKA-1233 Project: Tika Issue

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893380#comment-13893380 ] Tim Allison commented on TIKA-1232: --- Interesting. Thank you, [~johanvanderknijff] and

[jira] [Comment Edited] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893380#comment-13893380 ] Tim Allison edited comment on TIKA-1232 at 2/6/14 2:31 PM: ---

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893426#comment-13893426 ] Tim Allison commented on TIKA-1232: --- [~anjackson], y, I'd like to add your code if others

[jira] [Updated] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-02-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1233: -- Description: PDFBox's date parser can throw a StringIndexOutOfBoundsException if a date string for
