[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency

2015-07-07 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617457#comment-14617457 ] Michael McCandless commented on TIKA-1675: -- bq. If the project is dead and not

[jira] [Resolved] (TIKA-1628) ExternalParser.check should return false if it hits SecurityException

2015-05-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1628. -- Resolution: Pending Closed Thanks [~gagravarr] and [~thetaphi] ExternalParser.check

[jira] [Commented] (TIKA-1544) empty lines are not preserved

2015-02-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309956#comment-14309956 ] Michael McCandless commented on TIKA-1544: -- bq. Michael McCandless, is the fix

[jira] [Commented] (TIKA-1544) empty lines are not preserved

2015-02-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310014#comment-14310014 ] Michael McCandless commented on TIKA-1544: -- bq. I have hesitation about changing

[jira] [Commented] (TIKA-1305) New list processing changes appear to be causing RTFParser exception

2014-05-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013647#comment-14013647 ] Michael McCandless commented on TIKA-1305: -- Net/net the RTF is corrupted right?

[jira] [Resolved] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-14 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1078. -- Resolution: Fixed Thanks Stefano, I made one small change (added generics:

[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869205#comment-13869205 ] Michael McCandless commented on TIKA-1078: -- Thanks Stefano! Can you fix the

[jira] [Commented] (TIKA-1211) OpenDocument (ODF) parser produces multiple startDocument() events

2013-12-17 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850567#comment-13850567 ] Michael McCandless commented on TIKA-1211: -- +1 to fix XHTMLContentHandler to allow

[jira] [Resolved] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1192. -- Resolution: Fixed Fix Version/s: 1.5 Thanks Dave, I just committed this.

[jira] [Assigned] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-08 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1192: Assignee: Michael McCandless ArrayIndexOutOfBoundsException: 9 parsing RTF

[jira] [Commented] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-08 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817472#comment-13817472 ] Michael McCandless commented on TIKA-1192: -- bq. Yes, when that fragment is part of

[jira] [Commented] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-08 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817512#comment-13817512 ] Michael McCandless commented on TIKA-1192: -- Thanks Dave.

[jira] [Commented] (TIKA-1181) RTFParser not keeping HTML font colors and underscore tags.

2013-10-07 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788163#comment-13788163 ] Michael McCandless commented on TIKA-1181: -- The RTFParser currently only carries

[jira] [Commented] (TIKA-1143) Fails to parse some PPT file

2013-07-03 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698837#comment-13698837 ] Michael McCandless commented on TIKA-1143: -- Are you able to extract text from the

[jira] [Assigned] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1128: Assignee: Michael McCandless Replace line tabulation with line break

[jira] [Updated] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1128: - Fix Version/s: 1.5 Replace line tabulation with line break

[jira] [Commented] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670252#comment-13670252 ] Michael McCandless commented on TIKA-1128: -- Thanks Privezentsev. Do you have an

[jira] [Resolved] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1128. -- Resolution: Fixed Fix Version/s: (was: 1.5) 1.4 Thanks

[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser

2013-03-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615793#comment-13615793 ] Michael McCandless commented on TIKA-1098: -- Hmm PDFBox is hitting that exception

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-23 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585082#comment-13585082 ] Michael McCandless commented on TIKA-1074: -- bq. My app needs to extract text even

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584176#comment-13584176 ] Michael McCandless commented on TIKA-1074: -- Thanks Jukka. InterruptedException is

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584249#comment-13584249 ] Michael McCandless commented on TIKA-1074: -- {quote} bq. InterruptedException is

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584362#comment-13584362 ] Michael McCandless commented on TIKA-1074: -- OK I'll remove the future proofing.

[jira] [Resolved] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-21 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1074. -- Resolution: Fixed Extraction should continue if an exception is hit visiting an

[jira] [Resolved] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1074. -- Resolution: Fixed Extraction should continue if an exception is hit visiting an

[jira] [Updated] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1074: - Attachment: TIKA-1074.patch Patch, catching Exception not Throwable, and restoring the

[jira] [Reopened] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened TIKA-1074: -- Extraction should continue if an exception is hit visiting an embedded document

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582481#comment-13582481 ] Michael McCandless commented on TIKA-1074: -- Thanks Uwe, I'll change to catching

[jira] [Assigned] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1074: Assignee: Michael McCandless Extraction should continue if an exception is hit

[jira] [Updated] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1074: - Attachment: TIKA-1074.patch Patch, just logging a warning and continuing, if we hit the

[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2013-02-07 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573492#comment-13573492 ] Michael McCandless commented on TIKA-369: - The language-detection lib is now in

[jira] [Resolved] (TIKA-1053) Upgrade Tika Parsers to use ASM 4.x

2013-02-07 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1053. -- Resolution: Fixed Fix Version/s: 1.4 Thanks Uwe. Upgrade Tika

[jira] [Created] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2013-02-05 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1078: Summary: TikaCLI: invalid characters in embedded document name causes FNFE when trying to save Key: TIKA-1078 URL: https://issues.apache.org/jira/browse/TIKA-1078

[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2013-02-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1078: - Attachment: T-DS_Excel2003-PPT2003_1.xls TikaCLI: invalid characters in embedded

[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2013-02-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1079: - Attachment: guide_to_daips_(id_3152_ver_1.0.0).doc Word document hits AIOOBE in

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571288#comment-13571288 ] Michael McCandless commented on TIKA-1074: -- TIKA-1079 is another example where if

[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570208#comment-13570208 ] Michael McCandless commented on TIKA-1072: -- OK I did some digging on this. The

[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570305#comment-13570305 ] Michael McCandless commented on TIKA-1072: -- Thanks Nick, I'll try asking on

[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570308#comment-13570308 ] Michael McCandless commented on TIKA-1072: -- OK I opened TIKA-1074; this issue will

[jira] [Created] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-04 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1074: Summary: Extraction should continue if an exception is hit visiting an embedded document Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074

[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1072: - Attachment: Ole10NativeEntry.bin I'm attaching the 40 byte \U0001Ole10Native entry (40

[jira] [Created] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-03 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1072: Summary: AIOOBE when handling embedded document in .doc file Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue

[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-03 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1072: - Attachment: 20-Force-on-a-current-S00.doc AIOOBE when handling embedded document in

[jira] [Created] (TIKA-1067) Tika extracts non-existent asterisks (*) from .ppt files

2013-01-29 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1067: Summary: Tika extracts non-existent asterisks (*) from .ppt files Key: TIKA-1067 URL: https://issues.apache.org/jira/browse/TIKA-1067 Project: Tika

[jira] [Commented] (TIKA-1062) Add list detection to RTFParser

2013-01-24 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562060#comment-13562060 ] Michael McCandless commented on TIKA-1062: -- Hi Axel, I don't actually know that

[jira] [Commented] (TIKA-1062) Add list detection to RTFParser

2013-01-23 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560927#comment-13560927 ] Michael McCandless commented on TIKA-1062: -- Should the ListDescriptor list =

[jira] [Resolved] (TIKA-1048) XMLParser should add whitespace between elements

2013-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1048. -- Resolution: Fixed XMLParser should add whitespace between elements

[jira] [Created] (TIKA-1048) XMLParser should add whitespace between elements

2012-12-20 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1048: Summary: XMLParser should add whitespace between elements Key: TIKA-1048 URL: https://issues.apache.org/jira/browse/TIKA-1048 Project: Tika Issue

[jira] [Updated] (TIKA-1048) XMLParser should add whitespace between elements

2012-12-20 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1048: - Attachment: TIKA-1048.patch Patch w/ failing test ... I'm not sure where/how to best fix

[jira] [Resolved] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files

2012-12-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1031. -- Resolution: Fixed TikaCLI doesn't create sub-dirs when extracting Zip files

[jira] [Resolved] (TIKA-1032) Powerpoint (.pptx) can have duplicate embedded ids

2012-12-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1032. -- Resolution: Fixed Fix Version/s: 1.3 Powerpoint (.pptx) can have duplicate

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2012-12-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508010#comment-13508010 ] Michael McCandless commented on TIKA-712: - I committed the patch; I'll leave this

[jira] [Resolved] (TIKA-1035) PDF bookmark text is not extracted

2012-12-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1035. -- Resolution: Fixed PDF bookmark text is not extracted

[jira] [Resolved] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry

2012-12-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1036. -- Resolution: Fixed Fix Version/s: 1.3 ZIP parsing doesn't leave placeholders

[jira] [Created] (TIKA-1035) PDF bookmark text is not extracted

2012-11-30 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1035: Summary: PDF bookmark text is not extracted Key: TIKA-1035 URL: https://issues.apache.org/jira/browse/TIKA-1035 Project: Tika Issue Type: Bug

[jira] [Updated] (TIKA-1035) PDF bookmark text is not extracted

2012-11-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1035: - Attachment: TIKA-1035.patch Patch w/ test ... PDF bookmark text is not

[jira] [Created] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry

2012-11-30 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1036: Summary: ZIP parsing doesn't leave placeholders for each package entry Key: TIKA-1036 URL: https://issues.apache.org/jira/browse/TIKA-1036 Project: Tika

[jira] [Updated] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry

2012-11-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1036: - Attachment: TIKA-1036.patch Patch w/ test ... ZIP parsing doesn't leave

[jira] [Updated] (TIKA-712) Master slide text isn't extracted

2012-11-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-712: Attachment: TIKA-712.patch I think I found a committable workaround (patch) for including

[jira] [Created] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1033: Summary: Tika doesn't parse embedded OLE Chart/Graph objects Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue

[jira] [Updated] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1033: - Attachment: emb.ppt Tika doesn't parse embedded OLE Chart/Graph objects

[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504563#comment-13504563 ] Michael McCandless commented on TIKA-1033: -- Here's the full stack trace when I

[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504668#comment-13504668 ] Michael McCandless commented on TIKA-1033: -- I asked the person who created this

[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504673#comment-13504673 ] Michael McCandless commented on TIKA-1033: -- bq. The raw chart object looks to

[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504703#comment-13504703 ] Michael McCandless commented on TIKA-1033: -- Interesting: with PowerPoint 2007,

[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504726#comment-13504726 ] Michael McCandless commented on TIKA-1033: -- OK I opened

[jira] [Created] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files

2012-11-26 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1031: Summary: TikaCLI doesn't create sub-dirs when extracting Zip files Key: TIKA-1031 URL: https://issues.apache.org/jira/browse/TIKA-1031 Project: Tika

[jira] [Updated] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files

2012-11-26 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1031: - Attachment: TIKA-1031.patch Patch w/ test fix. TikaCLI doesn't create

[jira] [Resolved] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-18 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1024. -- Resolution: Fixed An MP3 with an UTF-16 ID3 tag containing only the BOM should

[jira] [Resolved] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded

2012-11-18 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1025. -- Resolution: Fixed Fix Version/s: 1.3 Powerpoint (.ppt) parser doesn't leave

[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2012-11-18 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499838#comment-13499838 ] Michael McCandless commented on TIKA-369: - +1 to cut over to

[jira] [Created] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-13 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1024: Summary: An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag Key: TIKA-1024 URL:

[jira] [Updated] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-13 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1024: - Attachment: testNakedUTF16BOM.mp3 An MP3 with an UTF-16 ID3 tag containing only the

[jira] [Updated] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-13 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1024: - Attachment: TIKA-1024.patch Patch w/ failing test and fix. An MP3 with

[jira] [Created] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded

2012-11-13 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1025: Summary: Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded Key: TIKA-1025 URL: https://issues.apache.org/jira/browse/TIKA-1025

[jira] [Updated] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded

2012-11-13 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1025: - Attachment: TIKA-1025.patch Patch w/ test fix. Powerpoint (.ppt)

[jira] [Resolved] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1019. -- Resolution: Fixed Document links in Word documents don't leave a placeholder

[jira] [Resolved] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1019. -- Resolution: Fixed Document links in Word documents don't leave a placeholder

[jira] [Reopened] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened TIKA-1019: -- I reverted my commit for now ... the test file was way too large ...

[jira] [Assigned] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-07 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1019: Assignee: Michael McCandless Document links in Word documents don't leave a

[jira] [Updated] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-07 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1019: - Attachment: testDocumentLink.doc TIKA-1019.patch Patch w/ test and fix.

[jira] [Resolved] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata

2012-10-31 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1015. -- Resolution: Fixed Word (.doc) embedded files don't set relationship ID in the

[jira] [Reopened] (TIKA-953) Tika failed to recognize non-ustar Tar file?

2012-10-31 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened TIKA-953: - I have another non-ustar tar file that's incorrectly detected as application/octet-stream

[jira] [Updated] (TIKA-953) Tika failed to recognize non-ustar Tar file?

2012-10-31 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-953: Attachment: test2.tar file reports this as a tar archive, but: {noformat} cat test2.tar |

[jira] [Created] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata

2012-10-30 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1015: Summary: Word (.doc) embedded files don't set relationship ID in the Metadata Key: TIKA-1015 URL: https://issues.apache.org/jira/browse/TIKA-1015 Project:

[jira] [Updated] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata

2012-10-30 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1015: - Attachment: TIKA-1015.patch Simple patch, but my only slight hesitation is I added an

[jira] [Resolved] (TIKA-1011) Exception (Null charset name) processing .mhtml file

2012-10-26 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1011. -- Resolution: Fixed Exception (Null charset name) processing .mhtml file

[jira] [Created] (TIKA-1011) Exception (Null charset name) processing .mhtml file

2012-10-25 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1011: Summary: Exception (Null charset name) processing .mhtml file Key: TIKA-1011 URL: https://issues.apache.org/jira/browse/TIKA-1011 Project: Tika

[jira] [Updated] (TIKA-1011) Exception (Null charset name) processing .mhtml file

2012-10-25 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1011: - Attachment: TIKA-1011.patch Exception (Null charset name) processing .mhtml file

[jira] [Created] (TIKA-1010) Embedded documents in RTF are not extracted

2012-10-19 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1010: Summary: Embedded documents in RTF are not extracted Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type:

[jira] [Updated] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-13 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1005: - Attachment: TIKA-1005.patch Patch w/ test ... In Microsoft Office Word

[jira] [Assigned] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator

2012-10-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1006: Assignee: Michael McCandless NPE in extractParagraph (styleClass) in

[jira] [Commented] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator

2012-10-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474947#comment-13474947 ] Michael McCandless commented on TIKA-1006: -- Thanks Sture, that patch looks good!

[jira] [Assigned] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1005: Assignee: Michael McCandless In Microsoft Office Word 2010 documents, text

[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474958#comment-13474958 ] Michael McCandless commented on TIKA-1005: -- Thanks David, I'll dig!

[jira] [Resolved] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator

2012-10-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1006. -- Resolution: Fixed Fix Version/s: 1.3 Thanks Sture, I just committed the test

[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-11 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474250#comment-13474250 ] Michael McCandless commented on TIKA-1005: -- Could you attach an example showing

[jira] [Resolved] (TIKA-997) Leave a placeholder when documents are embedded in .pptx documents

2012-09-28 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-997. - Resolution: Fixed Fix Version/s: 1.3 Leave a placeholder when documents are

[jira] [Updated] (TIKA-997) Leave a placeholder when documents are embedded in .pptx documents

2012-09-26 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-997: Attachment: TIKA-997.patch Patch. It's not perfect, because the placeholder will appear at

[jira] [Created] (TIKA-999) RTF Parser doesn't extract page/word/character count metadata

2012-09-26 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-999: --- Summary: RTF Parser doesn't extract page/word/character count metadata Key: TIKA-999 URL: https://issues.apache.org/jira/browse/TIKA-999 Project: Tika
