[jira] [Commented] (TIKA-1683) Add encryption support to Jackcess parser

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628585#comment-14628585 ] Tim Allison commented on TIKA-1683: --- At this point we have a version clash on

[jira] [Created] (TIKA-1682) Add formatting for values in Jackcess

2015-07-15 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1682: - Summary: Add formatting for values in Jackcess Key: TIKA-1682 URL: https://issues.apache.org/jira/browse/TIKA-1682 Project: Tika Issue Type: Improvement

[jira] [Commented] (TIKA-1588) Upgrade to PDFBox 1.8.10 when available

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628970#comment-14628970 ] Tim Allison commented on TIKA-1588: --- Interesting. This must be another case of the

[jira] [Created] (TIKA-1684) Clean up metadata properties in Jackcess parser

2015-07-15 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1684: - Summary: Clean up metadata properties in Jackcess parser Key: TIKA-1684 URL: https://issues.apache.org/jira/browse/TIKA-1684 Project: Tika Issue Type: Improvement

[jira] [Commented] (TIKA-1671) Wrapped lines in PDF files not processed correctly

2015-07-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631204#comment-14631204 ] Tim Allison commented on TIKA-1671: --- And a few other points... Encoding instructions

[jira] [Commented] (TIKA-1671) Wrapped lines in PDF files not processed correctly

2015-07-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631200#comment-14631200 ] Tim Allison commented on TIKA-1671: --- I think this is an issue with PDFs in general, not

[jira] [Assigned] (TIKA-1690) Inconsistent (buggy) behavior when using tika-server

2015-07-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1690: - Assignee: Tim Allison Inconsistent (buggy) behavior when using tika-server

[jira] [Commented] (TIKA-1690) Inconsistent (buggy) behavior when using tika-server

2015-07-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631209#comment-14631209 ] Tim Allison commented on TIKA-1690: --- Thank you for raising this. As I mentioned on the

[jira] [Commented] (TIKA-1689) Parser sort order change in TIKA-1517 breaks parser override capability

2015-07-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631212#comment-14631212 ] Tim Allison commented on TIKA-1689: --- [~dwarren], thank you for raising this.

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633429#comment-14633429 ] Tim Allison commented on TIKA-1678: --- [~tilman], y, that's taken from the xmp. As you

[jira] [Commented] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633433#comment-14633433 ] Tim Allison commented on TIKA-1238: --- [~rangma], Any chance you could share a test file?

[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633454#comment-14633454 ] Tim Allison edited comment on TIKA-1678 at 7/20/15 11:43 AM: -

[jira] [Commented] (TIKA-1690) Inconsistent (buggy) behavior when using tika-server

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633436#comment-14633436 ] Tim Allison commented on TIKA-1690: --- tmpFile? Do you mean the fileUrl? Sorry.

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633454#comment-14633454 ] Tim Allison commented on TIKA-1678: --- The good news is that with PDFBox 2.0, we get a

[jira] [Commented] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633511#comment-14633511 ] Tim Allison commented on TIKA-1238: --- Got it. For now, let's see if I can find some

[jira] [Commented] (TIKA-1690) Inconsistent (buggy) behavior when using tika-server

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633610#comment-14633610 ] Tim Allison commented on TIKA-1690: --- Is the problem {{is.available()}}? {noformat}

[jira] [Comment Edited] (TIKA-1690) Inconsistent (buggy) behavior when using tika-server

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633610#comment-14633610 ] Tim Allison edited comment on TIKA-1690 at 7/20/15 1:43 PM: Is

[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633643#comment-14633643 ] Tim Allison commented on TIKA-1285: --- Still hammering out some issues. If regression tests

[jira] [Updated] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1238: -- Attachment: (was: 873911_100_20061124_191408.msg) Update OutlookExtractor to handle codepage

[jira] [Commented] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633696#comment-14633696 ] Tim Allison commented on TIKA-1238: --- Probably not the best way to transfer a file... I

[jira] [Commented] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633786#comment-14633786 ] Tim Allison commented on TIKA-1238: --- That's up to the community, but I think we have

[jira] [Reopened] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1238: --- Doh. Reopening until we get the mods to POI and then the updated Tika code after the next POI release.

[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633986#comment-14633986 ] Tim Allison edited comment on TIKA-1678 at 7/20/15 7:38 PM:

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633986#comment-14633986 ] Tim Allison commented on TIKA-1678: --- That works perfectly. Thank you, [~tilman]! Now

[jira] [Comment Edited] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633487#comment-14633487 ] Tim Allison edited comment on TIKA-1238 at 7/20/15 3:34 PM: The

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633720#comment-14633720 ] Tim Allison commented on TIKA-1678: --- Very helpful! If we require that the string start

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634391#comment-14634391 ] Tim Allison commented on TIKA-1678: --- Slight modification of [~tilman]'s example added in

[jira] [Resolved] (TIKA-1683) Add encryption support to Jackcess parser

2015-07-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1683. --- Resolution: Fixed r1692100 Add encryption support to Jackcess parser

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627948#comment-14627948 ] Tim Allison commented on TIKA-1678: --- Shouldn't have taken me this long, but, isn't that

[jira] [Comment Edited] (TIKA-1588) Upgrade to PDFBox 1.8.10 when available

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628102#comment-14628102 ] Tim Allison edited comment on TIKA-1588 at 7/15/15 2:13 PM:

[jira] [Updated] (TIKA-1588) Upgrade to PDFBox 1.8.10 when available

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1588: -- Attachment: reports_1_8_9_vs_1_8_10.zip Current version of reports attached comparing PDFBox 1.8.9 vs

[jira] [Created] (TIKA-1681) Fix file opening in Jackcess to enable read only for v1997 files

2015-07-15 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1681: - Summary: Fix file opening in Jackcess to enable read only for v1997 files Key: TIKA-1681 URL: https://issues.apache.org/jira/browse/TIKA-1681 Project: Tika Issue

[jira] [Updated] (TIKA-1681) Fix file opening in Jackcess to enable read only for v1997 files

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1681: -- Description: We need to make a small modification in how we're opening mdb files with Jackcess to set

[jira] [Assigned] (TIKA-1681) Fix file opening in Jackcess to enable read only for v1997 files

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1681: - Assignee: Tim Allison Fix file opening in Jackcess to enable read only for v1997 files

[jira] [Commented] (TIKA-1680) Add configuration layer to configure, Parsers default configurable properties.

2015-07-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628162#comment-14628162 ] Tim Allison commented on TIKA-1680: --- If we implemented TIKA-1508, would that accomplish

[jira] [Created] (TIKA-1777) Regression in spacing around differently formatted runs in PPT

2015-10-21 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1777: - Summary: Regression in spacing around differently formatted runs in PPT Key: TIKA-1777 URL: https://issues.apache.org/jira/browse/TIKA-1777 Project: Tika Issue

[jira] [Comment Edited] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966888#comment-14966888 ] Tim Allison edited comment on TIKA-1707 at 10/21/15 2:41 PM: - [~kiwiwings], we

[jira] [Created] (TIKA-1778) Regression in spacing around differently formatted runs in PPT

2015-10-21 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1778: - Summary: Regression in spacing around differently formatted runs in PPT Key: TIKA-1778 URL: https://issues.apache.org/jira/browse/TIKA-1778 Project: Tika Issue

[jira] [Updated] (TIKA-1778) Regression in spacing around differently formatted runs in PPT

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1778: -- Attachment: 018452.ppt Short govdocs1 file that shows the issue. We used to get "et cut". We are now

[jira] [Comment Edited] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966888#comment-14966888 ] Tim Allison edited comment on TIKA-1707 at 10/21/15 2:43 PM: - [~kiwiwings], we

[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966888#comment-14966888 ] Tim Allison commented on TIKA-1707: --- [~kiwiwings], we found a regression in spacing around differently

[jira] [Commented] (TIKA-1781) Tika generates broken XML file

2015-10-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969994#comment-14969994 ] Tim Allison commented on TIKA-1781: --- Are you getting doubled xml when you run tika-app from the

[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969173#comment-14969173 ] Tim Allison commented on TIKA-1707: --- That was a bad idea. The issue was that the first run ended with

[jira] [Resolved] (TIKA-1778) Regression in spacing around differently formatted runs in PPT

2015-10-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1778. --- Resolution: Fixed Thanks to [~kiwiwings] for the patch, this is now fixed. I added the {{if

[jira] [Resolved] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1782. --- Resolution: Fixed r1710799. We should probably open a separate issue to handle the failed build in

[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971696#comment-14971696 ] Tim Allison commented on TIKA-1285: --- Finished comparison of ~100k docs:

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974462#comment-14974462 ] Tim Allison commented on TIKA-1782: --- Not seeing it on RHEL with 1.8.0_66 either. IIRC, Jenkins is

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974419#comment-14974419 ] Tim Allison commented on TIKA-1782: --- What OS and Java version? I'm not seeing problems with RHEL 6.5 and

[jira] [Updated] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1707: -- Attachment: 075166.ppt Example file > Upgrade to Apache POI 3.13 Beta 2 >

[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968266#comment-14968266 ] Tim Allison commented on TIKA-1707: --- Thank you! That fixed the vast majority of content diffs. There

[jira] [Comment Edited] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968268#comment-14968268 ] Tim Allison edited comment on TIKA-1707 at 10/22/15 12:43 AM: -- Example file...

[jira] [Created] (TIKA-1780) Not common regression in AIOOBE for some ppts

2015-10-21 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1780: - Summary: Not common regression in AIOOBE for some ppts Key: TIKA-1780 URL: https://issues.apache.org/jira/browse/TIKA-1780 Project: Tika Issue Type: Bug

[jira] [Updated] (TIKA-1780) Not common regression in AIOOBE for some ppts

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1780: -- Description: After upgrading to POI 3.13, we're getting some new AIOOBE. This is fairly rare,

[jira] [Commented] (TIKA-1780) Not common regression in AIOOBE for some ppts

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968299#comment-14968299 ] Tim Allison commented on TIKA-1780: --- Opened [bugzilla

[jira] [Updated] (TIKA-1780) Not common regression in AIOOBE for some ppts

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1780: -- Priority: Minor (was: Major) > Not common regression in AIOOBE for some ppts >

[jira] [Commented] (TIKA-1779) different outputs between cmd & srv version

2015-10-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968302#comment-14968302 ] Tim Allison commented on TIKA-1779: --- My guess: you typically only get Content-Length if the underlying

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976405#comment-14976405 ] Tim Allison commented on TIKA-1782: --- Y, I think so. The stacktrace seems to suggest a more profound

[jira] [Comment Edited] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976405#comment-14976405 ] Tim Allison edited comment on TIKA-1782 at 10/27/15 1:49 PM: - Y, I think so.

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978068#comment-14978068 ] Tim Allison commented on TIKA-1782: --- Hmmm...should I reopen this issue and revert? Do you have a

[jira] [Resolved] (TIKA-1512) WordParser fails on many Word files

2015-11-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1512. --- Resolution: Fixed Fix Version/s: 1.8 Sorry for not resolving this earlier. The fix is there,

[jira] [Updated] (TIKA-1512) WordParser fails on many Word files

2015-11-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1512: -- Affects Version/s: (was: 1.8) > WordParser fails on many Word files >

[jira] [Created] (TIKA-1795) RTFParser can double Metadata.CONTENT_TYPE entry in Metadata

2015-11-16 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1795: - Summary: RTFParser can double Metadata.CONTENT_TYPE entry in Metadata Key: TIKA-1795 URL: https://issues.apache.org/jira/browse/TIKA-1795 Project: Tika Issue

[jira] [Resolved] (TIKA-1795) RTFParser can double Metadata.CONTENT_TYPE entry in Metadata

2015-11-16 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1795. --- Resolution: Fixed Fix Version/s: 1.12 r1714617 > RTFParser can double Metadata.CONTENT_TYPE

[jira] [Commented] (TIKA-1784) Use of ThreadLocal in Tika causes memory leaks and warnings in Tomcat

2015-10-30 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982572#comment-14982572 ] Tim Allison commented on TIKA-1784: --- Thank you for opening this. This is caused by POI. Not sure if it

[jira] [Commented] (TIKA-1443) Add a junk text detector to Tika

2015-10-30 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982597#comment-14982597 ] Tim Allison commented on TIKA-1443: --- [~kkrugler], have you looked at how Optimaize has worked on garbled

[jira] [Commented] (TIKA-1443) Add a junk text detector to Tika

2015-10-30 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982615#comment-14982615 ] Tim Allison commented on TIKA-1443: --- Doh... so much for that idea... From Optimaize's

[jira] [Comment Edited] (TIKA-1443) Add a junk text detector to Tika

2015-10-30 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982615#comment-14982615 ] Tim Allison edited comment on TIKA-1443 at 10/30/15 2:21 PM: - Doh... so much

[jira] [Resolved] (TIKA-1786) Downgrade logging severity in FileResourceConsumer and fix handling of illegal xml characters

2015-11-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1786. --- Resolution: Fixed Fix Version/s: 1.12 r1712572 > Downgrade logging severity in

[jira] [Created] (TIKA-1786) Downgrade logging severity in FileResourceConsumer and fix handling of illegal xml characters

2015-11-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1786: - Summary: Downgrade logging severity in FileResourceConsumer and fix handling of illegal xml characters Key: TIKA-1786 URL: https://issues.apache.org/jira/browse/TIKA-1786

[jira] [Updated] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header

2015-11-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1788: -- Attachment: grep_content_disposition.zip May have some candidates...grep on our commoncrawl slice with

[jira] [Commented] (TIKA-1764) Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor

2015-10-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947941#comment-14947941 ] Tim Allison commented on TIKA-1764: --- On second thought, if you use the RecursiveParserWrapper, do you get

[jira] [Commented] (TIKA-1761) Error Parsing PPT (97-2003) files with password protection against modification which were created using Office 2013

2015-10-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947933#comment-14947933 ] Tim Allison commented on TIKA-1761: --- It looks like we have no support for encryption for doc or

[jira] [Resolved] (TIKA-1755) Make ppt and pptx paragraph/div breaks more consistent

2015-10-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1755. --- Resolution: Fixed r1707432 > Make ppt and pptx paragraph/div breaks more consistent >

[jira] [Commented] (TIKA-1776) tika stop converting at this pdf document

2015-10-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965053#comment-14965053 ] Tim Allison commented on TIKA-1776: --- I'm not able to reproduce this on Windows or RHEL with PDFBox's app

[jira] [Closed] (TIKA-1776) tika stop converting at this pdf document

2015-10-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1776. - Resolution: Not A Problem No problem. Let us know if you have any other surprises. > tika stop

[jira] [Commented] (TIKA-1736) Bouncy Castle version binary incompatibility

2015-10-08 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949556#comment-14949556 ] Tim Allison commented on TIKA-1736: --- 2.1.1 is available. Will commit upgrade shortly. > Bouncy Castle

[jira] [Resolved] (TIKA-1736) Bouncy Castle version binary incompatibility

2015-10-08 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1736. --- Resolution: Fixed r1707635. Thank you, again, James Ahlborn! > Bouncy Castle version binary

[jira] [Commented] (TIKA-1764) Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor

2015-10-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946775#comment-14946775 ] Tim Allison commented on TIKA-1764: --- Ha, I've been wanting to do this for a while. I'm not sure of the

[jira] [Created] (TIKA-1765) Some doc and docx store multiple authors as semi-colon delimited list

2015-10-07 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1765: - Summary: Some doc and docx store multiple authors as semi-colon delimited list Key: TIKA-1765 URL: https://issues.apache.org/jira/browse/TIKA-1765 Project: Tika

[jira] [Commented] (TIKA-1765) Some doc and docx store multiple authors as semi-colon delimited list

2015-10-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947361#comment-14947361 ] Tim Allison commented on TIKA-1765: --- Would anyone mind if I changed {{OfficeOpenXMLExtended.MANAGER}} to

[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954947#comment-14954947 ] Tim Allison commented on TIKA-1285: --- Y, that's the first thing on my todo list on our wrapper --

[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954860#comment-14954860 ] Tim Allison commented on TIKA-1285: --- Thank you for testing the dev wrapper and PDFBox 2.0, and thank you

[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950223#comment-14950223 ] Tim Allison commented on TIKA-1285: --- No problem at all...I still need to run against our batch as well.

[jira] [Comment Edited] (TIKA-1764) Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor

2015-10-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950277#comment-14950277 ] Tim Allison edited comment on TIKA-1764 at 10/9/15 11:57 AM: - Y, I completely

[jira] [Commented] (TIKA-1764) Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor

2015-10-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950277#comment-14950277 ] Tim Allison commented on TIKA-1764: --- Y, I completely agree that we all need to see when embedded

[jira] [Commented] (TIKA-1761) Error Parsing PPT (97-2003) files with password protection against modification which were created using Office 2013

2015-10-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950239#comment-14950239 ] Tim Allison commented on TIKA-1761: --- And the other question...if we add support for pw protection for doc

[jira] [Updated] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-07-10 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1285: -- Attachment: pdfbox_reports_2_0_0_20150709.zip First dump of stack traces in govdocs1 from the

[jira] [Commented] (TIKA-1674) Add example to show how to extract embedded files

2015-07-07 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616777#comment-14616777 ] Tim Allison commented on TIKA-1674: --- Help! With r1689690, I added an example of how to

[jira] [Resolved] (TIKA-1676) Fix logic error in batch driver that prevents correct restarting of child process

2015-07-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1676. --- Resolution: Fixed r1690090 Fix logic error in batch driver that prevents correct restarting of child

[jira] [Created] (TIKA-1676) Fix logic error in batch driver that prevents correct restarting of child process

2015-07-09 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1676: - Summary: Fix logic error in batch driver that prevents correct restarting of child process Key: TIKA-1676 URL: https://issues.apache.org/jira/browse/TIKA-1676 Project:

[jira] [Resolved] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2015-07-10 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1300. --- Resolution: Duplicate Fix Version/s: (was: 1.10) It looks like PDFBox 2.0.0 is coming soon.

[jira] [Commented] (TIKA-1716) Tika Server's recursive JSON output from /rmeta different than tika-app -J output

2015-08-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717293#comment-14717293 ] Tim Allison commented on TIKA-1716: --- How about: # Switch default handler type to xml #

[jira] [Created] (TIKA-1725) Fix language detection for /rmeta in tika-server

2015-08-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1725: - Summary: Fix language detection for /rmeta in tika-server Key: TIKA-1725 URL: https://issues.apache.org/jira/browse/TIKA-1725 Project: Tika Issue Type: Bug

[jira] [Comment Edited] (TIKA-1723) Integrate language-detector into Tika

2015-08-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14718494#comment-14718494 ] Tim Allison edited comment on TIKA-1723 at 8/28/15 12:46 PM: -

[jira] [Resolved] (TIKA-1716) Tika Server's recursive JSON output from /rmeta different than tika-app -J output

2015-08-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1716. --- Resolution: Fixed Assignee: Tim Allison (was: Chris A. Mattmann) r1698329 Let me know if there

[jira] [Comment Edited] (TIKA-1723) Integrate language-detector into Tika

2015-08-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14718494#comment-14718494 ] Tim Allison edited comment on TIKA-1723 at 8/28/15 12:56 PM: -

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-08-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14718494#comment-14718494 ] Tim Allison commented on TIKA-1723: --- I've only taken a brief look, but I think that

[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711801#comment-14711801 ] Tim Allison commented on TIKA-1607: --- [~rgauss], thank you for this demo code! I haven't

[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14713567#comment-14713567 ] Tim Allison commented on TIKA-1607: --- Not the same, but these two issues are related...how

[jira] [Commented] (TIKA-1657) Allow easier dumping of TikaConfig file from tika-core

2015-08-31 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723941#comment-14723941 ] Tim Allison commented on TIKA-1657: --- I looked at this a bit today, I'm now backing off to putting this
