[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364710#comment-14364710 ] Tilman Hausherr commented on TIKA-1575: --- Could you attach the TIKA output you get

[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2015-03-17 Thread Ann Burgess (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ann Burgess updated TIKA-1577: -- Description: A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts:

[GitHub] tika pull request: TIKA-1365: Lower priority for XML starting with...

2015-03-17 Thread mkr
GitHub user mkr opened a pull request: https://github.com/apache/tika/pull/35 TIKA-1365: Lower priority for XML starting with comment TIKA-1365: Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html You can merge this pull

[jira] [Comment Edited] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365641#comment-14365641 ] Tim Allison edited comment on TIKA-1575 at 3/17/15 5:51 PM: We

[jira] [Comment Edited] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365641#comment-14365641 ] Tim Allison edited comment on TIKA-1575 at 3/17/15 5:50 PM: We

[jira] [Created] (TIKA-1577) NetCDF Data Extraction

2015-03-17 Thread Ann Burgess (JIRA)
Ann Burgess created TIKA-1577: - Summary: NetCDF Data Extraction Key: TIKA-1577 URL: https://issues.apache.org/jira/browse/TIKA-1577 Project: Tika Issue Type: Improvement Components:

[jira] [Commented] (TIKA-1365) Incorrectly MimeType detection for Apache Lucene web site

2015-03-17 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366176#comment-14366176 ] ASF GitHub Bot commented on TIKA-1365: -- GitHub user mkr opened a pull request:

[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1575: -- Attachment: 005937.pdf.json Y, I can't find it in Acro Reader with search either, but it was extracted

[jira] [Comment Edited] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364885#comment-14364885 ] Tim Allison edited comment on TIKA-1575 at 3/17/15 10:27 AM: -

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365641#comment-14365641 ] Tim Allison commented on TIKA-1575: --- We haven't yet integrated OCR with PDFParsing...it

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365733#comment-14365733 ] Tim Allison commented on TIKA-1575: --- If the multithreading hypothesis is correct, we had

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365799#comment-14365799 ] Tim Allison commented on TIKA-1575: --- I've kicked off a single-threaded batch run of 1.8.9

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365807#comment-14365807 ] Tilman Hausherr commented on TIKA-1575: --- Can't tell, I don't know much about the

[jira] [Commented] (TIKA-1365) Incorrectly MimeType detection for Apache Lucene web site

2015-03-17 Thread Matthias Krueger (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366201#comment-14366201 ] Matthias Krueger commented on TIKA-1365: Quick wrapup: * HTML starting with comment

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365829#comment-14365829 ] Tilman Hausherr commented on TIKA-1575: --- Thanks. Re: OCR, you should know that there

[GitHub] tika pull request: TIKA-1554: Adding EMF magic as per Microsoft's ...

2015-03-17 Thread mkr
GitHub user mkr opened a pull request: https://github.com/apache/tika/pull/34 TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to Luis Filipe Nassif TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to Luis Filipe Nassif You can

[jira] [Commented] (TIKA-1554) Improve EMF file detection

2015-03-17 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366038#comment-14366038 ] ASF GitHub Bot commented on TIKA-1554: -- GitHub user mkr opened a pull request:

[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1575: -- Attachment: 005937_1_8_9-SNAPSHOT.pdf.json Corrupted characters where monitoring should be. Given that

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365524#comment-14365524 ] Tilman Hausherr commented on TIKA-1575: --- I can't understand how you get the extracted