[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401985#comment-15401985 ] Tim Allison edited comment on TIKA-2038 at 8/1/16 8:51 PM: This includes the

[jira] [Commented] (TIKA-2046) Can not read PDF correctly

2016-08-01 Thread gopalbhalala (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402326#comment-15402326 ] gopalbhalala commented on TIKA-2046: I linked issue to PDFBOX Thanks for helping me out > Can not read

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402016#comment-15402016 ] Tim Allison commented on TIKA-2038: bq. 1) You are right, my repo on github is fairly new (less than 1

[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2038: Attachment: iust_encodings.zip This includes the encodings as detected by: 1) Tika default, 2) HTML

[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-08-01 Thread Egbert (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401938#comment-15401938 ] Egbert commented on TIKA-2045: Thanks for investigating and reporting it with PDFBox. I'll subscribe to

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401895#comment-15401895 ] Tim Allison commented on TIKA-2038: Are you able to share the second corpus? > A more accurate facility

[jira] [Commented] (TIKA-2046) Can not read PDF correctly

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401888#comment-15401888 ] Tim Allison commented on TIKA-2046: Y, please link the issue in PDFBox's jira to this one so that we can

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401870#comment-15401870 ] Tim Allison edited comment on TIKA-2038 at 8/1/16 11:17 AM: bq. Then I

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401870#comment-15401870 ] Tim Allison commented on TIKA-2038: > Then I remembered that almost all of the test files in my corpus

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401867#comment-15401867 ] Tim Allison commented on TIKA-2038: bq. Unfortunately, I didn’t compare the results of my algorithm

[jira] [Commented] (TIKA-2046) Can not read PDF correctly

2016-08-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401757#comment-15401757 ] Nick Burch commented on TIKA-2046: As per the troubleshooting guide, if one of your files doesn't work

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Shabanali Faghani (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401715#comment-15401715 ] Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 8:45 AM: OK, so to

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Shabanali Faghani (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401715#comment-15401715 ] Shabanali Faghani commented on TIKA-2038: OK, so to give more details about my library to this

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Shabanali Faghani (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401595#comment-15401595 ] Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 6:42 AM: I got

[jira] [Commented] (TIKA-2046) Can not read PDF correctly

2016-08-01 Thread gopalbhalala (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401604#comment-15401604 ] gopalbhalala commented on TIKA-2046: Thanks Nick... I tried that also but got same issue for 2nd link

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Shabanali Faghani (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401595#comment-15401595 ] Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 6:33 AM: I got

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Shabanali Faghani (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401595#comment-15401595 ] Shabanali Faghani commented on TIKA-2038: I got astonished by these results at first look! Because