[jira] [Comment Edited] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Abha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396332#comment-17396332 ] Abha edited comment on TIKA-3518 at 8/10/21, 1:09 AM: -- Update – So I tried 1.27 and

[jira] [Commented] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Abha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396339#comment-17396339 ] Abha commented on TIKA-3518: So if I copy the eng.traineddata from the tessdata folder and move it to the

[jira] [Comment Edited] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Abha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396332#comment-17396332 ] Abha edited comment on TIKA-3518 at 8/10/21, 1:00 AM: -- Update -- So I tried 1.27 and

[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-09 Thread Xiaohong Yang (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396336#comment-17396336 ] Xiaohong Yang commented on TIKA-3519: - No. We have not. I will try it and let you know.  Thank you

[jira] [Commented] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Abha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396332#comment-17396332 ] Abha commented on TIKA-3518: Please find my response inline - When you say the processbuilder

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396302#comment-17396302 ] Luís Filipe Nassif commented on TIKA-3515: -- Yes, seems to be some java Font issue, if I copy the

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396282#comment-17396282 ] Tim Allison commented on TIKA-3515: --- The UI is not very familiar to me. It looks like regular java

[jira] [Commented] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396277#comment-17396277 ] Tim Allison commented on TIKA-3518: --- There shouldn't be any new config changes. Hmmm... When you say

[jira] [Comment Edited] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396260#comment-17396260 ] Tim Allison edited comment on TIKA-3517 at 8/9/21, 8:27 PM: For posterity, I'm

[jira] [Comment Edited] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396260#comment-17396260 ] Tim Allison edited comment on TIKA-3517 at 8/9/21, 8:26 PM: For posterity, I'm

[jira] [Commented] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396260#comment-17396260 ] Tim Allison commented on TIKA-3517: --- For posterity, I'm attaching the Document.iwa file from inside the

[jira] [Comment Edited] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Abha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396258#comment-17396258 ] Abha edited comment on TIKA-3518 at 8/9/21, 8:18 PM: - I tested it with JDK 11 and

[jira] [Commented] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Abha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396258#comment-17396258 ] Abha commented on TIKA-3518: I tested it with JDK 11 and still the same issue. The ProcessBuilder class in

[jira] [Updated] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3517: -- Attachment: Document Document.iwa > Text extraction doesn't work for Pages and Numbers

[jira] [Updated] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated TIKA-3515: - Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) > Korean chars

[jira] [Commented] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Abha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396218#comment-17396218 ] Abha commented on TIKA-3518: On debugging, it seems that the ProcessBuilder (Java) is not creating the tmp

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396217#comment-17396217 ] Luís Filipe Nassif commented on TIKA-3515: -- Programmatically it worked fine. Is 

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396206#comment-17396206 ] Luís Filipe Nassif commented on TIKA-3515: -- Humm both 2 commands above worked (also without -J

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396203#comment-17396203 ] Tim Allison commented on TIKA-3515: --- There is a subtle difference in tika-cli in handling xhtml/html and

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396201#comment-17396201 ] Tim Allison commented on TIKA-3515: --- Or, what happens if you try the {{-e}} option with {{-t}}?

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396199#comment-17396199 ] Tim Allison commented on TIKA-3515: --- I'm perplexed... :( Y, those question marks are literally byte

[jira] [Updated] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated TIKA-3515: - Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302 > Korean chars not extracted

[jira] [Updated] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated TIKA-3515: - Attachment: LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml

[jira] [Comment Edited] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396188#comment-17396188 ] Luís Filipe Nassif edited comment on TIKA-3515 at 8/9/21, 5:43 PM: --- Hi,

[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396188#comment-17396188 ] Luís Filipe Nassif commented on TIKA-3515: -- Hi, Tim. This is what I'm getting from tika-app from

[jira] [Updated] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated TIKA-3515: - Attachment: image-2021-08-09-14-38-26-763.png > Korean chars not extracted correctly >

[jira] [Updated] (TIKA-3515) Korean chars not extracted correctly

2021-08-09 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated TIKA-3515: - Attachment: image-2021-08-09-14-37-30-552.png > Korean chars not extracted correctly >

[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396134#comment-17396134 ] Tim Allison commented on TIKA-3519: --- Have you tried setting a write limit on your handler? e.g.

[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-09 Thread Xiaohong Yang (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396067#comment-17396067 ] Xiaohong Yang commented on TIKA-3519: - We call Tika programmatically. > Wonder if you can add a

[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396043#comment-17396043 ] Sebastian Nagel commented on TIKA-3489: --- +1 to leave it as is. A backport definitely makes sense, in

[jira] [Commented] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

2021-08-09 Thread Chris Bryant (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396040#comment-17396040 ] Chris Bryant commented on TIKA-3517: Thanks, [~tallison].  I see Tika-1358 now.  With regular

[jira] [Commented] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

2021-08-09 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396006#comment-17396006 ] Hudson commented on TIKA-3517: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #304 (See

[jira] [Commented] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395967#comment-17395967 ] Tim Allison commented on TIKA-3517: --- I fixed this in main/2.0.1-SNAPSHOT. However, regrettably, we

[jira] [Commented] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

2021-08-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395964#comment-17395964 ] Tim Allison commented on TIKA-3518: --- I'm not seeing problems with 4.1.1 locally. I think we added more

Re: Kaitai - might be worth trying for new formats

2021-08-09 Thread Tim Allison
Porting our RTFParser might be a good place to start. I’ve heard good things about kaitai… thank you for sharing! On Mon, Aug 9, 2021 at 5:40 AM Nick Burch wrote: > Hi All > > I came across Kaitai - http://kaitai.io/ - yesterday. Based on the > experiences documented in this twitter thread on

Kaitai - might be worth trying for new formats

2021-08-09 Thread Nick Burch
Hi All I came across Kaitai - http://kaitai.io/ - yesterday. Based on the experiences documented in this twitter thread on understanding + parsing an embedded filesystem: https://twitter.com/wrongbaud/status/1424380510671880198 Looks like it might be worth a look for if we need to write our