[jira] [Commented] (TIKA-1866) Out of memory error on Word document

2016-02-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15160099#comment-15160099 ] Tim Allison commented on TIKA-1866: Not TikaInputStream's fault. This looks to be a bug deep within

[jira] [Commented] (TIKA-1866) Out of memory error on Word document

2016-02-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15160022#comment-15160022 ] Tim Allison commented on TIKA-1866: Strike that...image handling is not the problem. If I save the

[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2016-02-23 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159659#comment-15159659 ] Chris A. Mattmann commented on TIKA-1696: [~lewistre] FYI > Language Identification with Text

Re: Integrating Tika with MITLL Text.jl library for language detection

2016-02-23 Thread Mattmann, Chris A (3980)
Thanks Ken. We are working on bringing in Text.jl and prefer at this point to work on the 1.x branch, aka master. I’ve asked Trevor to take a look at the 1.x branch and at pulling your code for the tika-detect module from 2.x into 1.x, and then to look at adding Text.jl from MIT-LL as a corresponding

[jira] [Assigned] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2016-02-23 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned TIKA-1696: Assignee: Chris A. Mattmann > Language Identification with Text Processing Toolkit

RE: Integrating Tika with MITLL Text.jl library for language detection

2016-02-23 Thread Ken Krugler
Hi Trevor, 1. I assume the benchmark was using a pre-2.0 version of Tika, yes? It would be great to try out the current support in the 2.0 branch, as a comparison with what we had previously. Also, details on the test corpus used would be useful. 2. I started using the ServiceLoader pattern

Integrating Tika with MITLL Text.jl library for language detection

2016-02-23 Thread Trevor Claude Lewis
Hi all, I am Trevor, a grad student at USC currently working with Prof. Chris Mattmann and Paul Ramirez on integrating Tika with MIT Lincoln Lab’s Text.jl library for language detection. https://issues.apache.org/jira/browse/TIKA-1696 Since Text.jl is written in Julia, I have created a

[jira] [Commented] (TIKA-1867) Tika external parsers cannot be turned off without patching the tika-app-XX.jar

2016-02-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159042#comment-15159042 ] Nick Burch commented on TIKA-1867: You should be able to exclude the CompositeExternalParser with a ~5

[jira] [Created] (TIKA-1867) Tika external parsers cannot be turned off without patching the tika-app-XX.jar

2016-02-23 Thread Roman Kratochvil (JIRA)
Roman Kratochvil created TIKA-1867: Summary: Tika external parsers cannot be turned off without patching the tika-app-XX.jar Key: TIKA-1867 URL: https://issues.apache.org/jira/browse/TIKA-1867

[jira] [Commented] (TIKA-1866) Out of memory error on Word document

2016-02-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158894#comment-15158894 ] Tim Allison commented on TIKA-1866: Looks like something in the image handling is causing problems.

[jira] [Commented] (TIKA-1866) Out of memory error on Word document

2016-02-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158871#comment-15158871 ] Tim Allison commented on TIKA-1866: That's exciting. I'll take a look. > Out of memory error on Word