Re: Tess4j API for TIKA OCR parser

2017-03-08 Thread Thejan Wijesinghe
om: Thamme Gowda [mailto:thammego...@apache.org] > Sent: Tuesday, March 7, 2017 10:38 AM > To: Thejan Wijesinghe <thejan.k.wijesin...@gmail.com> > Cc: dev@tika.apache.org > Subject: Re: Tess4j API for TIKA OCR parser > > Thanks Nick for the reply. > > Thejan, > > I a

RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thamme Gowda
om: Thamme Gowda [mailto:thammego...@apache.org] Sent: Tuesday, March 7, 2017 10:38 AM To: Thejan Wijesinghe <thejan.k.wijesin...@gmail.com> Cc: dev@tika.apache.org Subject: Re: Tess4j API for TIKA OCR parser Thanks Nick for the reply. Thejan, I am glad to know your progress. Rewriting the Te

RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Allison, Timothy B.
he.org Subject: Re: Tess4j API for TIKA OCR parser Thanks Nick for the reply. Thejan, I am glad to know your progress. Rewriting the TesseractOCRParser would be the ultimate goal if using Tess4j proves to be better than the way it is done currently. But, for now, please consider these: + Rename your

RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Allison, Timothy B.
+1 Same experience, of same vintage. :) -Original Message- From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] Sent: Tuesday, March 7, 2017 10:34 AM To: dev@tika.apache.org Subject: Re: Tess4j API for TIKA OCR parser Hi Thejan, Before the first version of TesseractOcrParser

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Luís Filipe Nassif
Hi Thejan, Before the first version of TesseractOcrParser was committed I tried to use Tess4j; that was 4 years ago. Unfortunately, at that time I ran into some problems like permanent hangs with tesseract/Tess4j and, even worse, JVM crashes because of bugs in native code (pointers to crazy

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thamme Gowda
Thanks Nick for the reply. Thejan, I am glad to know your progress. Rewriting the TesseractOCRParser would be the ultimate goal if using Tess4j proves to be better than the way it is done currently. But, for now, please consider these: + Rename your class to *Tess4jOCRParser*. It is a new

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thejan Wijesinghe
Hi Nick, I thought the same thing. I will try to keep the public method signatures unchanged and will send updates on my progress. On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch wrote: > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote: > >> I have already used the Tess4j API to

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Nick Burch
On Tue, 7 Mar 2017, Thejan Wijesinghe wrote: I have already used the Tess4j API to rewrite the TesseractOCRParser class. Although it successfully extracts content from most file types, it fails some particular unit tests in the TesseractOCRParserTest class. I can solve that. However, I

Re: Tess4j API for TIKA OCR parser

2017-03-06 Thread Thejan Wijesinghe
Thamme, I have already used the Tess4j API to rewrite the TesseractOCRParser class. Although it successfully extracts content from most file types, it fails some particular unit tests in the TesseractOCRParserTest class. I can solve that. However, I want to know whether I can rewrite the

Tess4j API for TIKA OCR parser

2017-03-04 Thread Thejan Wijesinghe
Hi Thamme, Yes. I am using Ubuntu :) and I had both ImageMagick and Tesseract installed on my system using apt-get. Since I wasn't sure whether this was a problem with the APT software packages, I built both ImageMagick and Tesseract from source. I also double checked the availability of