Re: TIKA OCR not working

Mattmann, Chris A (3980) Mon, 27 Apr 2015 09:48:07 -0700

Thanks Konstantin!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







-----Original Message-----
From: Konstantin Gribov <gros...@gmail.com>
Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
Date: Monday, April 27, 2015 at 12:43 PM
To: "u...@tika.apache.org" <u...@tika.apache.org>
Cc: "trung...@anlab.vn" <trung...@anlab.vn>, "solr-user@lucene.apache.org"
<solr-user@lucene.apache.org>
Subject: Re: TIKA OCR not working

>JFYI, there's no tesseract & leptonica for centos6/rhel6 (even in epel),
>so I have specs for building tesseract and leptonica (its dependency) on
>github (https://github.com/grossws/tesseract-ocr-specs).
>Feel free to use them if you're on centos/rhel.
>
>
>Also, tesseract language packs are trained for one language each, so a
>dual-lang document would have quite bad OCR results even when both
>languages use Latin chars. You can use
>o.a.tika.parser.ocr.TesseractOCRConfig.setLanguage(String) to set the
>lang pack for OCR.
>
>
>-- 
>Best regards,
>Konstantin Gribov
>
>
>Mon, 27 Apr 2015 at 17:36, Uwe Schindler <u...@thetaphi.de>:
>
>Yes that is fixed.
>
>-----
>Uwe Schindler
>H.-H.-Meier-Allee 63, D-28213 Bremen
>http://www.thetaphi.de
>eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>> Sent: Monday, April 27, 2015 4:29 PM
>> To: u...@tika.apache.org
>> Cc: trung...@anlab.vn; solr-user@lucene.apache.org
>> Subject: Re: TIKA OCR not working
>>
>> It should work out of the box in Solr as long as Tesseract is installed
>> and on the PATH. Solr had an issue with it since Tika sends 2
>> startDocument calls, but I fixed that with Uwe and it was shipped in
>> 4.10.4 and in 5.x I think?
>>
>> -----Original Message-----
>> From: <Allison>, "Timothy B." <talli...@mitre.org>
>> Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Date: Monday, April 27, 2015 at 10:26 AM
>> To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Cc: "trung...@anlab.vn" <trung...@anlab.vn>,
>> "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>> Subject: FW: TIKA OCR not working
>>
>> >Trung,
>> >
>> >I haven't experimented with our OCR parser yet, but this should give a
>> >good start: https://wiki.apache.org/tika/TikaOCR .
>> >
>> >Have you installed tesseract?
>> >
>> >Tika colleagues,
>> >  Any other tips?  What else has to be configured and how?
>> >
>> >-----Original Message-----
>> >From: trung.ht [mailto:trung...@anlab.vn]
>> >Sent: Friday, April 24, 2015 11:22 PM
>> >To: solr-user@lucene.apache.org
>> >Subject: Re: TIKA OCR not working
>> >
>> >HI everyone,
>> >
>> >Does anyone have the answer for this problem :)?
>> >
>> >
>> >I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses
>> >Tika 1.7, but it looks like it does not work. Does anyone know whether
>> >Tika OCR works automatically with Solr, or do I have to change some
>> >settings?
>> >
>> >Trung.
>> >
>> >
>> >>> It's not clear if OCR would happen automatically in Solr Cell, or if
>> >>> changes to Solr would be needed.
>> >>>
>> >>> For Tika OCR info, see:
>> >>>
>> >>> https://issues.apache.org/jira/browse/TIKA-93
>> >>> https://wiki.apache.org/tika/TikaOCR
>> >>>
>> >>>
>> >>>
>> >>> -- Jack Krupansky
>> >>>
>> >>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch
>> >>> <arafa...@gmail.com> wrote:
>> >>>
>> >>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't
>> >>> > seen it in use yet.
>> >>> >
>> >>> > Regards,
>> >>> >     Alex
>> >>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan"
>> >>> > <iori...@yahoo.com.invalid> wrote:
>> >>> >
>> >>> > > Hi Trung,
>> >>> > >
>> >>> > > I didn't know about the OCR capabilities of tika.
>> >>> > > Someone who is familiar with solr-cell can inform us whether
>> >>> > > this functionality has been added to solr or not.
>> >>> > >
>> >>> > > Ahmet
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht
>> >>> > > <trung...@anlab.vn> wrote:
>> >>> > > Hi Ahmet,
>> >>> > >
>> >>> > > I used a png file, not a pdf file. From the documentation, I
>> >>> > > understand that solr will post the file to tika, and since tika
>> >>> > > 1.7, OCR is included. Is there something I misunderstood?
>> >>> > >
>> >>> > > Trung.
>> >>> > >
>> >>> > >
>> >>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
>> >>> > > <iori...@yahoo.com.invalid> wrote:
>> >>> > >
>> >>> > > > Hi Trung,
>> >>> > > >
>> >>> > > > solr-cell (tika) does not do OCR. It cannot extract text from
>> >>> > > > image-based pdfs.
>> >>> > > >
>> >>> > > > Ahmet
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht
>> >>> > > > <trung...@anlab.vn> wrote:
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > > > Hi,
>> >>> > > >
>> >>> > > > I want to use solr to index some scanned documents. After
>> >>> > > > setting up a solr document with two fields, "content" and
>> >>> > > > "filename", I tried to upload the attached file, but it seems
>> >>> > > > that the content of the file is only "\n \n \n....".
>> >>> > > > But if I use tesseract from the command line, I get the
>> >>> > > > correct result.
>> >>> > > >
>> >>> > > > The log when solr receives my request:
>> >>> > > > -----------
>> >>> > > > INFO  - 2015-04-23 03:49:25.941;
>> >>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>> >>> > > > webapp=/solr path=/update/extract
>> >>> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> >>> > > > ------------
>> >>> > > >
>> >>> > > > The document when I check on solr admin page:
>> >>> > > > -------------
>> >>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>> >>> > > > "createddate": "2015-04-22T15:00:00Z",
>> >>> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>> >>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>> >>> > > > "content": " \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> >>> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> >>> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n ",
>> >>> > > > "_version_": 1499213034586898400 }
>> >>> > > >
>> >>> > > > -----------
>> >>> > > >
>> >>> > > > Since I am a solr newbie, I do not know where to look. Can
>> >>> > > > anyone give me advice on where to look for errors, or which
>> >>> > > > settings to change to make it work?
>> >>> > > > Thanks in advance.
>> >>> > > >
>> >>> > > > Trung.
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >>
>
>
>
>
>
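
A note on the "Tesseract is installed" precondition mentioned earlier in the thread: a quick way to rule out a missing binary is to check whether an executable named tesseract is reachable through the PATH environment variable. The sketch below does that in plain Java. It assumes Tika launches Tesseract as an external process resolved via PATH when no explicit location is configured; the class name TesseractPathCheck and the method findOnPath are hypothetical names made up for this illustration and are not part of Tika or Solr.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper for illustration only; not a Tika or Solr class.
public class TesseractPathCheck {

    /**
     * Searches each directory of a PATH-style string for an executable
     * regular file with the given name and returns the first match, if any.
     */
    static Optional<Path> findOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            Path candidate = Paths.get(dir, executable);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Checks the plain name "tesseract" only (assumption: a Unix-like host).
        Optional<Path> hit = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(hit.map(p -> "tesseract found at " + p)
                              .orElse("tesseract not found on PATH"));
    }
}
```

Run it as the same user and in the same environment that starts Solr, since the PATH seen by a servlet container can differ from the one in an interactive shell. On Windows the file would be tesseract.exe, which this sketch does not look for.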
