Thanks Konstantin!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Konstantin Gribov <gros...@gmail.com>
Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
Date: Monday, April 27, 2015 at 12:43 PM
To: "u...@tika.apache.org" <u...@tika.apache.org>
Cc: "trung...@anlab.vn" <trung...@anlab.vn>, "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Subject: Re: TIKA OCR not working

>JFYI, there's no tesseract & leptonica for centos6/rhel6 (even in epel),
>so I have specs for building tesseract and leptonica (its dependency) on
>github (https://github.com/grossws/tesseract-ocr-specs).
>Feel free to use them if you're on centos/rhel.
>
>Also, tesseract language packs are trained for one language each, so
>a dual-language document would get quite a bad OCR result even when both
>languages use latin chars. You can use
>o.a.tika.parser.ocr.TesseractOCRConfig.setLanguage(String) to set the lang
>pack for OCR.
>
>--
>Best regards,
>Konstantin Gribov
>
>On Mon, Apr 27, 2015 at 17:36, Uwe Schindler <u...@thetaphi.de> wrote:
>
>Yes that is fixed.
>
>-----
>Uwe Schindler
>H.-H.-Meier-Allee 63, D-28213 Bremen
>http://www.thetaphi.de
>eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>> Sent: Monday, April 27, 2015 4:29 PM
>> To: u...@tika.apache.org
>> Cc: trung...@anlab.vn; solr-user@lucene.apache.org
>> Subject: Re: TIKA OCR not working
>>
>> It should work out of the box in Solr as long as Tesseract is installed and on
>> the class path. Solr had an issue with it since Tika sends 2 startDocument calls,
>> but I fixed that with Uwe and it was shipped in 4.10.4 and in 5.x I think?
>>
>> -----Original Message-----
>> From: <Allison>, "Timothy B." <talli...@mitre.org>
>> Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Date: Monday, April 27, 2015 at 10:26 AM
>> To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Cc: "trung...@anlab.vn" <trung...@anlab.vn>, "solr-u...@lucene.apache.org" <solr-user@lucene.apache.org>
>> Subject: FW: TIKA OCR not working
>>
>> >Trung,
>> >
>> >I haven't experimented with our OCR parser yet, but this should give a
>> >good start: https://wiki.apache.org/tika/TikaOCR .
>> >
>> >Have you installed tesseract?
>> >
>> >Tika colleagues,
>> > Any other tips? What else has to be configured and how?
>> >
>> >-----Original Message-----
>> >From: trung.ht [mailto:trung...@anlab.vn]
>> >Sent: Friday, April 24, 2015 11:22 PM
>> >To: solr-user@lucene.apache.org
>> >Subject: Re: TIKA OCR not working
>> >
>> >Hi everyone,
>> >
>> >Does anyone have the answer for this problem :)?
>> >
>> >> I saw the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
>> >> but it looks like it does not work. Does anyone know whether Tika OCR
>> >> works automatically with Solr, or do I have to change some settings?
>> >>
>> >Trung.
>> >
>> >>> It's not clear if OCR would happen automatically in Solr Cell, or if
>> >>> changes to Solr would be needed.
>> >>>
>> >>> For Tika OCR info, see:
>> >>>
>> >>> https://issues.apache.org/jira/browse/TIKA-93
>> >>> https://wiki.apache.org/tika/TikaOCR
>> >>>
>> >>> -- Jack Krupansky
>> >>>
>> >>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> >>>
>> >>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
>> >>> > in use yet.
>> >>> >
>> >>> > Regards,
>> >>> > Alex
>> >>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:
>> >>> >
>> >>> > > Hi Trung,
>> >>> > >
>> >>> > > I didn't know about OCR capabilities of tika.
>> >>> > > Someone who is familiar with solr-cell can inform us whether
>> >>> > > this functionality is added to solr or not.
>> >>> > >
>> >>> > > Ahmet
>> >>> > >
>> >>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
>> >>> > > Hi Ahmet,
>> >>> > >
>> >>> > > I used a png file, not a pdf file. From the documentation, I understand that
>> >>> > > solr will post the file to tika, and since tika 1.7, OCR is included. Is
>> >>> > > there something I misunderstood?
>> >>> > >
>> >>> > > Trung.
>> >>> > >
>> >>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:
>> >>> > >
>> >>> > > > Hi Trung,
>> >>> > > >
>> >>> > > > solr-cell (tika) does not do OCR. It cannot extract text from image-based
>> >>> > > > pdfs.
>> >>> > > >
>> >>> > > > Ahmet
>> >>> > > >
>> >>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
>> >>> > > >
>> >>> > > > Hi,
>> >>> > > >
>> >>> > > > I want to use solr to index some scanned documents. After setting up a solr
>> >>> > > > document with two fields, "content" and "filename", I tried to upload
>> >>> > > > the attached file, but it seems that the content of the file is only
>> >>> > > > "\n \n \n....".
>> >>> > > > But if I use tesseract from the command line, I get the correct result.
>> >>> > > >
>> >>> > > > The log when solr receives my request:
>> >>> > > > -----------
>> >>> > > > INFO - 2015-04-23 03:49:25.941;
>> >>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>> >>> > > > webapp=/solr path=/update/extract
>> >>> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> >>> > > >
>> >>> > > > ------------
>> >>> > > >
>> >>> > > > The document when I check on the solr admin page:
>> >>> > > > -------------
>> >>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>> >>> > > > "createddate": "2015-04-22T15:00:00Z",
>> >>> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>> >>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>> >>> > > > "content": " \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ",
>> >>> > > > "_version_": 1499213034586898400 }
>> >>> > > >
>> >>> > > > -----------
>> >>> > > >
>> >>> > > > Since I am a solr newbie I do not know where to look. Can anyone give me
>> >>> > > > advice on where to look for errors or settings to make it work?
>> >>> > > > Thanks in advance.
>> >>> > > >
>> >>> > > > Trung.
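
The symptom reported in this thread (a "content" field holding only whitespace) and the first question asked in reply ("Have you installed tesseract?") can be checked with a small script. The following is only a minimal sketch, not part of Solr or Tika. It assumes that the OCR step relies on a `tesseract` executable discoverable on the system PATH, and the function names and the base URL are hypothetical, chosen for illustration. Only the `/update/extract` path and the parameter names come from the log quoted above.

```python
import shutil
from urllib.parse import urlencode


def tesseract_on_path(executable="tesseract"):
    """Return True if an executable with this name is found on the system PATH."""
    return shutil.which(executable) is not None


def is_blank_extraction(doc, field="content"):
    """Return True if the extracted text field is missing or whitespace only."""
    value = doc.get(field, "")
    if isinstance(value, list):  # multi-valued fields come back as lists
        value = " ".join(str(v) for v in value)
    return not str(value).strip()


def build_extract_url(base_url, params):
    """Build an /update/extract URL with percent-encoded query parameters."""
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # A document shaped like the one shown on the admin page in the thread.
    doc = {"id": "4", "content": " \n \n \n "}
    print("blank extraction:", is_blank_extraction(doc))
    print("tesseract found:", tesseract_on_path())
    # Hypothetical host and core; the parameters mirror the logged request.
    print(build_extract_url(
        "http://localhost:8983/solr/collection1",
        {"literal.id": "4", "fmap.content": "content", "commit": "true"},
    ))
```

If the extraction is blank while `tesseract` works from the command line, as Trung describes, the next things to compare would be the environment of the Solr process and the Tika version bundled with the Solr release in use.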