Please also see: https://wiki.apache.org/tika/TikaOCR
and https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

If you have any other questions about Apache Tika and OCR, please feel free to ask on our users list as well: u...@tika.apache.org

Cheers,

Tim

-----Original Message-----
From: Arian Pasquali [mailto:arianpasqu...@gmail.com]
Sent: Sunday, March 26, 2017 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

Hi Waleed,

I've never done that with Solr, but you would probably need to run some OCR preprocessing before indexing.
The most popular library I know for the job is tesseract-ocr <https://github.com/tesseract-ocr>.

If you want to do that inside Solr, I've found that Tika has some support for that too.
Take a look at Vijay Mhaskar's post on how to do this using TikaOCR:
http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/

I hope that guides you.

On Sun, Mar 26, 2017 at 16:09, Waleed Raza <waleed.raza.parhi...@gmail.com> wrote:

> Hello
> I want to ask how we can extract text in Solr from images
> which are inside PDF and MS Office documents.
> I found many websites but did not get an answer; please guide me.
>
> On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza <waleed.raza.parhi...@gmail.com> wrote:
>
> > Hello
> > I want to ask how we can extract text in Solr from images
> > which are inside PDF and MS Office documents.
> > I found many websites but did not get an answer; please guide me.

--
Arian Rodrigo Pasquali
Laboratório de Inteligência Artificial e Apoio à Decisão
Laboratory of Artificial Intelligence and Decision Support
INESC TEC
Campus da FEUP
Rua Dr Roberto Frias
4200-465 Porto
Portugal
T +351 22 040 2963
F +351 22 209 4050
arian.r.pasqu...@inesctec.pt
www.inesctec.pt
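
The "OCR preprocessing before indexing" idea from the thread could look roughly like the Python sketch below. It is not taken from any of the posters and makes several assumptions: the `tesseract` command-line tool is installed and on the PATH, the input is a standalone image file (images embedded in PDF or MS Office files would first have to be extracted, which is the part the Tika pages above cover), and the Solr schema has a text field matching the hypothetical name `content_txt`. The function names are illustrative only.

```python
import json
import subprocess
import urllib.request
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract CLI call; the output base 'stdout' prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image file and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout.strip()


def solr_document(image_path, text, text_field="content_txt"):
    """Map an image and its OCR text to a Solr document.

    Using the file name as the id and 'content_txt' as the field are
    assumptions; adjust both to the actual schema.
    """
    return {"id": Path(image_path).name, text_field: text}


def index_documents(docs, update_url):
    """POST a JSON list of documents to a Solr update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a local Solr core named `mycore` (a placeholder), a call might be `index_documents([solr_document("scan.png", ocr_image("scan.png"))], "http://localhost:8983/solr/mycore/update?commit=true")`. Keeping OCR outside Solr like this trades convenience for control: the Tika route mentioned in the thread does extraction and OCR in one step, while a separate script lets you inspect or clean the recognized text before it is indexed.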