RE: Index scanned documents

Allison, Timothy B. Mon, 27 Mar 2017 05:08:29 -0700

Please also see: 

https://wiki.apache.org/tika/TikaOCR


and

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

If you have any other questions about Apache Tika and OCR, please feel free to 
ask on our users list as well: u...@tika.apache.org

Cheers,

           Tim

-----Original Message-----
From: Arian Pasquali [mailto:arianpasqu...@gmail.com] 
Sent: Sunday, March 26, 2017 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

Hi Waleed,

I've never done that with Solr, but you would probably need to run some OCR 
preprocessing before indexing.
The most popular library I know of for the job is tesseract-ocr 
<https://github.com/tesseract-ocr>.
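
A minimal sketch of that preprocessing step, assuming the `tesseract` binary is installed and on the PATH, and assuming a Solr core named `docs` with `id` and `text` fields. The core name, field names, and helper function names are illustrative and not something from this thread:

```python
import json
import subprocess
import urllib.request


def tesseract_command(image_path, lang="eng"):
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    # Requires the tesseract binary on the PATH; raises if it exits non-zero.
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def solr_update_request(solr_url, core, doc_id, text):
    # Builds a JSON update request; "id" and "text" are assumed field names.
    body = json.dumps([{"id": doc_id, "text": text}]).encode("utf-8")
    url = f"{solr_url}/{core}/update?commit=true"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_scanned_image(image_path, solr_url="http://localhost:8983/solr", core="docs"):
    # OCR the image, then send the text to Solr using the image path as the id.
    text = ocr_image(image_path)
    request = solr_update_request(solr_url, core, image_path, text)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `index_scanned_image("page-001.png")` would OCR one page image and index it. Images embedded in PDF or MS Office files would first have to be extracted into image files, which this sketch does not cover.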

If you want to do that inside Solr, I've found that Tika has some support for 
it too.
Take a look at Vijay Mhaskar's post on how to do this using TikaOCR:

http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
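
For the Tika route, a rough sketch of sending a file to Solr's extracting request handler (`/update/extract`), which hands the document to Tika on the server. This assumes the handler is enabled on a core named `docs` (a hypothetical name) and that Tesseract is installed on the Solr host so Tika can use it; whether images embedded in PDFs get OCR'd depends on the Tika PDF parser configuration, which is not shown here:

```python
import urllib.parse
import urllib.request


def extract_url(solr_url, core, doc_id, commit=True):
    # literal.id sets the document's id field; the core name is an assumption.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{solr_url}/{core}/update/extract?" + urllib.parse.urlencode(params)


def post_document(path, doc_id, solr_url="http://localhost:8983/solr",
                  core="docs", content_type="application/pdf"):
    # Sends the raw file bytes; Solr passes them to Tika for text extraction.
    with open(path, "rb") as handle:
        data = handle.read()
    request = urllib.request.Request(
        extract_url(solr_url, core, doc_id),
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, `post_document("scan.pdf", "scan-1")` would submit one PDF under the id `scan-1`.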

I hope that helps.

On Sun, Mar 26, 2017 at 16:09, Waleed Raza < 
waleed.raza.parhi...@gmail.com> wrote:

> Hello,
> How can we extract text in Solr from images that are inside PDF and 
> MS Office documents?
> I searched many websites but did not find an answer. Please guide me.
>
> On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza < 
> waleed.raza.parhi...@gmail.com
> > wrote:
>
> > Hello,
> > How can we extract text in Solr from images that are inside PDF and 
> > MS Office documents?
> > I searched many websites but did not find an answer. Please guide me.
> >
> >
>
--

*Arian Rodrigo Pasquali*
Laboratório de Inteligência Artificial e Apoio à Decisão
Laboratory of Artificial Intelligence and Decision Support

*INESC TEC*
Campus da FEUP
Rua Dr Roberto Frias
4200-465 Porto
Portugal

T +351 22 040 2963
F +351 22 209 4050
arian.r.pasqu...@inesctec.pt
www.inesctec.pt
