Thanks Rick. Minutes of CPU per page is definitely going to break my site. I'm looking for someone to hire, as I have no coding knowledge. Please let me know if you are up for it.
On Mon, Nov 6, 2017 at 8:05 PM, Rick Leir <rl...@leirtech.com> wrote:

> Anand,
> As Charlie says, you should have a separate process for this. Also, if you
> go back about ten months in this mailing list you will see some discussion
> about how OCR can take minutes of CPU per page and needs some
> preprocessing with ImageMagick or GraphicsMagick. You will want to do some
> fine tuning with this, then save your OCR output in a DB or the filesystem.
> Then you will want to be able to re-index Solr easily as you fine-tune Solr.
>
> Yes, use Python or your preferred scripting language.
> Cheers -- Rick
>
> On November 6, 2017 4:05:42 AM EST, Charlie Hull <char...@flax.co.uk> wrote:
> >On 03/11/2017 15:32, Admin eLawJournal wrote:
> >> Hi,
> >> I have read that we can use Tesseract with Solr to index image files. I
> >> would like some guidance on setting this up.
> >>
> >> Currently, I am using Solr for searching my WordPress installation via
> >> the WPSOLR plugin.
> >>
> >> I have Solr 6.6 installed on Ubuntu 14.04, which is working fine with
> >> WordPress.
> >>
> >> I have also installed Tesseract but have no clue how to configure it.
> >>
> >> I am new to Solr, so I will greatly appreciate detailed step-by-step
> >> instructions.
> >
> >Hi,
> >
> >I'm guessing that if you're using a preconfigured Solr plugin for WordPress
> >you probably haven't got your hands properly dirty with Solr yet.
> >
> >One way to use Tesseract would be via Apache Tika
> >https://wiki.apache.org/tika/TikaOCR which is an awesome library for
> >extracting plain text from many different document formats and types.
> >There's a direct way to use Tesseract from within Solr (the
> >ExtractingRequestHandler
> >https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika)
> >but we don't generally recommend this, as dodgy files can sometimes eat
> >all your resources during parsing, and if Tika dies then so does Solr. We
> >usually process the files externally and then feed them to Solr using its
> >HTTP API.
> >
> >Here's one way to do it - a simple server wrapper around Tika
> >https://github.com/mattflax/dropwizard-tika-server written by my
> >colleague Matt Pearce.
> >
> >So you're going to need to do some coding, I think - Python would be a
> >good choice - to feed your source files to Tika for OCR and extraction,
> >and then the resulting text to Solr for indexing.
> >
> >Cheers
> >
> >Charlie
> >
> >> Thank you very much
> >
> >--
> >Charlie Hull
> >Flax - Open Source Enterprise Search
> >
> >tel/fax: +44 (0)8700 118334
> >mobile: +44 (0)7767 825828
> >web: www.flax.co.uk
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
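
For anyone else following this thread, below is a rough sketch of the external pipeline Rick and Charlie describe, written in Python since that is what they suggest. It is illustrative only, not a tested implementation: it assumes ImageMagick's "convert" and Tesseract are on the PATH, that Solr is running at localhost:8983, that the core is called "wordpress", and that the field names (content_txt, source_s) are handled by Solr's default dynamic-field rules. All of those names are assumptions you would adapt to your own setup (WPSOLR's schema may differ).

    #!/usr/bin/env python3
    """Sketch of an external OCR-to-Solr pipeline (illustrative only).

    Assumptions: ImageMagick ('convert') and Tesseract are on the PATH,
    Solr 6.6 is running on localhost:8983, and the core name is
    'wordpress'. Adjust core name, field names, and paths to your setup.
    """
    import pathlib
    import subprocess
    import uuid

    import requests

    # Hypothetical core name; replace with the core WPSOLR actually uses.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/wordpress/update/json/docs?commit=true"


    def ocr_image(image_path: pathlib.Path, workdir: pathlib.Path) -> str:
        """Preprocess one image with ImageMagick, then OCR it with Tesseract."""
        cleaned = workdir / (image_path.stem + "_clean.png")
        # Basic preprocessing: greyscale and upscale. Tune these flags for
        # your scans; this is the "fine tuning" step Rick mentions.
        subprocess.run(
            ["convert", str(image_path), "-colorspace", "Gray",
             "-resize", "200%", str(cleaned)],
            check=True,
        )
        out_base = workdir / image_path.stem
        # Tesseract writes its output to <out_base>.txt
        subprocess.run(["tesseract", str(cleaned), str(out_base)], check=True)
        return pathlib.Path(str(out_base) + ".txt").read_text(encoding="utf-8")


    def index_document(text: str, source: str) -> None:
        """Send one extracted document to Solr over its HTTP API."""
        # content_txt / source_s assume Solr's default *_txt / *_s dynamic fields.
        doc = {"id": str(uuid.uuid4()), "source_s": source, "content_txt": text}
        resp = requests.post(SOLR_UPDATE_URL, json=doc, timeout=30)
        resp.raise_for_status()


    if __name__ == "__main__":
        work = pathlib.Path("ocr_output")
        work.mkdir(exist_ok=True)
        for image in sorted(pathlib.Path("scans").glob("*.png")):
            extracted = ocr_image(image, work)
            index_document(extracted, image.name)

The reason the OCR text is written to disk first is the point Rick makes above: OCR is the expensive step, so once the text files exist you can re-index Solr as often as you like while you fine-tune the schema, without re-running Tesseract.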
