Thanks Rick. Minutes of CPU per page is definitely going to break my site. I'm looking for someone to hire, as I have no coding knowledge. Please let me know if you are up for it.
On Mon, Nov 6, 2017 at 8:05 PM, Rick Leir <rl...@leirtech.com> wrote:

> Anand,
> As Charlie says, you should have a separate process for this. Also, if you
> go back about ten months in this mailing list you will see some discussion
> about how OCR can take minutes of CPU per page and needs some
> preprocessing with ImageMagick or GraphicsMagick. You will want to do some
> fine tuning with this, then save your OCR output in a DB or the filesystem.
> Then you will want to be able to re-index Solr easily as you fine-tune Solr.
>
> Yes, use Python or your preferred scripting language.
> Cheers -- Rick
>
> On November 6, 2017 4:05:42 AM EST, Charlie Hull <char...@flax.co.uk> wrote:
> >On 03/11/2017 15:32, Admin eLawJournal wrote:
> >> Hi,
> >> I have read that we can use Tesseract with Solr to index image files. I
> >> would like some guidance on setting this up.
> >>
> >> Currently, I am using Solr for searching my WordPress installation via
> >> the WPSOLR plugin.
> >>
> >> I have Solr 6.6 installed on Ubuntu 14.04, which is working fine with
> >> WordPress.
> >>
> >> I have also installed Tesseract but have no clue how to configure it.
> >>
> >> I am new to Solr, so I will greatly appreciate detailed step-by-step
> >> instructions.
> >
> >Hi,
> >
> >I'm guessing that if you're using a preconfigured Solr plugin for WordPress
> >you probably haven't got your hands properly dirty with Solr yet.
> >
> >One way to use Tesseract would be via Apache Tika
> >https://wiki.apache.org/tika/TikaOCR which is an awesome library for
> >extracting plain text from many different document formats and types.
> >There's a direct way to use Tesseract from within Solr (the
> >ExtractingRequestHandler
> >https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika)
> >but we don't generally recommend this, as dodgy files can sometimes eat
> >all your resources during parsing, and if Tika dies then so does Solr. We
> >usually process the files externally and then feed them to Solr using its
> >HTTP API.
> >
> >Here's one way to do it - a simple server wrapper around Tika
> >https://github.com/mattflax/dropwizard-tika-server written by my
> >colleague Matt Pearce.
> >
> >So you're going to need to do some coding, I think - Python would be a
> >good choice - to feed your source files to Tika for OCR and extraction,
> >and then the resulting text to Solr for indexing.
> >
> >Cheers
> >
> >Charlie
> >
> >> Thank you very much
> >
> >--
> >Charlie Hull
> >Flax - Open Source Enterprise Search
> >
> >tel/fax: +44 (0)8700 118334
> >mobile: +44 (0)7767 825828
> >web: www.flax.co.uk
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
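
For anyone else following this thread, below is a rough sketch of the external pipeline Rick and Charlie describe, written in Python since that is what they suggest. It is illustrative only, not a tested implementation: it assumes ImageMagick's "convert" and Tesseract are on the PATH, that Solr is running at localhost:8983, that the core is called "wordpress", and that the field names (content_txt, source_s) are handled by Solr's default dynamic-field rules. All of those names are assumptions you would adapt to your own setup (WPSOLR's schema may differ).

    #!/usr/bin/env python3
    """Sketch of an external OCR-to-Solr pipeline (illustrative only).

    Assumptions: ImageMagick ('convert') and Tesseract are on the PATH,
    Solr 6.6 is running on localhost:8983, and the core name is
    'wordpress'. Adjust core name, field names, and paths to your setup.
    """
    import pathlib
    import subprocess
    import uuid

    import requests

    # Hypothetical core name; replace with the core WPSOLR actually uses.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/wordpress/update/json/docs?commit=true"


    def ocr_image(image_path: pathlib.Path, workdir: pathlib.Path) -> str:
        """Preprocess one image with ImageMagick, then OCR it with Tesseract."""
        cleaned = workdir / (image_path.stem + "_clean.png")
        # Basic preprocessing: greyscale and upscale. Tune these flags for
        # your scans; this is the "fine tuning" step Rick mentions.
        subprocess.run(
            ["convert", str(image_path), "-colorspace", "Gray",
             "-resize", "200%", str(cleaned)],
            check=True,
        )
        out_base = workdir / image_path.stem
        # Tesseract writes its output to <out_base>.txt
        subprocess.run(["tesseract", str(cleaned), str(out_base)], check=True)
        return pathlib.Path(str(out_base) + ".txt").read_text(encoding="utf-8")


    def index_document(text: str, source: str) -> None:
        """Send one extracted document to Solr over its HTTP API."""
        # content_txt / source_s assume Solr's default *_txt / *_s dynamic fields.
        doc = {"id": str(uuid.uuid4()), "source_s": source, "content_txt": text}
        resp = requests.post(SOLR_UPDATE_URL, json=doc, timeout=30)
        resp.raise_for_status()


    if __name__ == "__main__":
        work = pathlib.Path("ocr_output")
        work.mkdir(exist_ok=True)
        for image in sorted(pathlib.Path("scans").glob("*.png")):
            extracted = ocr_image(image, work)
            index_document(extracted, image.name)

The reason the OCR text is written to disk first is the point Rick makes above: OCR is the expensive step, so once the text files exist you can re-index Solr as often as you like while you fine-tune the schema, without re-running Tesseract.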
