Hi Gus,

Thank you so much! I will definitely take a look at it during the day.


Martin Frank Hansen,

-----Original message-----
From: Gus Heck <gus.h...@gmail.com>
Sent: 22 October 2018 00:06
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Hi Martin,

I wrote a framework (https://github.com/nsoft/jesterj) that is meant to help 
with small to medium custom solutions. It's not (yet) ready for cases where you 
need multiple machines feeding data, but so long as a single box can do the 
work it should be useful. It has a basic Tika stage which is ripe for 
enhancement. The example in the project uses Tika to extract text from 
Shakespeare's plays, though I'll admit that the Tika processor class has not 
yet been given the full set of configuration options. Fleshing that out is on 
the list of things to do and would be easy and welcome as a contribution 
(https://github.com/nsoft/jesterj/issues/74).

-Gus


On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> Usually, we just say to do a custom solution using the SolrJ client to
> connect. This gives you maximum flexibility and allows you to integrate
> Tika either inside your code or as a server. The latest Tika actually has
> some off-thread handling, I believe, to make it safer to embed.
>
> For DIH alternatives, if you want configuration over custom code, you
> could look at something like Apache NiFi. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>    Alex.
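
A minimal sketch of the "run Tika outside Solr" approach described above, using only the Python standard library rather than SolrJ. It assumes a Tika server on its default port 9998 with Tesseract and the Danish language data installed, a Solr core with the hypothetical name "docs", and a dynamic text field named "content_txt". All three are assumptions, not details from this thread. The X-Tika-OCRLanguage header is how Tika server is told which Tesseract language to use.

```python
import json
import urllib.request

# Assumed endpoints: a local Tika server and a Solr core called "docs"
# (hypothetical name, adjust to your setup).
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(data: bytes, ocr_language: str = "dan") -> urllib.request.Request:
    """Build a PUT request that sends raw file bytes to Tika server."""
    return urllib.request.Request(
        TIKA_URL,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Tells Tika server which Tesseract language to OCR with.
            "X-Tika-OCRLanguage": ocr_language,
        },
    )


def build_solr_request(doc_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for Solr's JSON update handler.

    "content_txt" is an assumed field name relying on a *_txt dynamic field.
    """
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path: str) -> None:
    """Extract text with Tika outside Solr, then send the text to Solr."""
    with open(path, "rb") as handle:
        tika_request = build_tika_request(handle.read())
    with urllib.request.urlopen(tika_request) as response:
        text = response.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(path, text)) as response:
        response.read()
```

Calling index_file("scan.tif") needs both servers running. The two builder functions can be checked without any network access. Keeping extraction in a separate process means a large or malformed file can exhaust Tika's memory without taking the Solr instance down with it.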
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <m...@kmd.dk>
> wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes, right now it is just for testing the possibilities of Solr and
> > Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it
> > work.
> >
> > You said that DIH is not recommended for production usage. What is
> > the recommended method to upload data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -----Original message-----
> > From: Alexandre Rafalovitch <arafa...@gmail.com>
> > Sent: 21 October 2018 16:26
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: Tesseract language
> >
> > There are a couple of things mixed in here:
> > 1) The extract handler is not recommended for production usage. It is
> > great for a quick test, just like you did, but when going to
> > production, running it externally is better. Tika, especially with
> > large files, can use up a lot of memory and trip up the Solr instance
> > it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr
> > by specifying a parseContext.config file, as shown at the link and
> > described further down in the same document:
> >
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > You still need to check the Tika documentation to see whether
> > Tesseract can take its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, the Data Import
> > Handler can iterate through files and then, as a nested entity, feed
> > them to the Tika processor for further extraction. I think one of the
> > examples shows that.
> > However, I am not sure you can pass parseContext that way, and DIH is
> > also not recommended for production.
> >
> > I hope this helps,
> >     Alex.
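
For the "several files from a file system" part of the question, the iteration itself does not have to go through DIH. Below is a minimal sketch in standard-library Python, assuming the scans are TIFF files somewhere below a single root directory. The function name and the suffix list are illustrative, not from this thread.

```python
from pathlib import Path
from typing import Iterator

# Assumed file extensions for the scanned documents.
TIFF_SUFFIXES = {".tif", ".tiff"}


def iter_tiff_files(root: str) -> Iterator[Path]:
    """Yield every TIFF file below root, recursively, in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        # Compare suffixes case-insensitively so "SCAN.TIF" is also found.
        if path.is_file() and path.suffix.lower() in TIFF_SUFFIXES:
            yield path
```

Each yielded path can then be handed to whatever does the extraction and upload, one file at a time, so a failure on one document does not stop the rest.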
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <m...@kmd.dk>
> wrote:
> >
> > > Hi again,
> > >
> > >
> > >
> > > > Does anyone have experience using Tesseract’s OCR
> > > > module within Solr? The files I am trying to read into Solr are
> > > > Danish TIFF documents.
> > >
> > >
> > >
> > >
> > >
> > > > *Martin Frank Hansen*, Senior Data Analyst
> > > >
> > > > Data, IM & Analytics
> > > >
> > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > > > www.kmd.dk Mobile +4525571418
> > >
> > >
> > >
> > > > *From:* Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > > *Sent:* 18 October 2018 13:30
> > > > *To:* solr-user@lucene.apache.org
> > > > *Subject:* Tesseract language
> > >
> > >
> > >
> > > Hi,
> > >
> > > > I have been trying to use Tesseract through the
> > > > data-import-handler in Solr and it actually works very well – with
> > > > English. As the documents are in Danish, I need to change the
> > > > language setting in Tesseract to Danish as well. Is that possible
> > > > from Solr?
> > > >
> > > >
> > > >
> > > > I was using the update/extract-handler to import single files into
> > > > Solr, and it worked for a single file. How would I import
> > > > several files from a file system?
> > >
> > >
> > >
> > > Here is the request-handler I used:
> > >
> > >
> > >
> > > > <requestHandler name="/update/extract"
> > > >                 startup="lazy"
> > > >                 class="solr.extraction.ExtractingRequestHandler">
> > > >   <lst name="defaults">
> > > >     <str name="lowernames">false</str>
> > > >     <str name="uprefix">ignored_</str>
> > > >     <str name="captureAttr">true</str>
> > > >   </lst>
> > > > </requestHandler>
> > >
> > >
> > >
> > >
> > >
>


--
http://www.the111shift.com
