I used tess4j for image formats and Tika for scanned PDFs and images within PDFs.
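
Either route only works if Tesseract can actually find the language data, which is what tripped up the Danish setup further down the thread. Below is a minimal sketch of a pre-flight check in plain Java. The class name `TessdataCheck` and the method `findTrainedData` are made up for illustration. Checking both folder layouts is an assumption on my part: as far as I know, Tesseract 3.x reads TESSDATA_PREFIX as the parent of the tessdata folder, while 4.x reads it as the tessdata folder itself, so please verify against your installed version.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class TessdataCheck {

    /**
     * Looks for <lang>.traineddata under the given prefix.
     * Two layouts are tried (assumption: which one applies depends on the
     * Tesseract version): the prefix is the tessdata folder itself, or the
     * prefix is the parent of a folder named "tessdata".
     */
    static Optional<Path> findTrainedData(Path prefix, String lang) {
        String fileName = lang + ".traineddata";

        Path direct = prefix.resolve(fileName);
        if (Files.isRegularFile(direct)) {
            return Optional.of(direct);
        }

        Path nested = prefix.resolve("tessdata").resolve(fileName);
        if (Files.isRegularFile(nested)) {
            return Optional.of(nested);
        }

        return Optional.empty();
    }

    public static void main(String[] args) {
        // Language code as passed to "tesseract -l"; defaults to Danish.
        String lang = args.length > 0 ? args[0] : "dan";

        // Read the variable as seen by this JVM, not by the shell.
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null || prefix.isEmpty()) {
            System.out.println("TESSDATA_PREFIX is not set in this JVM's environment");
            return;
        }

        Optional<Path> found = findTrainedData(Paths.get(prefix), lang);
        if (found.isPresent()) {
            System.out.println("Found " + found.get());
        } else {
            System.out.println("No " + lang + ".traineddata under " + prefix);
        }
    }
}
```

Run it from the same shell or IDE that starts the Tika or tess4j code. A variable added after that process was launched is not visible to the JVM until the process is restarted.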

Regards,
Rohan Kasat

On Sat, Oct 27, 2018 at 12:39 AM Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:

> Hi Rohan,
>
> Thanks for your reply. Are you using tess4j with Tika or on its own? I
> will take a look at tess4j if I can't make it work with Tika alone.
>
> Best regards
> Martin
>
> -----Original Message-----
> From: Rohan Kasat <rohan.ka...@gmail.com>
> Sent: 26 October 2018 21:45
> To: solr-user@lucene.apache.org
> Subject: Re: Tesseract language
>
> Hi Martin,
>
> Are you using it for image formats? I think you can try tess4j and
> give TESSDATA_PREFIX as the home for the Tesseract configs.
>
> I have tried it and it works pretty well on my local machine.
>
> I have used Java 8 and Tesseract 3 for this.
>
> Regards,
> Rohan Kasat
>
> On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) <m...@kmd.dk>
> wrote:
>
> > Hi Tim,
> >
> > You were right.
> >
> > When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> > dan`, I got an error message, so I downloaded "dan.traineddata" and
> > added it to the Tesseract-OCR/tessdata folder. Furthermore, I added the
> > 'TESSDATA_PREFIX' variable to the path variables, pointing to
> > "Tesseract-OCR/tessdata".
> >
> > Now Tesseract works with Danish from the CMD, but I can't make the
> > code work in Java any more, not even with default settings (which I
> > could before). Am I missing something or just mixing some things up?
> >
> > -----Original Message-----
> > From: Tim Allison <talli...@apache.org>
> > Sent: 26 October 2018 19:58
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tesseract language
> >
> > Tika relies on you to install tesseract and all the language libraries
> > you'll need.
> >
> > If you can successfully call `tesseract testing/eurotext.png
> > testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> > with your code above.
> >
> > On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> > <m...@kmd.dk> wrote:
> > >
> > > Hi again,
> > >
> > > Now I moved the OCR part to Tika, but I still can't make it work
> > > with Danish. It works when using default language settings, and it
> > > seems like Tika is missing the Danish dictionary.
> > >
> > > My Java code looks like this:
> > >
> > > {
> > >     File file = new File(pathfilename);
> > >
> > >     Metadata meta = new Metadata();
> > >
> > >     InputStream stream = TikaInputStream.get(file);
> > >
> > >     Parser parser = new AutoDetectParser();
> > >     BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > >
> > >     TesseractOCRConfig config = new TesseractOCRConfig();
> > >     config.setLanguage("dan"); // code works if this line is commented out.
> > >
> > >     ParseContext parseContext = new ParseContext();
> > >
> > >     parseContext.set(TesseractOCRConfig.class, config);
> > >
> > >     parser.parse(stream, handler, meta, parseContext);
> > >     System.out.println(handler.toString());
> > > }
> > >
> > > Hope that someone can help here.
> > >
> > > -----Original Message-----
> > > From: Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > Sent: 22 October 2018 07:58
> > > To: solr-user@lucene.apache.org
> > > Subject: SV: Tesseract language
> > >
> > > Hi Erick,
> > >
> > > Thanks for the help! I will take a look at it.
> > >
> > > Martin Frank Hansen, Senior Data Analyst
> > >
> > > Data, IM & Analytics
> > >
> > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk Web
> > > www.kmd.dk Mobile +4525571418
> > >
> > > -----Original Message-----
> > > From: Erick Erickson <erickerick...@gmail.com>
> > > Sent: 21 October 2018 22:49
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: Tesseract language
> > >
> > > Here's a skeletal program that uses Tika in a stand-alone client.
> > > Rip the RDBMS parts out....
> > >
> > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > >
> > > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch
> > > <arafa...@gmail.com> wrote:
> > > >
> > > > Usually, we just say to do a custom solution using the SolrJ client
> > > > to connect. This gives you maximum flexibility and allows you to
> > > > integrate Tika either inside your code or as a server. The latest
> > > > Tika actually has some off-thread handling, I believe, to make it
> > > > safer to embed.
> > > >
> > > > For DIH alternatives, if you want configuration over custom code,
> > > > you could look at something like Apache NiFi. It can push data
> > > > into Solr. Obviously it is a bigger solution, but it is
> > > > correspondingly more robust too.
> > > >
> > > > Regards,
> > > >    Alex.
> > > >
> > > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)
> > > > <m...@kmd.dk> wrote:
> > > > >
> > > > > Hi Alexandre,
> > > > >
> > > > > Thanks for your reply.
> > > > >
> > > > > Yes, right now it is just for testing the possibilities of Solr
> > > > > and Tesseract.
> > > > >
> > > > > I will take a look at the Tika documentation to see if I can
> > > > > make it work.
> > > > >
> > > > > You said that DIH is not recommended for production usage. What
> > > > > is the recommended method to upload data to a Solr instance?
> > > > >
> > > > > Best regards
> > > > >
> > > > > Martin Frank Hansen
> > > > >
> > > > > -----Original Message-----
> > > > > From: Alexandre Rafalovitch <arafa...@gmail.com>
> > > > > Sent: 21 October 2018 16:26
> > > > > To: solr-user <solr-user@lucene.apache.org>
> > > > > Subject: Re: Tesseract language
> > > > >
> > > > > There are a couple of things mixed in here:
> > > > > 1) The extract handler is not recommended for production usage.
> > > > > It is great for a quick test, just like you did, but when going
> > > > > to production, running it externally is better. Tika, especially
> > > > > with large files, can use up a lot of memory and trip up the Solr
> > > > > instance it is running within.
> > > > > 2) If you are still just testing, you can configure Tika within
> > > > > Solr by specifying a parseContext.config file as shown at the
> > > > > link and described further down in the same document:
> > > > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > > > You still need to check the Tika documentation to see whether
> > > > > Tesseract can take its configuration from the parseContext file.
> > > > > 3) If you are still testing with multiple files, Data Import
> > > > > Handler can iterate through files and then, as a nested entity,
> > > > > feed them to the Tika processor for further extraction. I think
> > > > > one of the examples shows that. However, I am not sure you can
> > > > > pass parseContext that way, and DIH is also not recommended for
> > > > > production.
> > > > >
> > > > > I hope this helps,
> > > > >     Alex.
> > > > >
> > > > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)
> > > > > <m...@kmd.dk> wrote:
> > > > >
> > > > > > Hi again,
> > > > > >
> > > > > > Is there anyone who has some experience of using Tesseract's
> > > > > > OCR module within Solr? The files I am trying to read into
> > > > > > Solr are Danish TIFF documents.
> > > > > >
> > > > > > *From:* Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > > > > *Sent:* 18 October 2018 13:30
> > > > > > *To:* solr-user@lucene.apache.org
> > > > > > *Subject:* Tesseract language
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have been trying to use Tesseract through the
> > > > > > data-import-handler in Solr and it actually works very well
> > > > > > with English. As the documents are in Danish, I need to change
> > > > > > the language setting in Tesseract to Danish as well. Is that
> > > > > > possible from Solr?
> > > > > >
> > > > > > I was using the update/extract-handler to import single files
> > > > > > into Solr, and it worked for a single file. How would I
> > > > > > handle several files from a file system?
> > > > > >
> > > > > > Here is the request-handler I used:
> > > > > >
> > > > > > <requestHandler name="/update/extract"
> > > > > >                 startup="lazy"
> > > > > >                 class="solr.extraction.ExtractingRequestHandler">
> > > > > >   <lst name="defaults">
> > > > > >     <str name="lowernames">false</str>
> > > > > >     <str name="uprefix">ignored_</str>
> > > > > >     <str name="captureAttr">true</str>
> > > > > >   </lst>
> > > > > > </requestHandler>