Hi All,
I have been following this thread regarding the performance of Tesseract.
We process large PDFs (100s of pages, each page converted to BMP) and
send the pages to Tesseract for processing one by one.
I am interested only in identifying the orientation of the text in the
image and rotating the image based on the orientation identified.
I can see that each image takes nearly 3 seconds on average, so a
hundred-page PDF takes around 275 - 300 seconds. Isn't this a bit too
high?
I am using the .NET tesseract wrapper 3.0.2 now. Is a newer release
available, and would it improve performance?
Also, all of my Tesseract functionality is implemented in a .NET assembly
(DLL), which is then called from our Delphi client.
I understand that the Tesseract init process is a bit costly, but I am
wondering how to initialize it only once in the .NET assembly (DLL) and
reuse it for all pages of the PDF, so that I save time when sending
subsequent pages from Delphi to the .NET assembly for processing.
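
To make the question concrete, here is a minimal sketch of the init-once
pattern I have in mind, written in Python for brevity rather than in .NET.
OrientationEngine, get_engine and process_page are hypothetical names and
not part of any Tesseract wrapper; the engine class only stands in for
whatever object is expensive to create.

```python
import threading


class OrientationEngine:
    """Hypothetical stand-in for an OCR engine that is costly to create."""

    instances_created = 0  # counts how many times the costly setup ran

    def __init__(self):
        # A real wrapper would load its language data here (assumption).
        OrientationEngine.instances_created += 1

    def detect_orientation(self, page):
        # Placeholder logic: a real engine would analyse the image instead.
        return page.get("orientation", 0)


_engine = None
_engine_lock = threading.Lock()


def get_engine():
    """Create the engine on first use, then return the same instance."""
    global _engine
    with _engine_lock:  # guards against two callers initializing at once
        if _engine is None:
            _engine = OrientationEngine()
        return _engine


def process_page(page):
    """Entry point the client would call once per page."""
    return get_engine().detect_orientation(page)


# Three fake pages; the engine is constructed only for the first call.
pages = [{"orientation": 0}, {"orientation": 90}, {"orientation": 180}]
results = [process_page(p) for p in pages]
```

In a .NET assembly the equivalent would presumably be a static field that
holds the engine, and it would only help if the assembly stays loaded
between the calls from Delphi. I have not verified either point.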
Ta
Tomy
On Sunday, 14 February 2016 21:45:12 UTC+5:30, viraf wrote:
>
> I am new to tesseract and using it through Tess4J. I am trying to OCR
> faxes where pages are represented as TIFF (CCITT T.6) images - 2509 x 3530
> @ 300 dpi (1 bit - i.e. BW).
>
> I have two sets of questions
>
> *Speed*
> On an Intel i7-4800 MQ @ 2.7GHz I am getting approximately 6 PPM using 1
> thread. I was looking for suggestions on how to speed up page processing.
> I use parallelStream to process each page in a separate thread.
>
> *Training*
> I am trying to learn about training Tesseract for improved accuracy.
> Given that the fonts / box files used to generate eng.traineddata are not
> available, can one specify the fonts used for English?
> Also, is there a description of the various training artifacts? I used
> "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to
> extract the wordlist, but I was looking for documentation to better
> understand the various training artifacts.
>
> Thanks
>
> - viraf
>