[tesseract-ocr] Unable to compile Tesseract 4 for Android platform (libtesseract.so)

2020-04-22 Thread Kunal Singh
I am trying to compile Tesseract for the Android platform (armv7 and arm64 
architectures). As mentioned in tesseract_android, I tried to build the 
"libtesseract.so" file by running this command:

ndk-build -C tess-two-git/tess-two tesseract APP_ABI=arm64-v8a

But I am unable to run this command. I always get an "ndk-build command not 
found" error. I tried on a Windows 10 PC with Cygwin and also in a VirtualBox 
VM with Ubuntu 18.

How do we run this command to get the compiled file (libtesseract.so)? What 
am I missing here? Do we need any special tools to execute these commands? 
Or are there any prerequisites for running this command (such as 
installations, git repository clones, etc.)?
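A "command not found" error means the shell could not locate ndk-build on 
its PATH. ndk-build is a script that ships with the Android NDK, so the NDK 
has to be installed and its directory added to PATH (or the script called by 
its full path). A minimal sketch for checking this, assuming Python 3 is 
available; the helper name find_tool is hypothetical and only for 
illustration:

```python
import shutil


def find_tool(name, search_path=None):
    """Return the full path of an executable, or None if it cannot be found.

    search_path is an os.pathsep-separated directory list; when it is None,
    shutil.which falls back to the PATH environment variable.
    """
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_tool("ndk-build")
    if location is None:
        print("ndk-build is not on PATH; install the Android NDK and add its directory to PATH")
    else:
        print("ndk-build found at", location)
```

If the script reports that ndk-build is missing, the build command above 
cannot work in that shell regardless of the tess-two checkout.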

I don't have any experience with compiling Tesseract for any platform, so 
any suggestions are most welcome.

P.S.
The attachment shows how I am running the command.

Thanks,
Kunal

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/df173ec5-aea1-4975-96e6-af4656c110f5%40googlegroups.com.


Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-22 Thread Peyi Oyelo
I created akan.traineddata using the typical Tesseract 3 legacy 
workflow. I do not have word/freq/punc lists. For now I would like to 
train using LSTM to support as many fonts as possible, i.e. 45000 fonts. 
The existing akan.traineddata was only trained to work with DejaVu Sans.

New versions of akan.traineddata will be trained on 8 fonts that 
support Akan: DejaVu Sans, DejaVu Serif, FreeMono, FreeSans, FreeSerif, 
Liberation Mono, Liberation Sans and Liberation Serif. Across all 8, 
these fonts have 44 variants.

Thank you for the evaluation link.
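
For a quick check before setting up the tools linked below, the accuracy 
question can be approximated with a character error rate: the edit distance 
between the source text and the OCR output, divided by the length of the 
source text. This is a minimal sketch of that general idea, not a 
description of how ocrevalUAtion or ocreval compute their reports; the 
function names are hypothetical.

```python
def levenshtein(ref, hyp):
    """Number of single-character insertions, deletions and substitutions
    needed to turn ref into hyp (classic dynamic programming, one row at a time)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_ch in enumerate(ref, 1):
        cur = [i]
        for j, hyp_ch in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                       # delete ref_ch
                cur[j - 1] + 1,                    # insert hyp_ch
                prev[j - 1] + (ref_ch != hyp_ch),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Calling char_error_rate(source_text, ocr_text) on the contents of a test 
file and its result file gives a single number per file; lower is better, 
and 0.0 means the two texts are identical.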

On Wednesday, April 22, 2020 at 6:46:28 AM UTC+1, shree wrote:
>
> For evaluating OCR accuracy of tesseract models, you can use the following:
>
> https://github.com/impactcentre/ocrevalUAtion 
>
> or
>
> https://github.com/eddieantonio/ocreval
>
> How did you create akan.traineddata?
>
> Do you need to train it only for one font? 
>
> On Tue, Apr 21, 2020 at 11:06 PM Peyi Oyelo wrote:
>
>> Thank you for replying, Shree. I have zipped the entire document into 
>> Akan.zip.
>>
>>
>> I have attached the source training text file (Akan.dejavusans.txt) 
>> containing the text that is to be recognized by Tesseract. I have been able 
>> to generate a tiff file and a box file from Akan.dejavusans.txt, and the 
>> resulting files are labeled accordingly. I have also been able to recognize 
>> sample text with the trained model called Akan.traineddata. I do not know 
>> how to evaluate the accuracy of this model and would like to hear your 
>> thoughts. I have attached the results of the akan.traineddata trial on 
>> TestFileA (the source test txt, found in the testFile folder). The results 
>> of the test are in the same folder as testFilesA_results.
>>
>> It is worth noting that Akan uses the Latin script and differs in only 
>> 2 letters of the alphabet, specifically the letters Ɔ and Ɛ. It also does 
>> not contain the letters C, Q, V, X, and Z. Would it be better to just 
>> fine-tune the existing default eng.traineddata using LSTM?
>>
>> I have no wordlist, freq list, or punc.dawg files.
>> On Tuesday, April 21, 2020 at 5:39:31 PM UTC+1, shree wrote:
>>>
>>> Please share a couple of image files and their corresponding text 
>>> versions so that I can see what will work best.
>>>
>>> On Tue, Apr 21, 2020, 20:17 Peyi Oyelo wrote:
>>>
 Hello Shree, and sorry for reviving an old dead thread. I am currently 
 trying to train Tesseract to recognize the Akan language. I have been able 
 to create a trained data file that can recognize Akan; however, this does 
 not use Tesseract's LSTM network. I am now trying to perform LSTM training, 
 but I do not have ground-truth data for it. I have generated 
 synthetic tiff files from a txt file, but I am at a loss as to how to 
 automate the ground-truth generation process. I came across your post 
 here: 
 https://github.com/tesseract-ocr/tesstrain/issues/7 where you 
 described that it was possible, but I could not understand the code. 

 Could you please explain it to me, and how it would work with my tiff 
 files? I know it is a lot to ask, but thank you.
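
The general idea of turning a box file back into ground-truth text can be 
sketched as follows. This is not the code from the linked issue. It relies 
on the standard box file layout, where each line is a symbol followed by 
left, bottom, right, top and page number, and it assumes (as an assumption 
about the box files in question) that word gaps appear as a space symbol and 
line ends as a tab symbol. The function name box_to_text is hypothetical.

```python
def box_to_text(box_lines):
    """Rebuild text from box file lines of the form
    '<symbol> <left> <bottom> <right> <top> <page>'.

    Assumption: a tab symbol marks the end of a text line and a space symbol
    marks a word gap. Box files without such entries yield one unbroken run
    of glyphs.
    """
    pieces = []
    for raw in box_lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Split from the right so a symbol that is itself a space or tab survives.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip anything that is not a well-formed box entry
        symbol = parts[0]
        pieces.append("\n" if symbol == "\t" else symbol)
    return "".join(pieces)
```

Reading a box file with open(path, encoding="utf-8") and passing the file 
object to box_to_text returns the text of only those glyphs that were 
actually rendered, which is what the ground truth for the matching tiff 
should contain.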

 On Friday, January 6, 2017 at 12:09:15 PM UTC+1, shree wrote:
>
> Does anyone know of any utilities to convert a box file to a ground 
> truth text file?
>
> I am using tesstrain.sh, which uses text2image, to try out LSTM 
> training. However, because unrenderable words are not included in the 
> tifs, it is not possible to use the training_text as ground truth.
>
> Thanks!
>

>
>
> -- 
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>


[tesseract-ocr] Re: Checkbox Extraction as text after Fine tuning for new characters .

2020-04-22 Thread Piyush Chandra
Hi Apoorva,

Were you able to get the 3 check boxes OCRed? Did you get any errors while 
training, and how did you complete the training for your model?

Thanks & Regards,
Piyush

On Tuesday, 3 April 2018 14:29:38 UTC+5:30, Apoorv Khanna wrote:
>
> Hi all,
>
> I am able to extract a few check boxes after fine-tuning the English model, 
> but Tesseract is not able to extract all the check boxes.
>
> Thanks in advance
>
> Version used: *tesseract 4 beta*
> Font used for training: *Dejavu Sans*
> Number of symbols inserted in the training text: 14 each
>
> *Extracted text:*
> ☐not reported wnot reported zpnot reported
> cno Byes tno ☒yes ☐no ☑pyes
> not reported not reported ☐not reported
>
