[tesseract-ocr] Understanding the training algorithm response

2020-06-05 Thread Piyush Chandra
You can check this link for all your queries. https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#iterations-and-checkpoints -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails

[tesseract-ocr] Re: No tessdata for sat (Santali language, Ol Chiki Script ) in respository

2020-06-04 Thread Piyush Chandra
Here is the process to create a new traineddata file: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00 On Monday, 1 June 2020 12:24:11 UTC+5:30, Prasanta Hembram wrote: > > There is no tessdata for Santali language :- > >1. https://github.com/tesseract-ocr/tessdata >2.

[tesseract-ocr] Re: Some quesetions about ocrd-train

2020-06-04 Thread Piyush Chandra
radical-stroke.txt is used only for CJK languages, but Tesseract checks for it during the training process, so you need to make it available. You are doing it correctly. On Thursday, 4 June 2020 11:10:45 UTC+5:30, 易鑫 wrote: > > Hello, everyone: > Currently I use the

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-04 Thread Piyush Chandra
This is what is missing: --net_spec. Check the line below, which I mentioned before. lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]" --train_listfile ./eng_ltsm/eng.training_files.txt
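
The command quoted in this message can also be assembled from a script. The following is a minimal Python sketch, assuming the same paths and network spec as in the message. The helper name `build_lstmtraining_command` is hypothetical, and only the four flags that appear in the message are used.

```python
import shlex


def build_lstmtraining_command(traineddata, model_output, net_spec, train_listfile):
    """Return the lstmtraining argument list, using only flags quoted in the thread."""
    return [
        "lstmtraining",
        "--traineddata", traineddata,
        "--model_output", model_output,
        "--net_spec", net_spec,          # the flag the message says was missing
        "--train_listfile", train_listfile,
    ]


if __name__ == "__main__":
    cmd = build_lstmtraining_command(
        "./out/own/own.traineddata",
        "./output/own",
        "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]",
        "./eng_ltsm/eng.training_files.txt",
    )
    # Print a shell-quoted version for inspection; the list itself can be
    # passed to subprocess.run(cmd, check=True) to start training.
    print(shlex.join(cmd))
```

Keeping the arguments in a list avoids shell-quoting problems with the brackets and spaces inside the network spec.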

[tesseract-ocr] Re: Checkbox Extraction as text after Fine tuning for new characters .

2020-06-04 Thread Piyush Chandra
nesday, 22 April 2020 16:55:26 UTC+5:30, Piyush Chandra wrote: >> >> Hi Apoorva, >> >> Were you able to get the 3 check boxes OCRed? Did you get any errors >> while training and how did you complete the training for your model? >> >> Thanks & Regard

[tesseract-ocr] Re: Read subtitles with Tesseract

2020-06-04 Thread Piyush Chandra
Pre-process the image by cropping it to the region where you want OCR to be done. You can use OpenCV to achieve that. On Thursday, 4 June 2020 09:10:55 UTC+5:30, Shefali Modi wrote: > > With below code I am trying to read the subtitles but code is returning > the text present at top. > Any suggestion how
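
The cropping suggestion could be sketched as follows in Python. This assumes the subtitles sit in the bottom quarter of the frame, which is only a guess for illustration; the function names, the `fraction` parameter and the file paths are hypothetical. OpenCV is imported only inside the file helper, since an image loaded with `cv2.imread` is an array of rows and can be sliced directly.

```python
def subtitle_band(height, fraction=0.25):
    """Return (top, bottom) row indices covering the bottom `fraction` of an image."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    band = max(1, round(height * fraction))
    return height - band, height


def crop_subtitle_region(image, fraction=0.25):
    """Keep only the bottom band of an image stored as a sequence of rows."""
    top, bottom = subtitle_band(len(image), fraction)
    return image[top:bottom]


def crop_file(src_path, dst_path, fraction=0.25):
    """Load an image with OpenCV, crop the bottom band, and save the result."""
    import cv2  # opencv-python; imported lazily so the helpers above work without it

    frame = cv2.imread(src_path)
    if frame is None:
        raise FileNotFoundError(src_path)
    cv2.imwrite(dst_path, crop_subtitle_region(frame, fraction))
```

The cropped file can then be passed to Tesseract in place of the full frame, so that text at the top of the image is never seen by the OCR engine.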

[tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Piyush Chandra
; > On Thursday, 28 May 2020, 8:04:03 UTC+3, user Piyush Chandra wrote: >> >> Hope below information helps: :) >> >> > Pls, some questions: > > Is it required: "--words...", "--numbers..." and "--puncs"? > Why do need "--net_s

[tesseract-ocr] Re: Extraction of English and Thai text from documents

2020-05-28 Thread Piyush Chandra
1. Tesseract has always had problems with tables. I would suggest removing the tables and pre-processing the image (upscaling, thresholding, grayscale conversion, etc.) to improve accuracy. 2. Try posting your sample images and results to get a better reply. On Monday, 18 May 2020

[tesseract-ocr] Re: (Question) UB Mannheims's Windows installer options

2020-05-28 Thread Piyush Chandra
ScrollView is the jar file used for debugging in Tesseract. I am not sure what you mean by script and language data. On Sunday, 24 May 2020 12:06:36 UTC+5:30, Axel Gold wrote: > > Hello, I am trying to install Tesseract 5.0.0 alpha using the installer > built by UB Mannheim

[tesseract-ocr] Re: Tesseract-ocr image not able to read the exact data .. Please reply me as soon as possible.

2020-05-28 Thread Piyush Chandra
Please send me the Java sample code for pre-processing the images. > > Thanks, > Piyush > > > On Thursday, May 28, 2020 at 11:52:16 AM UTC+5:30, Piyush Chandra wrote: >> >> 1. You need to work on pre processing the images. >> >> 2. The first image I tried, 18

[tesseract-ocr] Re: Tesseract-ocr image not able to read the exact data .. Please reply me as soon as possible.

2020-05-28 Thread Piyush Chandra
1. You need to work on pre-processing the images. 2. For the first image I tried, a 180-degree rotation was required. tesseract Sample1_3.png sam1 -l osd --psm 0 Result: Page number: 0 Orientation in degrees: 0 Rotate: 0 Orientation confidence: 0.96 Script: Latin Script confidence: 11.67 3. After
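
The OSD result quoted in this message can be read into a script to decide whether an image needs rotating. The sketch below is a minimal Python example, assuming the result is written with one `Key: value` field per line (the archive has flattened it onto one line); the function name `parse_osd` is hypothetical.

```python
def parse_osd(text):
    """Parse 'Key: value' lines of an OSD result into a dict of strings."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            result[key.strip()] = value.strip()
    return result


# The values quoted in the message, one field per line (assumed layout).
SAMPLE = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.96
Script: Latin
Script confidence: 11.67"""


if __name__ == "__main__":
    osd = parse_osd(SAMPLE)
    rotate = int(osd["Rotate"])
    confidence = float(osd["Orientation confidence"])
    print(f"rotate by {rotate} degrees (orientation confidence {confidence})")
```

A low orientation confidence is a hint that the reported rotation should be checked by eye before it is applied.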

[tesseract-ocr] Re: Creating trainneddata from box files

2020-05-27 Thread Piyush Chandra
Hi, I hope the information below helps. :) Creating the trained data file own.traineddata: Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox Create unicharset file: unicharset_extractor --norm_mode 1 --output_unicharset ./output/folder/own.unicharset
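
The first two steps of this message can be sketched in Python as a pair of command builders. This is only an illustration under stated assumptions: the helper names are hypothetical, each box file is assumed to be written next to its image as `<output base>.box`, and the box files are passed to `unicharset_extractor` as positional arguments, as in the `hin.desk0.box hin.desk1.box` invocation quoted elsewhere in this listing.

```python
from pathlib import Path


def lstmbox_command(image_path, output_base):
    """tesseract <image> <output base> lstmbox, as quoted in the message."""
    return ["tesseract", str(image_path), str(output_base), "lstmbox"]


def unicharset_command(box_files, output_unicharset, norm_mode=1):
    """unicharset_extractor with the two flags quoted in the message, then the box files."""
    return [
        "unicharset_extractor",
        "--norm_mode", str(norm_mode),
        "--output_unicharset", str(output_unicharset),
        *[str(b) for b in box_files],
    ]


def plan_commands(images, output_unicharset):
    """Return the box-file commands followed by one unicharset command."""
    bases = [Path(p).with_suffix("") for p in images]    # strip only the image extension
    boxes = [Path(f"{base}.box") for base in bases]      # assumed output name of lstmbox
    commands = [lstmbox_command(p, b) for p, b in zip(images, bases)]
    commands.append(unicharset_command(boxes, output_unicharset))
    return commands
```

Each returned list can be run with `subprocess.run(cmd, check=True)`; box files usually need manual correction before the unicharset step.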

Re: [tesseract-ocr] Tesseract 4 not reading Arabic numbers accurately using custom trained data file

2020-05-15 Thread Piyush Chandra
You need to put the radical-stroke.txt file in your script_dir folder. On Friday, 15 May 2020 14:57:36 UTC+5:30, nourhan magdy wrote: > > How can I use this text file? I downloaded the ara folder and copied it to my > tessdata but it didn't work > > On Friday, September 27, 2019 at 10:01:11 AM UTC+2,

[tesseract-ocr] Why does underlines (series of underscore or dots) is not getting detected by tesseract

2020-05-06 Thread Piyush Chandra
I was trying to do OCR on an image that has underlines (a series of underscores or dots), but they are not detected by Tesseract. Is some kind of configuration required for Tesseract to detect them? Please help. The image I used is attached. Thanks & Regards, Piyush

[tesseract-ocr] Re: Engineering drawings OCR

2020-04-28 Thread Piyush Chandra
Hi, First of all, please make sure you have a good-quality image; check this link for more info. If you still don't get the required result, then it is suggested to train Tesseract with that particular font. And yes, training helps in improved

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-26 Thread Piyush Chandra
usands of iterations. Try > fine-tuning. > > On Thu, Apr 16, 2020, 19:51 Piyush Chandra > wrote: > >> Hi Shree, >> >> Thanks for replying. >> >> So shall I remove them from text file and create a unicharset file after >> that or do I have d

[tesseract-ocr] Re: Checkbox Extraction as text after Fine tuning for new characters .

2020-04-22 Thread Piyush Chandra
Hi Apoorva, Were you able to get the 3 check boxes OCRed? Did you get any errors while training and how did you complete the training for your model? Thanks & Regards, Piyush On Tuesday, 3 April 2018 14:29:38 UTC+5:30, Apoorv Khanna wrote: > > Hi all, > > I am able to extract few check boxes

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Piyush Chandra
Hi Shree, Thanks for replying. So shall I remove them from the text file and create a unicharset file after that, or do I have to do something while creating the lstmf files? Also, will this affect the training if I don't remove them? I saw that training was continuing but the best char error was

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Piyush Chandra
you creating the box files? > > On Wed, Apr 15, 2020, 01:52 Piyush Chandra > wrote: > >> For other files, when I try on Linux, it comes out like this: >> >> unicharset_extractor --norm_mode 2 hin.desk0.box hin.desk1.box >> Extracting unicharset from box file hin.desk0.b

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Piyush Chandra
'ं' Invalid start of grapheme sequence:M=0x93f Normalization failed for string 'ि' On Tuesday, 14 April 2020 17:01:20 UTC+5:30, Piyush Chandra wrote: > > Hi Shree, > > When I used unicharset extractor command, I get these errors: > > unicharset_extractor --norm_mode 2 -
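
The strings named in these errors are Devanagari combining marks standing on their own: 0x93f is the vowel sign 'ि', which normally follows a base consonant. A rough way to find such entries before training is sketched below in Python. It is only a heuristic based on Unicode general categories, not Tesseract's own validation, and the function names are hypothetical.

```python
import unicodedata


def starts_with_combining_mark(text):
    """True if the first code point is a combining mark (Unicode category Mn, Mc or Me)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


def find_suspect_tokens(line):
    """Return whitespace-separated tokens that begin with a combining mark."""
    return [tok for tok in line.split() if starts_with_combining_mark(tok)]


if __name__ == "__main__":
    # "\u0915\u093f" is a consonant followed by a vowel sign (fine);
    # a lone "\u093f" or "\u0902" is the kind of token the errors complain about.
    sample = "\u0915\u093f \u093f \u0902"
    for token in find_suspect_tokens(sample):
        print("suspect token:", [hex(ord(ch)) for ch in token])
```

Running a check like this over the training text or box files shows which entries have a mark separated from its base character.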

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Piyush Chandra
x >> >> You can unpack any of the existing traineddata files from tessdata_best or >> tessdata_fast and check. >> >> combine_tessdata -u >> >> and look at the lstm-unicharset in the components >> >> On Thu, Apr 9, 2020 at 12:15 PM Piyush Chandra > > wr

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Piyush Chandra
# ज [91c ]x >> >> You can unpack any of the existing traineddata files from tessdata_best or >> tessdata_fast and check. >> >> combine_tessdata -u >> >> and look at the lstm-unicharset in the components >> >> On Thu, Apr 9, 2020 at 12:15 PM Piyush C

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Piyush Chandra
Thank you, Shree, for giving the overview. Could you please help me understand your last point? "Your unicharset should have Unicode codepoints." What does that mean? Any example would be helpful. I was actually using akshara (attached box file image). On Thursday, 9 April 2020 09:02:43

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Piyush Chandra
is not available for tesseract?? Any help will be appreciated. On Wednesday, 8 April 2020 21:58:37 UTC+5:30, shree wrote: > > Why do you want to fine-tune eng to get to hindi traineddata? > > You can fine-tune hin.traineddata or script/Devanagari.traineddata. > > On Wed, Apr 8, 2020, 21

[tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Piyush Chandra
When I downloaded Devanagari.unicharset, Latin.unicharset and radical-stroke.txt, it worked. What are these files and why do we need them? Do we need to use these every time we work on a new language, or do we need to create our own? On Wednesday, 8 April 2020 20:42:44 UTC+5:30, Piyush Chandra

[tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Piyush Chandra
On Wednesday, 8 April 2020 20:42:44 UTC+5:30, Piyush Chandra wrote: > > Hi, > > I am trying to create a Hindi traineddata from scratch using > eng.traineddata. > > I used some png and txt files to create box files using lstmbox and edited > those box files to correct the

[tesseract-ocr] Tesseract error while combine_lang_model

2020-04-08 Thread Piyush Chandra
Hi, I am trying to create a Hindi traineddata from scratch using eng.traineddata. I used some png and txt files to create box files using lstmbox and edited those box files to correct the words. Then, I used lstm.train to create lstmf files and created a unicharset file from the box files using