[tesseract-ocr] Unrecognized characters in the traineddata model

2017-08-14 Thread robertyoung0511
Hello,

I have extracted all the characters and their id numbers from 
chi_sim.traineddata and stored them in a text file. An excerpt of that 
file follows:

0 
1Joined
2|Broken|0|1
3S
4D
5F
68
77
80
9K
10O
11U
12H
13E
14I
154
165
171
189
19&
20C
21W
22N
23_
24P
25M
26T
27V
28R
29L
30A
31Y
322
33J
34B
35G
363
376
38Z
39X
40Q
41'
42+
43-
44.
45#
46e
47v
48a
49m
50i
51z
52o
53l
54s
55h
56n
57d
58g
59y
60u
61王
62汝
63敏
64邹
65立
66健
67熊
...
...
4013扔
4014嗨
4015髋
4016「
4017[
4018』
4019瀵
4020〕
4021掺
4022|"|0|2
4023|"|1|2
4024rn
4025|m|0|2
4026|m|1|2
4027in
4028cl
4029|d|0|2
4030|d|1|2
4031rm
4032|rm|0|2
4033|rm|1|2
4034nn
4035|nn|0|2
4036|nn|1|2
4037ri
4038|n|0|2
4039|n|1|2
4040|h|0|2
4041|h|1|2
4042|u|0|2
4043|u|1|2
4044|m|0|3
4045|m|1|3
4046|m|2|3
4047|H|0|2
4048|H|1|2
4049|H|0|3
4050|H|1|3
4051|H|2|3
4052|w|0|2
4053|w|1|2
4054|W|0|2
4055|W|1|2
4056fi
4057|k|0|2
4058|k|1|2
4059ki
4060|ki|0|2
4061|ki|1|2
4062|in|0|2
4063|in|1|2
4064tl
4065th
...


I can recognize most of the characters, such as the Han characters and the 
Latin alphabet. But some entries, such as 'Joined' and '|Broken|0|1' at the 
top of the file, and |"|0|2, |m|0|2 near the end of the file, I cannot identify.

Can you explain what these entries mean?
4059ki
4060|ki|0|2
4061|ki|1|2
4062|in|0|2
4063|in|1|2
 and so on


Thanks a lot.
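
For readers with the same question, here is a minimal sketch of how the entries in the listing can be told apart. It rests on my reading of the Tesseract source (unicharset.h), which is an assumption and not something stated in this thread: ' ', 'Joined' and '|Broken|0|1' appear to be reserved special entries, and an entry of the form |X|i|n appears to denote fragment i (counted from 0) out of n pieces of the character or ligature X. The name classify_unichar and the labels it returns are made up for illustration.

```python
import re

# Entries that, as far as I can tell, Tesseract reserves in a unicharset.
# The descriptions are my own informal labels, not official terminology.
SPECIAL_UNICHARS = {
    " ": "space / null entry",
    "Joined": "marker for joined characters",
    "|Broken|0|1": "marker for a broken character",
}

# Fragment entries as they appear in the listing: |<text>|<part>|<total>
FRAGMENT_RE = re.compile(r"^\|(.+)\|(\d+)\|(\d+)$")


def classify_unichar(text):
    """Return a (kind, detail) tuple for one unicharset entry."""
    # Check the reserved entries first: '|Broken|0|1' also looks like a fragment.
    if text in SPECIAL_UNICHARS:
        return ("special", SPECIAL_UNICHARS[text])
    match = FRAGMENT_RE.match(text)
    if match:
        return ("fragment", (match.group(1), int(match.group(2)), int(match.group(3))))
    if len(text) > 1:
        return ("multi-character", text)
    return ("character", text)


if __name__ == "__main__":
    for entry in ["Joined", "|Broken|0|1", "|m|0|2", "|ki|1|2", "ki", "王"]:
        print(repr(entry), "->", classify_unichar(entry))
```

Under that reading, 4059 'ki' is a two-letter entry treated as one unit, and 4060 '|ki|0|2' and 4061 '|ki|1|2' are its first and second halves.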

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/b042f6e0-7fc9-487b-bcc6-0acf22c343fd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: How to recognize letter / number combination

2017-08-14 Thread Isaias Barroso
Hi.

Have you tried disabling the dictionaries? 
 
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns

Best regards



On Tuesday, July 25, 2017 at 8:06:57 AM UTC-3, Jérémy Hannouna wrote:
>
> Hi,
>
> I'm trying to extract a number from a document. This number contains 
> letters and digits.
>
> If I try to recognize the line as a single word, the letter Z is 
> automatically converted to a 2, and if I try to recognize the line as 
> multiple words, the 6 before the Z is converted to a G.
>
> I don't know how to configure Tesseract to just recognize letters without 
> trying to interpret them in a "context".
>
> I'm not sure my explanation is clear.
>
> Can anyone here help me, please?
>
> Thank you
>
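
The suggestion above can be sketched as a small wrapper around the tesseract command line. This assumes the two variables named on the linked ImproveQuality page (load_system_dawg and load_freq_dawg) are what "disable the dictionaries" refers to. The function names and the file names in the usage note are hypothetical, and whether this actually stops the Z/2 and 6/G confusion for a given image is not something the thread confirms.

```python
import subprocess


def build_tesseract_command(image_path, output_base, psm=None):
    """Build a tesseract argv with both dictionary word lists switched off."""
    command = ["tesseract", image_path, output_base]
    if psm is not None:
        # Tesseract 4 spells this option --psm; 3.x releases used -psm.
        command += ["--psm", str(psm)]
    # Turn off the system word list and the frequent-words list.
    command += ["-c", "load_system_dawg=false", "-c", "load_freq_dawg=false"]
    return command


def run_ocr(image_path, output_base, psm=None):
    """Run tesseract; the recognized text is written to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base, psm), check=True)
```

For example, run_ocr("serial.png", "serial", psm=7) would treat the image as a single text line and write the result to serial.txt.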



[tesseract-ocr] Re: Extracting content from specific areas such as Account Number or Cheque Number from a Cheque

2017-08-14 Thread Isaias Barroso
Hi.

You will need some image processing with a library like OpenCV; see this 
great post from PyImageSearch:

http://www.pyimagesearch.com/2017/07/24/bank-check-ocr-with-opencv-and-python-part-i/

In my experience, this type of task always needs good preprocessing 
before Tesseract.


Best regards


On Thursday, March 6, 2014 at 5:36:26 AM UTC-3, Karthick S wrote:
>
> Hi, I am looking at building an app on Android which can take a picture of 
> a bank check and then pull out account number as well as the check number 
> from it. I do not want all text in the scanned image to be OCRed (which 
> seems to be happening with many apps doing this). The same goes with a 
> Driving License - I would like to pull the DL Number and the name of the 
> person holding the license from the scanned image of a DL. Can anyone help 
> on this?
>
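
As a rough illustration of the preprocessing idea in the reply above, one common first step is to cut the image down to the region that holds the field of interest, so that only that region is passed to OCR. The sketch below is plain Python on a list of pixel rows so that it runs without any library. With OpenCV the same crop would normally be an array slice. The assumption that the account and check numbers sit in the bottom 20% of the check is mine, for illustration only, and the function names are hypothetical.

```python
def bottom_strip_box(width, height, fraction=0.2):
    """Return (left, top, right, bottom) for the lowest `fraction` of an image."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Integer number of rows to keep, measured up from the bottom edge.
    strip_height = int(height * fraction)
    return (0, height - strip_height, width, height)


def crop(rows, box):
    """Crop an image stored as a list of pixel rows to `box`."""
    left, top, right, bottom = box
    return [row[left:right] for row in rows[top:bottom]]


if __name__ == "__main__":
    # A fake 4x10 "image" whose pixel values encode row and column.
    image = [[r * 10 + c for c in range(4)] for r in range(10)]
    print(crop(image, bottom_strip_box(4, 10)))
```

A real pipeline would still need the steps the linked post describes (locating and cleaning up the characters) before handing the cropped region to Tesseract.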



[tesseract-ocr] Re: Extracting content from specific areas such as Account Number or Cheque Number from a Cheque

2017-08-14 Thread Kiran Patil
Dear Karthick,

Did you manage to extract the account number, check number, and so on?

May I know the steps you took?

If you used a different solution, please let me know.

Regards,
Kiran.

On Thursday, 6 March 2014 14:06:26 UTC+5:30, Karthick S wrote:
>
> Hi, I am looking at building an app on Android which can take a picture of 
> a bank check and then pull out account number as well as the check number 
> from it. I do not want all text in the scanned image to be OCRed (which 
> seems to be happening with many apps doing this). The same goes with a 
> Driving License - I would like to pull the DL Number and the name of the 
> person holding the license from the scanned image of a DL. Can anyone help 
> on this?
>



[tesseract-ocr] Other case Л of л is not in unicharset

2017-08-14 Thread robertyoung0511
Hello,

I am following the new tutorial to fine-tune the traineddata. I want to add 
some specific symbols to the existing chi_sim.traineddata model.

First, I use the following command to create the new training data:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
  --linedata_only --noextract_font_properties \
  --langdata_dir ../langdata --fontlist "SIMSUN" \
  --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/trainspecial

But some specific symbols cannot be added to the unicharset file.

Part of the output is shown below:

=== Phase UP: Generating unicharset and unichar properties files ===
[2017年 08月 14日 星期一 15:59:17 CST] /usr/local/bin/unicharset_extractor -D 
/tmp/tmp.78WyISy4o7/chi_sim/ 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.SIMSUN.exp0.box
Extracting unicharset from 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.SIMSUN.exp0.box
Wrote unicharset file /tmp/tmp.78WyISy4o7/chi_sim//unicharset.
[2017年 08月 14日 星期一 15:59:17 CST] /usr/local/bin/set_unicharset_properties 
-U /tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset -O 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset -X 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.xheights --script_dir=../langdata
Loaded unicharset of size 1129 from file 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset
Setting unichar properties
Other case Л of л is not in unicharset
Other case Υ of υ is not in unicharset
Other case Π of π is not in unicharset
Other case Β of β is not in unicharset
Mirror ∼ of ∽ is not in unicharset
Mirror ⧵ of ∕ is not in unicharset
Other case σ of Σ is not in unicharset
Other case Ρ of ρ is not in unicharset
Mirror 》 of 《 is not in unicharset
Other case j of J is not in unicharset
Mirror 【 of 】 is not in unicharset
Mirror 「 of 」 is not in unicharset
Other case K of k is not in unicharset
Mirror { of } is not in unicharset
Other case q of Q is not in unicharset
Mirror 〗 of 〖 is not in unicharset
Setting script properties
Warning: properties incomplete for index 57 = )
Warning: properties incomplete for index 60 = :
Warning: properties incomplete for index 64 = !
Warning: properties incomplete for index 67 = ?
Warning: properties incomplete for index 73 = >
Warning: properties incomplete for index 81 = ;
Warning: properties incomplete for index 82 = ~
Warning: properties incomplete for index 90 = .
Warning: properties incomplete for index 98 = (
Warning: properties incomplete for index 99 = ゜
Warning: properties incomplete for index 115 = <
Warning: properties incomplete for index 190 = ,
Writing unicharset to file /tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset


This shows that some specific symbols, such as 'Л', '》', ..., are not 
added to the unicharset.


How can I add these symbols to the unicharset? Should I add them manually?
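
A small sketch that may help when working through output like the above: it collects the characters named in the "is not in unicharset" messages so they can be reviewed in one place. My reading of those messages, which is an assumption, is that the second character is present in the unicharset while its upper/lower-case or mirrored counterpart (the first character) is not. Since the log shows the unicharset being extracted from the generated .box file, a missing character would presumably have to occur in the training text to end up there. The function name is made up for illustration.

```python
import re

# Matches both message shapes in the log, for example
# "Other case Л of л is not in unicharset" and
# "Mirror 》 of 《 is not in unicharset".
MISSING_RE = re.compile(r"^(Other case|Mirror) (\S+) of (\S+) is not in unicharset$")


def missing_partners(log_text):
    """Return (kind, missing_char, present_char) tuples found in the log."""
    found = []
    for line in log_text.splitlines():
        match = MISSING_RE.match(line.strip())
        if match:
            found.append(match.groups())
    return found


if __name__ == "__main__":
    sample = (
        "Setting unichar properties\n"
        "Other case Л of л is not in unicharset\n"
        "Mirror 》 of 《 is not in unicharset\n"
        "Setting script properties\n"
    )
    for kind, missing, present in missing_partners(sample):
        print(kind, "missing:", missing, "present:", present)
```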



Re: [tesseract-ocr] Re: Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-14 Thread Ava Nimaee
I have a traineddata file at this path: 
/home/zohreh/tesstutorial/engtrian/eng/eng.traineddata
I created it with:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
  --training_text training/langdata/eng/eng.training_text \
  --linedata_only \
  --noextract_font_properties --langdata_dir training/langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Times New Roman," --output_dir ~/tesstutorial/engtrian

I also followed the link that you sent me.
Sorry, shree, I tried a lot but I couldn't solve it.


On Monday, August 7, 2017 at 10:28:05 PM UTC+4:30, shree wrote:
>
> You also need to provide a traineddata file as input
>
> Please review the updated training instructions in the wiki and change the 
> training commands accordingly.
>
> On 07-Aug-2017 6:15 PM, "Ava Nimaee"  
> wrote:
>
>> Hi, how did you solve it? I have this error too.
>> Please help me.
>>
>> On Friday, August 4, 2017 at 11:03:41 AM UTC+4:30, roberty...@gmail.com 
>> wrote:
>>>
>>> Hello,
>>>
>>> I used 'git pull' to update the code from 
>>> https://github.com/tesseract-ocr/tesseract.git, then recompiled and 
>>> reinstalled Tesseract 4.0.
>>>
>>> But when I execute the command shown below to fine-tune the 
>>> traineddata, an error appears: 
>>> "mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file 
>>> ../lstm/lstmtrainer.h, line 110"
>>>
>>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>> --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>> --train_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
>>> --eval_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
>>> --target_error_rate 0.01
>>>
>>>
>>>
>>> Tesseract worked fine before I updated the code, but now it crashes with 
>>> an assertion error. Why? Can you help me?
>>>
>
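
The advice quoted above ("provide a traineddata file as input") can be sketched as follows. The failing assertion mentions traineddata_path, so this sketch checks that the file exists before building the command and passes it through lstmtraining's --traineddata flag. That this resolves the assertion in every setup is an assumption, and the function name and the paths in the usage note are illustrative only.

```python
import os


def build_lstmtraining_command(traineddata, continue_from, model_output,
                               train_listfile, eval_listfile=None):
    """Build an lstmtraining argv that passes the traineddata file explicitly."""
    traineddata = os.path.expanduser(traineddata)
    # Fail early with a readable message instead of an assertion inside lstmtraining.
    if not os.path.isfile(traineddata):
        raise FileNotFoundError("traineddata not found: " + traineddata)
    command = [
        "lstmtraining",
        "--traineddata", traineddata,
        "--continue_from", os.path.expanduser(continue_from),
        "--model_output", os.path.expanduser(model_output),
        "--train_listfile", os.path.expanduser(train_listfile),
    ]
    if eval_listfile:
        command += ["--eval_listfile", os.path.expanduser(eval_listfile)]
    return command
```

The returned list can be handed to subprocess.run(command, check=True). For the setup described in this message, the first argument would be ~/tesstutorial/engtrian/eng/eng.traineddata.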
