from:"Seokbong Choi"

Re: [tesseract-ocr] Japanese - Problems with vertical words

2019-06-03 Thread Seokbong Choi

Are you using jpn_vert instead of jpn?
I have trained a jpn_vert model for vertical text:

https://github.com/zodiac3539/jpn_vert


On Mon, Jun 3, 2019 at 11:31 AM Shree Devi Kumar 
wrote:

> tesseract 4 has been trained on line images and hence gives better results
> for lines, as far as I have seen.
>
> On Sun, Jun 2, 2019 at 2:52 PM Jorge Castrillo 
> wrote:
>
>> Hi everyone. I'm making a program that uses tesseract to get a word
>> from a manga with a snipping-tool like program, and translates that word
>> with JMdict.
>> The thing is tesseract gives weird values for vertical, small selections.
>> I'm going to explain it in more detail:
>>
>>
>> Say I get a full horizontal line in Japanese, like  the following one:
>>
>> [image: horizontal_full.jpg]
>> The output "元来日本語は漢文に倣い、文字を上" is perfect
>>
>> Getting a full vertical line gives no problems either:
>>
>> [image: vertical_full.jpg]
>>
>> Gives the same correct output. Now if I want to get only words, when
>> examining horizontal text there are no problems, while with the vertical
>> text the output is almost always (except when examining a Kanji alone)
>> wrong, like this:
>>
>> [image: nih-horizontal.jpg]
>>
>>
>> [image: nih-vertical.jpg]
>>
>>
>> The first one returns 日本語 while the second one returns 髑升田.
>> They are both from the same file, same size, same font, yet the results
>> vary greatly.
>>
>>
>> Another example, this time from a manga:
>>
>> [image: ej2full.jpg]
>>
>> The output is 今日の勝敗よりも, again, correct.
>> But going word by word we start to have errors:
>>
>> [image: eje2-word1.jpg]
>> Output 由」〉
>>
>> and
>>
>> [image: ej2-word.jpg]
>> Output 健雛
>>
>> Why is it that it can examine the full line without problem, but have so
>> much trouble getting vertical words? I am using psm 8 for words, but it
>> only seems to work with horizontal ones, and I can't get my head around it.
>> I've been trying to find a solution to this all day, but without success.
>> I'm not an expert programmer by any means, this is more of a college
>> project, but any insight would be really, really appreciated. Thank you for
>> reading.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/71b34e0f-5713-42d3-9ba0-4926291758cb%40googlegroups.com
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
> --
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>

Re: [tesseract-ocr] Failing to run on OSX after installation with brew

2018-12-31 Thread Seokbong Choi

You need to install the prerequisite libraries.

https://gist.github.com/fractaledmind/cd2fc4125bef57bcb3e2
Please refer to lines 17-19. Thanks. Happy new year!






On Mon, Dec 31, 2018 at 6:49 AM Bernard Pochet  wrote:

> After installing (and reinstalling ...) with brew, I receive this message ...
>
> dyld: Library not loaded: /usr/local/opt/leptonica/lib/liblept.5.dylib
>   Referenced from: /usr/local/bin/tesseract
>   Reason: image not found
> Abort trap: 6
>
> need help
>
> Thanks (and happy new year)
>
> B
>

Re: [tesseract-ocr] How I extra the green word on a image

2018-12-23 Thread Seokbong Choi

Use an HSV filter; OpenCV can do this. I guess you don't need to filter on
the V (value) range, but keeping only the high S (saturation) range may work.
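As a minimal sketch of the saturation idea above, the snippet below keeps only strongly saturated pixels (such as green text) and turns everything else white, so the result is dark text on a light background. It uses only the Python standard library so that it runs anywhere; the function name and the 0.5 saturation threshold are assumptions for illustration and would need tuning on the real image. With OpenCV, the same filtering is normally done with cv2.cvtColor (cv2.COLOR_BGR2HSV) followed by cv2.inRange.

```python
import colorsys


def isolate_saturated_text(pixels, min_saturation=0.5):
    """Return a black-on-white image from rows of (R, G, B) pixels in 0..255.

    Pixels whose HSV saturation is at least `min_saturation` (an assumed,
    tunable threshold) become 0 (black text); all other pixels become 255.
    The V channel is deliberately ignored, as suggested in the reply.
    """
    result = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            _h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            out_row.append(0 if s >= min_saturation else 255)
        result.append(out_row)
    return result
```

A foggy grey or white background has low saturation regardless of its brightness, which is why filtering on S alone may be enough here.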


On Sun, Dec 23, 2018 at 10:09 AM 童虎  wrote:

>
> I want to use tesseract to extract the green text (which is a Chinese word)
> [image: t_first_name_more_foggy.png]
> but the result is not good.
>
> I also set `-c tessedit_write_images=1` to see the intermediate image, which
> is not good either
>
> [image: bb.png]
>
>
> How can I preprocess this picture? I'm new to image processing.
>

Re: [tesseract-ocr] [/usr/local/bin/language-specific.sh: 줄 1125: FONTS: unbound variable] Error help me!!

2018-12-05 Thread Seokbong Choi

Hello,

I think you are missing the "fontlist" argument.
The script below worked for Japanese.
Even if you want to train with all the fonts listed in language-specific.sh,
I would still suggest including the "fontlist" argument.

tesstrain.sh  \
  --fonts_dir /usr/share/fonts/ \
  --lang jpn \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir /langdata \
  --tessdata_dir /tessdata  \
  --output_dir ~/tesstutorial/horizon \
  --fontlist "TakaoExGothic" "TakaoExMincho"

Please keep in mind that the fonts you want to use should also be listed in
language-specific.sh.
Also, you may want to look at the VERTICAL_FONTS section to avoid the
situation where the sentences are rendered vertically; vertical rendering is
needed for Japanese or Chinese, but not for Korean.


On Wed, Dec 5, 2018 at 3:08 AM Zdenko Podobny  wrote:

> Do you use scripts from the master repository? There were some updates after
> the 4.0 release...
>
> Zdenko
>
>
> On Wed, Dec 5, 2018 at 8:19, SEUNGGWANSHIN  wrote:
>
>> Hello guys,
>>
>> I'm training the Tesseract LSTM model following
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>> and I have a problem using "tesstrain.sh".
>>
>> To create the training data, the wiki runs tesstrain.sh this way:
>>
>>   src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>     --noextract_font_properties --langdata_dir ../langdata \
>>     --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>>
>>
>> So my code is below.
>>
>> tesstrain.sh --fonts_dir /usr/share/fonts \
>>   --lang kor \
>>   --linedata_only \
>>   --noextract_font_properties \
>>   --langdata_dir ../langdata-master \
>>   --tessdata_dir tessdata/tessdata_fast/ \
>>   --output_dir kortrain
>>
>> My language is "kor", not "eng".
>> When I executed this script, I got an unknown error like this:
>>
>> === Starting training for language 'kor'
>>
>> /usr/local/bin/language-specific.sh: 줄 1125: FONTS: unbound variable
>>
>>
>> I checked the line reported in the error in language-specific.sh:
>>
>> 1124 kor ) MEAN_COUNT="20"
>> 1125   WORD_DAWG_FACTOR=0.015
>> 1126   NUMBER_DAWG_FACTOR=0.05
>> 1127   TRAINING_DATA_ARGUMENTS+=" --infrequent_ratio=1"
>> 1128   TRAINING_DATA_ARGUMENTS+=" --desired_bigrams="
>> 1129   GENERATE_WORD_BIGRAMS=0
>> 1130   FILTER_ARGUMENTS="--charset_filter=kor --segmenter_lang=kor"
>> 1131   test -z "$FONTS" && FONTS=( "${KOREAN_FONTS[@]}" ) ;;
>>
>>  312 KOREAN_FONTS=( \
>>  313 "Arial Unicode MS" \
>>  314 "Arial Unicode MS Bold" \
>>  315 "Baekmuk Batang Patched" \
>>  316 "Baekmuk Batang" \
>>  317 "Baekmuk Dotum" \
>>  318 "Baekmuk Gulim" \
>>  319 "Baekmuk Headline" \
>>  320 )
>>
>> I installed the Korean fonts without problems using ttf_mscorefonts_installer, etc.,
>> but I don't know why this error happens.
>>
>> Can anyone help me?
>>
>>

Re: [tesseract-ocr] Is Tesseract high security for commercial APP?

2018-12-03 Thread Seokbong Choi

Hello Long,

Tesseract does not require an internet connection to run. That eliminates
most concerns around network security. (As a matter of fact, most of the
current threat landscape stems from internet connectivity.)
However, I do not know how it would affect integrity and availability. I
suggest conducting white-box testing, since the source code is available on
GitHub. You can also follow a standard ISO or NIST security assessment
process for third-party libraries to assess its security.

Thanks.


On Mon, Dec 3, 2018 at 9:05 AM long zhao  wrote:

> Hi,
>
> I am new to Tesseract. I just want to ask whether Tesseract is secure
> enough for a commercial app, such as a banking app.
> Thanks a lot in advance
>
> Kind regards,
>

[tesseract-ocr] New jpn_vert.trainnedata

2018-11-26 Thread Seokbong Choi

Hello all,

Although the jpn_vert from tessdata_best worked well, it didn't serve my
purpose: reading comic books.
So I retrained it with a new font and with new expressions that most
Japanese comic books use.

https://github.com/zodiac3539/jpn_vert

   - Added more fonts: Othutome, the font that most comic books use.
   - Trained for almost 200,000 cycles. The character-level error rate is
     less than 0.3%.
   - Whenever Tesseract stumbles upon ♥ or ‼, it is likely to make a
     mistake that distorts the entire sentence, so I trained these
     characters thoroughly. The result is remarkable.

Feel free to leave any comment on my GitHub.


Re: [tesseract-ocr] Tesseract v4 generated incorrect text output

2018-11-26 Thread Seokbong Choi

Hello,

OEM and PSM are values that you have to set whenever you execute
tesseract.exe; the current version cannot detect the right ones
automatically. (I hope this improves in the next version.)
I guess you are in a situation where the best result for each image comes
from a different psm value, right? Unfortunately, that is manual work in
this version.
psm 4 generally works if your text is horizontally aligned, whereas psm 5
works for vertically aligned Chinese-Japanese-Korean (CJK) text.
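
As a minimal sketch of the rule of thumb above (psm 4 for horizontal text, psm 5 for vertical CJK text), the choice can be wrapped in a small helper so that the calling program sets it for each image. The function names and the `vertical` flag are assumptions for illustration and are not part of Tesseract; only the `-l` and `--psm` options come from the commands quoted in this thread.

```python
import subprocess


def build_tesseract_args(image_path, output_base, lang, vertical):
    """Build a tesseract command line with the page segmentation mode set explicitly.

    `vertical` is a caller-supplied flag (hypothetical; Tesseract does not
    provide it): True selects psm 5, False selects psm 4.
    """
    psm = 5 if vertical else 4
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, output_base, lang, vertical):
    """Run tesseract with the chosen mode; needs the tesseract binary on PATH."""
    args = build_tesseract_args(image_path, output_base, lang, vertical)
    subprocess.run(args, check=True)
```

The caller still has to know, or guess, whether a given image is vertical; the helper only removes the need to edit the command by hand.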

I ran your bmp with the psm 4 option and it worked, although the result may
not be exactly what you want: it appends 용 된 다 at the end of the
sentence. In that case, I would suggest retraining, which may improve
accuracy. (I had a similar issue with Japanese.) I hope this helps.

[image: image.png]


On Mon, Nov 26, 2018 at 12:18 PM Hwa Chuang  wrote:

> I was testing Tesseract v4 and found that some text files generated from
> images contain incorrect strings. For example, I have the image below:
>
>
> [image: 2018-11-26 11_29_42-Photos.png]
>
> $ ./tesseract.exe Korean.bmp Korean -l kor
> Tesseract Open Source OCR Engine v4.0.0 with Leptonica
>
> $ cat Korean.txt
> 을 만 나 서 반 가 워 요 ! 이 테 스 트 목 적 을 위해 사
>
> 志 巳
> 必 白
>
> It's pretty clear that the output text string is almost completely incorrect.
> However, I get the correct text string if the page segmentation mode is 11.
>
> $ ./tesseract.exe Korean.bmp Korean-psm11 -l kor --psm 11
> Tesseract Open Source OCR Engine v4.0.0 with Leptonica
>
> $ cat Korean-psm11.txt
> 당 신 이 선 생 님 을 만 나 서 반 가 워 요 ! 이 테 스 트 목 적 을 위해 사
>
> 용 된 다 .
>
> The problem is I can not change psm image by image.
>
> Any suggestion?
>
>

Re: [tesseract-ocr] Images with text in white color

2018-11-12 Thread Seokbong Choi

Use inverse Otsu thresholding from OpenCV.

https://www.meccanismocomplesso.org/en/opencv-python-otsu-binarization-thresholding/
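
To make the idea concrete, here is a minimal pure-Python sketch of Otsu thresholding followed by inversion, so that white text on a darker background becomes black text on white. It is written without OpenCV only so that it runs anywhere; the function names are assumptions for illustration. In OpenCV itself this is usually a single call to cv2.threshold with the flags cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU on a grayscale image.

```python
def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance.

    `gray` is a list of rows of integer intensities in 0..255.
    Pixels <= threshold form one class, pixels > threshold the other.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight0, sum0 = 0, 0
    for t in range(256):
        weight0 += hist[t]
        sum0 += t * hist[t]
        if weight0 == 0:
            continue  # no pixels in the lower class yet
        weight1 = total - weight0
        if weight1 == 0:
            break  # all pixels are in the lower class
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        between_var = weight0 * weight1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize_inverse(gray):
    """Inverse binarization: bright pixels (> threshold) become 0, the rest 255."""
    t = otsu_threshold(gray)
    return [[0 if value > t else 255 for value in row] for row in gray]
```

The inversion matters because the bright text ends up as the dark foreground, which is the polarity Tesseract was reported to handle in this thread.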


On Mon, Nov 12, 2018 at 6:38 AM raghunath rs 
wrote:

> Hi,
>
> I recently found that Tesseract 4 does not recognize images with
> white text on a colored background.
>
> Is there any specific preprocessing?
>
> Thanks,
> Raghu
>

Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

2018-10-19 Thread Seokbong Choi

Can you share the content of the "eng.training_files.txt" file that the
--train_listfile argument refers to?
Thanks.
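
For reference, the list file is plain text with one .lstmf path per line, which matches the "path/to/lstmf" format described in the quoted message. A minimal sketch for generating it is below; the function name and the directory layout are assumptions for illustration, not something taken from the poster's setup.

```python
from pathlib import Path


def write_training_list(lstmf_dir, list_path):
    """Write every .lstmf file found in `lstmf_dir` to `list_path`, one path per line.

    Returns the list of absolute paths that were written, sorted by name.
    """
    paths = sorted(str(p.resolve()) for p in Path(lstmf_dir).glob("*.lstmf"))
    Path(list_path).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths
```

Absolute paths are used here so that the list does not depend on the directory lstmtraining is started from.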

On Fri, Oct 19, 2018 at 1:59 PM tu tonquang  wrote:

> I want my application to be able to recognize characters like: 'Φ'
>
> On Saturday, October 20, 2018 at 00:56:01 UTC+7, tu tonquang wrote:
>>
>> Hi,
>>
>> *I have some errors when I follow this tutorial to retrain tesseract: *
>>
>> I follow this link to retrain tesseract with my image dataset (I retrain
>> tesseract with real image, not from text file via tesstrain.sh)
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata
>>
>> These are my steps to retrain the Tesseract LSTM model:
>>
>>
>> *Step 1: I create my training data (tif image + box file) from my images.*
>> I generated them with this command: tesseract
>> [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop
>> makebox
>>
>>
>> *Step 2: I edit the box files manually with Qt-box-editor.* (I followed this link:
>> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files
>> )
>> So now I have these files:
>> .tif file
>> .box file
>> .lstmf file (generated by the command: tesseract
>> [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] lstm.train)
>> unicharset file
>>
>>
>> *Step 3: I create .traineddata via this command:*
>> combine_lang_model --input_unicharset unicharset --script_dir langdata
>> --output_dir output --lang "eng"
>> With langdata I downloaded from here:
>> https://github.com/tesseract-ocr/langdata
>>
>>
>> *Step4: I extract existing model from exist traineddata by command:*
>> combine_tessdata -e
>> /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata eng.lstm
>>
>>
>> *Step5: I retrain tesseract *(Fine Tuning for ± a few characters:
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters)
>> by command:
>> lstmtraining --model_output output_model --continue_from eng.lstm
>> --traineddata output_basic/eng/eng.traineddata --old_traineddata /usr/share
>> tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile
>> eng.training_files.txt --debug_interval -1 --max_iterations 400
>>
>> - This is the format of my eng.training_files.txt:
>>   path/to/lstmf
>>
>> *I get an error like the following:*
>>
>> [image: Screenshot from 2018-10-19 21-49-00.png]
>> *Here is an example of my training images:*
>> [image: eng.centurygothic.exp0.png]
>>
>>
>>
>>
>>
>> *I am trying to retrain tesseract from real images (not from a text file via
>> tesstrain.sh)*
>>
>> Please let me know if you have any idea how to fix it.
>>
>>
>> Thank you in advance!
>>
>>

[tesseract-ocr] New JPN_VERT traineddata (for 4.0)

2018-10-15 Thread Seokbong Choi

Hello all,

Over the past two weeks, I trained JPN_VERT a little further.
I included heart symbols, which are commonly used in Japanese comic books.
Whenever I tried to OCR text containing them, the entire sentence came out
garbled, so I got around the issue by training on those symbols.
I also trained on more casual conversation; the sentences in the existing
training set were too formal.
I hope it is useful for Japanese comic book fans.
I cannot provide eval data, but I am sure that this works better whenever I
read Japanese comic books.

https://github.com/zodiac3539/pythontesseract/ 

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a445cfe4-f1de-453f-b9a5-ace89d36e67c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [tesseract-ocr] Generate box file for JPN_VERT?

2018-10-11 Thread Seokbong Choi

I found the answer myself.
Please see language-specific.sh and add your font to VERTICAL_FONTS:

VERTICAL_FONTS=(
    "TakaoExGothic"  # for jpn
    "TakaoExMincho"  # for jpn
    "WHAT_EVER_FONT_YOU_WANT_TO_ADD"
)

Then, when you execute

tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang jpn \
...
  --fontlist "WHAT_EVER_FONT_YOU_WANT_TO_ADD"

you will see the box file and the TIFF file, in which the characters are
vertically aligned.
Thanks!


On Sun, Oct 7, 2018 at 12:56 PM Seokbong Choi  wrote:

> Hello,
>
> I am a Japanese comic book fan. I recently came to learn about Tesseract,
> which is awesome.
> There are many challenges around Japanese: it has thousands of characters,
> so millions of iterations may be required to train.
>
> Another challenge is vertical text. Most comic books use vertical
> alignment for the text.
> I am trying to train Tesseract based on JPN_VERT (I have already
> successfully trained JPN).
> However, I am not able to find a way to generate a vertically aligned
> "box" file to train JPN_VERT further.
> Any ideas?
>
> Thanks in advance.
>
> Greg.
>

[tesseract-ocr] Generate box file for JPN_VERT?

2018-10-07 Thread Seokbong Choi

Hello,

I am a Japanese comic book fan. I recently came to learn about Tesseract,
which is awesome.
There are many challenges around Japanese: it has thousands of characters,
so millions of iterations may be required to train.

Another challenge is vertical text. Most comic books use vertical
alignment for the text.
I am trying to train Tesseract based on JPN_VERT (I have already
successfully trained JPN).
However, I am not able to find a way to generate a vertically aligned
"box" file to train JPN_VERT further.
Any ideas?

Thanks in advance.

Greg.

