Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-11 Thread Fanatico
After some research on Korean, I found that the language does use Chinese 
characters, so it is correct to set Chinese as a sublanguage. The problem is 
that kor.training_text doesn't contain Chinese characters, so the training 
covers only Korean and ignores the Chinese. As a result, if I run tesseract on 
an image that has both Korean and Chinese, it recognizes some Korean 
characters as Chinese and some Chinese characters as Korean.

On Monday, 9 April 2018 05:15:57 UTC-3, shree wrote:
>
> Leftover from 3.04, my guess.
>
> On Mon 9 Apr, 2018, 12:52 PM Fanatico wrote:
>
>> It worked, thanks.
>>
>> Any reason for this chi_tra there?
>>
>>
>> On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
>>>
>>> Please remove the sublanguage line from the config file, and use 
>>> combine_tessdata to overwrite it.
>>>
>>> Right now it seems to be using chi_tra also.
>>>
>>> On Mon 9 Apr, 2018, 11:48 AM Fanatico,  wrote:
>>>
>>>> I used one traineddata that I created on removing the top layer from 
>>>> the kor.traineddata from "tessdata_best", after this I replaced this 
>>>> traineddata with the one from "tessdata_best" and got the same problem.
>>>>
>>>> Yes, it includes chi_tra as a sublanguage
>>>> tessedit_load_sublangs chi_tra
>>>>
>>>> lstm-unicharset only has Korean characters


-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/d20b1468-9b36-49a5-9b96-3a8ed2df3e71%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-09 Thread Fanatico
The config from kor already had it:

#Fixes https://github.com/tesseract-ocr/tesseract/issues/1009
preserve_interword_spaces 1




Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-09 Thread ShreeDevi Kumar
For Korean, please check whether adding the following lines to the config
improves your results further:

#Fixes https://github.com/tesseract-ocr/tesseract/issues/1009
preserve_interword_spaces 1


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Apr 9, 2018 at 1:45 PM, ShreeDevi Kumar wrote:

> Leftover from 3.04, my guess.
>
> On Mon 9 Apr, 2018, 12:52 PM Fanatico,  wrote:
>
>> It worked, thanks.
>>
>> Any reason for this chi_tra there?
>>
>>
>> On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
>>>
>>> Please remove the sublanguage line from the config file, and use
>>> combine_tessdata to overwrite it.
>>>
>>> Right now it seems to be using chi_tra also.
>>>
>>> On Mon 9 Apr, 2018, 11:48 AM Fanatico,  wrote:
>>>
>>>> I used one traineddata that I created on removing the top layer from
>>>> the kor.traineddata from "tessdata_best", after this I replaced this
>>>> traineddata with the one from "tessdata_best" and got the same problem.
>>>>
>>>> Yes, it includes chi_tra as a sublanguage
>>>> tessedit_load_sublangs chi_tra
>>>>
>>>> lstm-unicharset only has Korean characters



Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-09 Thread ShreeDevi Kumar
Leftover from 3.04, my guess.

On Mon 9 Apr, 2018, 12:52 PM Fanatico,  wrote:

> It worked, thanks.
>
> Any reason for this chi_tra there?
>
>
> On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
>>
>> Please remove the sublanguage line from the config file, and use
>> combine_tessdata to overwrite it.
>>
>> Right now it seems to be using chi_tra also.
>>
>> On Mon 9 Apr, 2018, 11:48 AM Fanatico,  wrote:
>>
>>> I used one traineddata that I created on removing the top layer from the
>>> kor.traineddata from "tessdata_best", after this I replaced this
>>> traineddata with the one from "tessdata_best" and got the same problem.
>>>
>>> Yes, it includes chi_tra as a sublanguage
>>> tessedit_load_sublangs chi_tra
>>>
>>> lstm-unicharset only has Korean characters
>>>


Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-09 Thread Fanatico
It worked, thanks.

Any reason for this chi_tra there?


On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
>
> Please remove the sublanguage line from the config file, and use 
> combine_tessdata to overwrite it.
>
> Right now it seems to be using chi_tra also.
>
> On Mon 9 Apr, 2018, 11:48 AM Fanatico wrote:
>
>> I used one traineddata that I created on removing the top layer from the 
>> kor.traineddata from "tessdata_best", after this I replaced this 
>> traineddata with the one from "tessdata_best" and got the same problem.
>>
>> Yes, it includes chi_tra as a sublanguage
>> tessedit_load_sublangs chi_tra
>>
>> lstm-unicharset only has Korean characters
>>


Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-09 Thread ShreeDevi Kumar
Please remove the sublanguage line from the config file, and use
combine_tessdata to overwrite it.

Right now it seems to be using chi_tra also.
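
A minimal sketch of the config edit, in Python. It assumes the config was
first unpacked with something like `combine_tessdata -u kor.traineddata kor.`
and that the edited file is written back with
`combine_tessdata -o kor.traineddata kor.config`. The helper name
`strip_sublangs` is hypothetical, and the sketch only handles the text of the
config file, not the traineddata itself.

```python
def strip_sublangs(config_text):
    """Return the config text without any tessedit_load_sublangs line.

    Assumption: each setting sits on its own line as "name value",
    as in the kor config quoted in this thread.
    """
    kept = []
    for line in config_text.splitlines():
        # Compare only the setting name so the value (chi_tra here) is irrelevant.
        if line.split()[:1] == ["tessedit_load_sublangs"]:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

For example, reading kor.config, passing its contents through
`strip_sublangs`, and writing the result back should leave the
preserve_interword_spaces setting and its comment untouched.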

On Mon 9 Apr, 2018, 11:48 AM Fanatico,  wrote:

> I used one traineddata that I created on removing the top layer from the
> kor.traineddata from "tessdata_best", after this I replaced this
> traineddata with the one from "tessdata_best" and got the same problem.
>
> Yes, it includes chi_tra as a sublanguage
> tessedit_load_sublangs chi_tra
>
> lstm-unicharset only has Korean characters
>


Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-09 Thread Fanatico
I first used a traineddata that I created by removing the top layer from the 
kor.traineddata in "tessdata_best". I then replaced it with the original 
kor.traineddata from "tessdata_best" and got the same problem.

Yes, it includes chi_tra as a sublanguage:
tessedit_load_sublangs chi_tra

The lstm-unicharset only has Korean characters.



Re: [tesseract-ocr] Tesseract 4.0 Korean detecting Chinese

2018-04-08 Thread ShreeDevi Kumar
Which traineddata are you using?

Use combine_tessdata and extract the config file to see if Chinese is
included as a sublanguage.

Also look at the lstm-unicharset to see if the Chinese characters are
included in it.
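
Both checks can be sketched in Python, assuming the components were unpacked
first (for example with `combine_tessdata -u kor.traineddata kor.`, which
should give files such as kor.config and kor.lstm-unicharset). The function
names are hypothetical. The unicharset parsing assumes the first line holds
the entry count and every following line starts with the character followed
by a space, and the '+' separator for multiple sublanguages is also an
assumption.

```python
def sublanguages(config_text):
    """List the languages named on tessedit_load_sublangs lines."""
    langs = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "tessedit_load_sublangs":
            # Assumption: several sublanguages would be joined with '+'.
            langs.extend(parts[1].split("+"))
    return langs


def is_han(ch):
    # CJK Unified Ideographs and Extension A only; rarer blocks are ignored.
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def is_hangul(ch):
    # Precomposed Hangul syllables.
    return 0xAC00 <= ord(ch) <= 0xD7A3


def script_counts(unicharset_text):
    """Count unicharset entries containing Han, Hangul, or neither."""
    counts = {"han": 0, "hangul": 0, "other": 0}
    # Skip the first line, assumed to be the number of entries.
    for line in unicharset_text.splitlines()[1:]:
        token = line.split(" ", 1)[0]
        if not token:
            continue
        if any(is_han(c) for c in token):
            counts["han"] += 1
        elif any(is_hangul(c) for c in token):
            counts["hangul"] += 1
        else:
            counts["other"] += 1
    return counts
```

If `sublanguages` returns a non-empty list while `script_counts` reports zero
Han entries, the Chinese output is probably coming from the sublanguage
rather than from the Korean LSTM model.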

On Mon 9 Apr, 2018, 11:09 AM Fanatico,  wrote:

> I'm running tesseract with the "-l kor" param, but it is detecting some
> Chinese characters. The image really does have 3 Chinese characters, but
> none of them is returned correctly (and I'm not expecting them to be), while
> other Korean characters are being recognized as Chinese
> characters.
>
> tesseract teste_kor.tif teste_kor -l kor --oem 3 --psm 6
>
> Any idea of how to fix it?
>
>
>
> 
>
>
> Result:
>
>
> 1 화
>
>
> 서 05)
>
>
> 수 마 0 뜨 \) 에 사 로 잡혀 눈 을 도 저
>
> 히 뜰 수가 없다.
>
>
> 힘 을 내 도 겨우 반 개 하는 것이 고
>
> 작 . 그 이상 움직일 수가 없었다.
>
> " 아 ‥…. 7
>
>
> 苗 朮 習 趾 葉 刁 估 舍 點 選 們 同 對 刀
>
> 려 소 리 를 낸다. 하지만 신 음 에 가
>
> 까운 목 소 리 만 홀 러 나 올 뿐이었다.
>
> “장로 Q 全 程 ::: 가 시 면 ‥.”
>