[tesseract-ocr] Creating a new language pack for Javanese Script

2018-04-22 Thread Christopher Imantaka Halim
Hi,

I want to develop an OCR for Javanese Script / Aksara.
https://en.wikipedia.org/wiki/Javanese_script

I plan to use Tesseract 4.0.
I've read the wiki but I'm still confused.

What do I need to prepare to start a bare-minimum training process (for 
Tesseract 4.0)?
In another thread someone said that training from image files is not 
supported yet.
I also found out that box file/tiff pairs are not supported either.
(I did try making one box file using this online 
tool: https://pp19dd.com/tesseract-ocr-chopper/)

Is there an example of the training "inputs" somewhere in the GitHub 
projects?

Sorry if this is a stupid question, I'm a newbie. :)

Thanks in advance

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/96a694b6-2ab8-4114-9788-483adee32802%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Unsure why tesseract isn't returning the correct text

2018-04-22 Thread ShreeDevi Kumar
Yes, please use the latest code from github master branch for building.
That way you will have all the bug fixes and updates.
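
For the near-misses that remain in the 4.0 output quoted below (SHRRPEDO, PORYGON-I, RAZELF), one option outside Tesseract is to snap each OCR token to the closest entry in the known word list. This is only a sketch in Python using the standard difflib module. The function name `snap_to_wordlist`, the 0.6 cutoff, and the handling of the "-M" suffix are assumptions made for illustration and are not part of Tesseract.

```python
import difflib

# Word list copied from the eng.user-words file quoted in this thread.
KNOWN_WORDS = ["BLAZIKEN", "RAPIDASH", "VICTREEBEL", "SHARPEDO", "PORYGON-Z", "AZELF"]


def snap_to_wordlist(token, words=KNOWN_WORDS, suffix="-M", cutoff=0.6):
    """Return token with its stem replaced by the closest known word.

    Hypothetical helper: the suffix handling and the cutoff are assumptions.
    Tokens with no sufficiently close match are returned unchanged.
    """
    stem, tail = token, ""
    if suffix and token.endswith(suffix):
        # Split off the trailing "-M" so only the name itself is compared.
        stem, tail = token[: -len(suffix)], suffix
    matches = difflib.get_close_matches(stem.upper(), words, n=1, cutoff=cutoff)
    return (matches[0] if matches else stem) + tail


if __name__ == "__main__":
    # OCR line as reported for tesseract 4.0.0-beta.1 in this thread.
    ocr_line = "BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M  RAZELF-M"
    print(" ".join(snap_to_wordlist(t) for t in ocr_line.split()))
```

This only helps when the set of expected words is small and fixed, as it is here. With a cutoff that is too low, a genuinely new word would be silently "corrected" into a known one.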

On Sun 22 Apr, 2018, 2:42 AM 'DR' via tesseract-ocr, <
tesseract-ocr@googlegroups.com> wrote:

> I double checked, there seems to be a 4.0.0-beta.1 tag. I assume you
> installed that using git?
>
>
> On Saturday, April 21, 2018 at 2:40:20 PM UTC-6, zdenop wrote:
>>
>> Really? Did you check it before writing to forum?
>>
>> Zdenko
>>
>> 2018-04-21 22:25 GMT+02:00 'DR' via tesseract-ocr <
>> tesser...@googlegroups.com>:
>>
>>> Where can I find tesseract 4 beta? The github repo goes up to 4 alpha.
>>>
>>> On Saturday, April 21, 2018 at 2:21:49 PM UTC-6, zdenop wrote:

>>>> Time for upgrade?
>>>>
>>>> Zdenko
>>>>
>>>> 2018-04-21 22:14 GMT+02:00 'DR' via tesseract-ocr <
>>>> tesser...@googlegroups.com>:

> I'm using:
>
> tesseract 3.04.01
>  leptonica-1.73
>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
>
>
> On Saturday, April 21, 2018 at 2:48:15 AM UTC-6, shree wrote:
>>
>>
>> BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M  RAZELF-M
>>
>> with
>>
>>  tesseract -v
>> tesseract 4.0.0-beta.1-133-g5435c
>>  leptonica-1.76.0
>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 :
>> zlib 1.2.8 : libopenjp2 2.3.0
>>  Found AVX
>>  Found SSE
>>
>> tesseract names.png - --tessdata-dir ./tessdata_best
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> Estimating resolution as 547
>> BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M  RAZELF-M
>>
>>
>> Which version of tesseract are you using?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Apr 21, 2018 at 6:32 AM, 'DR' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> I have this image I want to turn into text:
>>>
>>>
>>> 
>>> To clean it up, I've used Fred's textcleaner script (
>>> http://www.fmwconcepts.com/imagemagick/textcleaner/index.php) and
>>> ran
>>>
>>> ./textcleaner -i 2 names.png result.png

>>>
>>> on the image, the result is now:
>>>
>>>
>>> 
>>> It looks a lot cleaner, so now I use tesseract to turn it into text:
>>>
>>> tesseract result.png stdout -psm 7 -l eng --user-words /path/to/eng.user-words --user-patterns /path/to/eng.user-patterns
>>>
>>>
>>> with the following files,  eng.user-words:
>>>
>>> BLAZIKEN
>>> RAPIDASH
>>> VICTREEBEL
>>> SHARPEDO
>>> PORYGON-Z
>>> AZELF
>>>
>>>
>>> eng.user-patterns:
>>>
>>> -M
>>>
>>>
>>> and /path/to/configs/bazaar:
>>>
>>> load_system_dawg F
>>> load_freq_dawg   F
>>> user_words_suffix user-words
>>> user_patterns_suffix user-patterns
>>>
>>>
>>> Yet my output is:
>>>
>>> Bl*H*ZIKEN-M R*H*PID*H*SH-M V*lE*TREEBEl-M SH*H*RPE*IIIJ*-M P*U*RY
>>> *Efl*N-Z-M *H*ZELF-M
>>>
>>>
>>> Since case isn't an issue for me, the only problems are "A" showing
>>> up as "H", "C" showing up as "LE", "DO" showing up as "IIIJ", and "GO"
>>> showing up as "Efl" (with "fl" being one character).
>>>
>>> I'm not sure whether the image can be made any clearer or whether I'm
>>> doing something wrong with tesseract. Any help is appreciated.
>>>

Re: [tesseract-ocr] "jav" language -- is it Javanese Script or Latin-based text?

2018-04-22 Thread ShreeDevi Kumar
Seems to be in Latin script

see
https://github.com/tesseract-ocr/langdata/blob/master/jav/jav.training_text
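
To check this kind of question on a local copy of a training text, one can count how many characters fall in the Unicode Javanese block (U+A980 to U+A9DF) versus Latin letters. A minimal Python sketch follows. The function name `script_counts` and the file name in the usage comment are hypothetical, and the Latin test assumes that the Unicode character name starts with "LATIN".

```python
import unicodedata

# Unicode block "Javanese" covers U+A980..U+A9DF (range end is exclusive).
JAVANESE_BLOCK = range(0xA980, 0xA9E0)


def script_counts(text):
    """Count Javanese-block characters and Latin letters in text.

    Whitespace is ignored; everything else is counted as "other".
    """
    counts = {"javanese": 0, "latin": 0, "other": 0}
    for ch in text:
        if ch.isspace():
            continue
        if ord(ch) in JAVANESE_BLOCK:
            counts["javanese"] += 1
        elif ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage on a downloaded copy of the file linked above:
    #   with open("jav.training_text", encoding="utf-8") as f:
    #       print(script_counts(f.read()))
    print(script_counts("aksara \ua9b2\ua98f\ua9c0\ua9b1\ua9ab"))
```

A text that is overwhelmingly "latin" with no "javanese" characters would be consistent with the data being Javanese written in Latin script.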

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Apr 22, 2018 at 2:58 PM, Christopher Imantaka Halim <
topher.halim...@gmail.com> wrote:

> Hi everyone,
>
> I'm new to Tesseract OCR and want to develop an OCR for Javanese Script /
> Aksara.
>
> I noticed that Tesseract 4.0 already has a "jav" language package:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>
> My question: is it for Javanese Script or for Javanese in Latin text?
>
> https://en.wikipedia.org/wiki/Javanese_script
>
> Thanks in advance
>


[tesseract-ocr] "jav" language -- is it Javanese Script or Latin-based text?

2018-04-22 Thread Christopher Imantaka Halim
Hi everyone,

I'm new to Tesseract OCR and want to develop an OCR for Javanese Script / 
Aksara.

I noticed that Tesseract 4.0 already has a "jav" language package:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

My question: is it for Javanese Script or for Javanese in Latin text?

https://en.wikipedia.org/wiki/Javanese_script

Thanks in advance
