Well, I just saw the Arabic config file in langdata (uploaded on Aug 12th by
Ray), and I am not sure whether training will be possible with the existing
tools available to us ...

See
https://code.google.com/p/tesseract-ocr/source/browse/ara/ara.config?repo=langdata

It says:

# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).
tessedit_ocr_engine_mode        1
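
For reference, here is a minimal Python sketch of how that setting reads. It assumes the Tesseract 3.x OcrEngineMode numbering (0 = OEM_TESSERACT_ONLY, 1 = OEM_CUBE_ONLY, 2 = OEM_TESSERACT_CUBE_COMBINED, 3 = OEM_DEFAULT); please check the enum in your own source tree. The helper names parse_config_line and engine_mode_name are made up for illustration and are not part of Tesseract.

```python
# Assumed Tesseract 3.x OcrEngineMode numbering; verify against your headers.
OCR_ENGINE_MODES = {
    0: "OEM_TESSERACT_ONLY",
    1: "OEM_CUBE_ONLY",
    2: "OEM_TESSERACT_CUBE_COMBINED",
    3: "OEM_DEFAULT",
}


def parse_config_line(line):
    """Split a 'name value' config line; return None for blanks and comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    parts = stripped.split(None, 1)
    if len(parts) != 2:
        return None
    return parts[0], parts[1].strip()


def engine_mode_name(config_text):
    """Return the engine mode name set in the config text, or None if unset."""
    for line in config_text.splitlines():
        parsed = parse_config_line(line)
        if parsed and parsed[0] == "tessedit_ocr_engine_mode":
            return OCR_ENGINE_MODES.get(int(parsed[1]), "unknown")
    return None
```

Run on the ara.config lines quoted above, engine_mode_name should report OEM_CUBE_ONLY, i.e. the Cube engine rather than the regular Tesseract classifier.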

Other than that, in order to use jTessBoxEditor or the command-line tools for
training, you will need font_properties, wordlists, etc ...
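
As a rough illustration of the font_properties file: in the Tesseract 3.x training process, each line is a font name (without spaces) followed by five 0/1 flags in the order italic, bold, fixed, serif, fraktur. A minimal Python sketch follows; the helper name font_properties_line and the font name in the usage note are placeholders, not anything taken from langdata.

```python
def font_properties_line(font_name, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Build one font_properties line: '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    # The font name is a single whitespace-free token.
    if not font_name or any(ch.isspace() for ch in font_name):
        raise ValueError("font name must be non-empty and contain no whitespace")
    flags = (italic, bold, fixed, serif, fraktur)
    return font_name + " " + " ".join(str(int(bool(flag))) for flag in flags)


def write_font_properties(path, lines):
    """Write the given lines to a font_properties file, one font per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
```

For example, font_properties_line("arial") gives a line with all five flags set to 0. The font name should match the one used in your box/tif file names.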

----

Ray, is it possible to use the latest source from git to train Arabic?



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Nov 20, 2014 at 7:37 PM, iram akbar <[email protected]> wrote:

> It seems it's a known issue with Serak. I have created the "ara" folder with
> the same files as the "vie" folder in jTessBoxEditor, as you can see in the
> attachment. After that I set the box file path in jTessBoxEditor under
> "Tesseract executable" and "Training data" for "ara" as attached. When I click
> the "Run" button I get the attached error. I don't know what is going wrong.
> Question: am I giving the wrong file in the path for "Tesseract executable"
> and "Training data", i.e. the ara box file? Or is something else wrong?
> Note: I have put no data in the words_list, frequent_words, font_properties files.
>
>
> On 20 November 2014 17:32, ShreeDevi Kumar <[email protected]> wrote:
>
>> I have not used Serak, but the issues page there indicates problems with
>> RTL languages; see
>> https://code.google.com/p/serak-tesseract-trainer/issues/detail?id=6
>>
>> Why are you not using jTessBoxEditor's trainer or the command-line
>> programs? I think the binaries are bundled with JTess...
>>
>>
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Nov 20, 2014 at 4:26 PM, iram akbar <[email protected]> wrote:
>>
>>> Hello Shree,
>>>
>>> I am having an issue while training Arabic in Serak (for box file
>>> generation I am using jTessBoxEditor). I am doing some testing: I have
>>> assigned English letters to a single Arabic word and created the box
>>> file as attached (jtessbox file). After following the whole training process
>>> in Serak, I got the OCR result as attached. As you can see, the box
>>> file contains 4 letters, "A,B,C,D", so I was expecting the OCR result to be
>>> ABCD, but the result is BDBBAABBBBA as attached (serak result).
>>> Question: why am I getting that result? Is something wrong in how I made the
>>> box file in jTessBoxEditor, or in the training in Serak?
>>>
>>> On Monday, 10 November 2014 15:30:21 UTC+5, shree wrote:
>>>>
>>>> Look under the jtessboxeditor/samples/vie folder
>>>>
>>>> and create similar files for your language.
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Mon, Nov 10, 2014 at 1:10 PM, iram akbar <[email protected]> wrote:
>>>>
>>>>> Quan,
>>>>> I am able to generate some files with jTessBoxEditor, but I am having
>>>>> an issue: when I select "Train with existing box" or "Train from Scratch"
>>>>> under the Trainer tab, I get the attached message.
>>>>> Question: how can I generate the Arabic.font_properties,
>>>>> Arabic.frequent_word_list and Arabic.words_list files using
>>>>> jTessBoxEditor?
>>>>>
>>>>> On Friday, 7 November 2014 19:42:37 UTC+5, Quan Nguyen wrote:
>>>>>>
>>>>>> Look in the samples folder for a working example. You can start out from
>>>>>> a UTF-8 text file about two pages long, generate TIFF/Box from it, and
>>>>>> prepare the other necessary input files for training. You can train
>>>>>> entirely in jTessBoxEditor.
>>>>>>
>>>>>> On Thursday, November 6, 2014 6:19:53 AM UTC-6, iram akbar wrote:
>>>>>>>
>>>>>>> Thank you for your help, but my issue still exists. If I need to
>>>>>>> generate the TIFF of an image of text, I am unable to do so, as it
>>>>>>> only asks to load a text file, not an image file. Second, if I have a
>>>>>>> lot of documents, I need to copy and paste first and then generate the
>>>>>>> TIFF. Can anyone else help me with this?
>>>>>>> Question: how can I input the Arabic text image in jTessBoxEditor
>>>>>>> to generate a TIFF (as attached)?
>>>>>>>
>>>>>>> On Thursday, 6 November 2014 16:38:25 UTC+5, shree wrote:
>>>>>>>>
>>>>>>>> Click on the 'generate' box. With some Devanagari fonts I have
>>>>>>>> found that the text does not display but the tiff/box are generated.
>>>>>>>> Maybe it is the same for the Arabic font you are using. Give it a try.
>>>>>>>>
>>>>>>>> You can also try to copy and paste the text; sometimes that works.
>>>>>>>>
>>>>>>>>
>>>>>>>> ShreeDevi
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>
>>>>>>>>
>>>>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/d7396d3d-c4d1-4fcc-a58d-6cc02927989c%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>>
>
>
