[tesseract-ocr] Re: Missing Characters

2020-02-04 Thread Quan Nguyen
It looks like the Times New Roman font does not have glyphs for the 
characters you are interested in. You'll need to select a compatible font.
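
One way to check a candidate font before selecting it is to compare the
characters you need against the code points the font maps to glyphs. The
following is only a sketch: it assumes the third-party fontTools package is
installed, and the function names load_codepoints and missing_chars are
hypothetical helpers, not part of jTessBoxEditor or Tesseract. The two
characters in question are U+025B (ɛ) and U+0254 (ɔ).

```python
from typing import Iterable, List, Set


def load_codepoints(font_path: str) -> Set[int]:
    """Return the Unicode code points that a font file maps to glyphs.

    Assumption: the third-party fontTools package is available.
    """
    from fontTools.ttLib import TTFont  # imported lazily, optional dependency

    font = TTFont(font_path)
    try:
        # getBestCmap() returns a {code point: glyph name} dict, or None
        # when the font has no Unicode character map.
        cmap = font["cmap"].getBestCmap() or {}
        return set(cmap.keys())
    finally:
        font.close()


def missing_chars(text: str, supported: Iterable[int]) -> List[str]:
    """List the distinct non-space characters of text that lack a glyph."""
    supported_set = set(supported)
    missing: List[str] = []
    for ch in text:
        if ch.isspace() or ord(ch) in supported_set or ch in missing:
            continue
        missing.append(ch)
    return missing


# Hypothetical usage (the font path is a placeholder, not a real location):
# print(missing_chars("ɛ ɔ", load_codepoints("/path/to/font.ttf")))
```

A font that returns an empty list for the characters you need should display
them instead of empty boxes.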

Btw, that application is jTessBoxEditor, not VietOCR.

On Tuesday, February 4, 2020 at 11:02:47 AM UTC-6, Peyi Oyelo wrote:
>
> Hello,
>
>
> I am currently using VietOCR on Ubuntu 18 to try to create box files, but 
> I am unable to see some characters. I am working with Akan Twi, which uses 
> the general English script (with some characters missing) plus some 
> characters borrowed from the Greek script. The Greek characters are limited 
> to ɛ and ɔ. I am currently trying to fine-tune the existing default English 
> model to recognize these characters. However, VietOCR shows these 
> characters as empty boxes.
>
> How can I resolve this, please?
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5b12c2be-8acb-464b-8caf-e49ebc121fb1%40googlegroups.com.


[tesseract-ocr] tesseract Table output using tesseract.js

2020-02-04 Thread Alok Kumar
Hi, can anyone help me extract table formatting using tesseract.js in 
ASP.NET? Currently I am extracting the data object.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/4dd4b48e-f6b4-425a-9435-12fe35341e9e%40googlegroups.com.


Re: [tesseract-ocr] approches used for language detection on images ...

2020-02-04 Thread Albretch Mueller
On 2/1/20, Zdenko Podobny wrote:
> You did not provide any example Image

 OK, this one would do. In this PDF file there are images of varying
quality, with text embedded in various ways. This would be the
typical text I would be dealing with:

 https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf

 Another example of a text file I work with would be:

 https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes and Texts.pdf

 On that file pdftohtml produces one background file per page, but
when you stratify the content (simply using hash signatures) you
realize most files are of the same kind: blank background images, or
files containing a single line (for example, underlining a title) or a
frame around a blocked message. Then there are full-page blank images
with segments of Greek text, ...
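
 As a rough illustration of what I mean by stratifying with hash
signatures, here is a minimal sketch. It is not what pdftohtml does; the
function name group_by_hash and the choice of SHA-256 are my own
assumptions. It simply groups the files in a directory by the digest of
their bytes, so identical background images end up in the same group.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_by_hash(directory: str, pattern: str = "*") -> Dict[str, List[Path]]:
    """Group the files matching pattern in directory by SHA-256 digest."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return dict(groups)


# Hypothetical usage: list the groups of byte-identical images, largest
# first ("out" is a placeholder for the pdftohtml output directory).
# for digest, paths in sorted(group_by_hash("out", "*.png").items(),
#                             key=lambda item: -len(item[1])):
#     print(len(paths), digest[:12], paths[0].name)
```

 Large groups are the repeated blank or near-empty backgrounds; the small
groups are the files worth looking at.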

 I don't quite understand why poppler utils don't just underline a
word. Of course, you could easily write some code to figure out which
segments of text should be underlined, but understanding the obvious
tends to pay in the long run

> , neither what kind of tools you would
> like to use (open source or proprietary)

 The poppler pdftohtml tools:

 https://poppler.freedesktop.org/

 are pretty good, but there is always some extra tweaking you need.
Authors write texts in whichever way they want and this is a good
thing.

>4. I guess you will have problem with texts with mixed languages.

 Yes, I do, but a few heuristics included in metadata (extracted from
the names and/or headings of files) are of great help

 At the end of the day you can't fully automate such a process. You
will need a GUI and let "users" eyeball the data ...

>5. If  proprietary tools (and budget ;-) ) are not problem you can try
>to use  google vision [6] or Microsoft cognitive services [7] or Amazon
>Rekognition. Dataturks made some test for them [9]...

 I am trying to write up a set of bash scripts and code as part of a
pretty complete all-purpose library. Ideally the back-end text will be
formatted as ODT, since it is very easy to convert it to any other
format anyway.

 Do you know of such a library?

> [1] ... [9]

 Thank you,
 lbrtchx

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAFakBwj_b5uxQaP-%3DYv_1VP6%3DNG5B1OYjCOT2LLJAdKr%2BTX66A%40mail.gmail.com.


Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-04 Thread Shree Devi Kumar
>
> By the way, I added a create_ground_truth utility, which creates .gt.txt
> files as well as the associated .tif files for every specified font, to
> the package. I think it could be useful for anyone who does not have a
> ground truth collection yet.
>
Thanks, I tried it with the latest Tesseract code.

1. It gives an error when --fonts_dir is not specified; it works OK when it
is specified.

2. It is very slow (about 10 minutes): it started 20 text2image processes in
parallel for a training_text with 20 lines.

 create_ground_truth --fonts_dir ~/.fonts --fonts "Arial Unicode MS" corpora ground-truth
2020-02-04 11:01:19,135 INFO Processing .txt files
2020-02-04 11:01:19,137 INFO Generating .tif files
2020-02-04 11:10:24,855 INFO Done

It is much faster (1 second) after setting export OMP_THREAD_LIMIT=1:

 export OMP_THREAD_LIMIT=1
 create_ground_truth --fonts_dir ~/.fonts --fonts "Arial Unicode MS" corpora ground-truth
2020-02-04 11:12:18,713 INFO Processing .txt files
2020-02-04 11:12:18,715 INFO Generating .tif files
2020-02-04 11:12:19,398 INFO Done

You can update the documentation.
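
If the tool is launched from a Python script rather than a shell, the same
setting can be passed through the child process environment. This is only a
sketch: single_thread_env and run_create_ground_truth are hypothetical
wrapper names, and the command line is the one I used above. I have not
checked whether pytesstrain could set the variable internally.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def single_thread_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment (os.environ by default) with OMP_THREAD_LIMIT=1."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_create_ground_truth(args: List[str]) -> int:
    """Run create_ground_truth with OpenMP limited to one thread."""
    completed = subprocess.run(
        ["create_ground_truth", *args],
        env=single_thread_env(),
        check=False,
    )
    return completed.returncode


# Hypothetical usage, mirroring the shell command above:
# run_create_ground_truth([
#     "--fonts_dir", os.path.expanduser("~/.fonts"),
#     "--fonts", "Arial Unicode MS",
#     "corpora", "ground-truth",
# ])
```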



To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXzXLFbK8JKnNOK%3Di39p3UcGZJgJSmvzCbmUo_rnwhpRQ%40mail.gmail.com.


Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-04 Thread Shree Devi Kumar
Thanks, Wincent.
I will try out the tools added by you.

I found a Unicode version of the ISRI evaluation tools at
https://github.com/eddieantonio/ocreval which also handles the high-range
Unicode code points. See
https://github.com/Shreeshrii/tesstrain-modi/blob/master/reports/modi-eval-modiLayer_1.017_157724_324000/report_modiLayer_1.017_157724_324000-modi-ALL.txt
for an example.
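
To show why counting by code point matters for a script such as Modi
(U+11600..U+1165F), here is a minimal sketch of character and word error
rates based on edit distance. It is not the ocreval or pytesstrain
implementation; the function names are my own. Python 3 strings are
sequences of code points, so each Modi letter counts as one character.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference code points."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```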

Do you have a workflow for tesseract training using your tools? If so, I
would like to add/refer to it in Tesseract documentation.




On Tue, Feb 4, 2020 at 2:06 AM Wincent Balin wrote:

> Hi Shree,
>
> I am glad you find the package already useful :-) .
>
> As to your question: I did not use the ocr-evaluation tools, only the
> language_metrics utility. So, regrettably, I cannot help you here. But
> maybe you could try the same utility too?
>
> By the way, I added a create_ground_truth utility, which creates .gt.txt
> files as well as the associated .tif files for every specified font, to
> the package. I think it could be useful for anyone who does not have a
> ground truth collection yet.
>
> Kind regards,
>
> Wincent
>
>
> On Wednesday, January 29, 2020 at 06:47:01 UTC+1, shree wrote:
>>
>> Hi Wincent,
>>
>> Thank you for sharing these tools. I find create-dictdata to be very
>> useful.
>>
>> I wanted to know if you have modified any ocr-evaluation tools to handle
>> the high unicode range such as for Akkadian language.
>>
>> I was trying to test with the Modi script (range: U+11600..U+1165F,
>> 96 code points) and found that `ocrevalutf8 accuracy` does not work
>> well for it. Any suggestions ...
>>
>> Shree
>>
>> On Sunday, January 5, 2020 at 2:22:50 AM UTC+5:30, Wincent Balin wrote:
>>>
>>> Hi all,
>>>
>>> I would like to announce pytesstrain, a collection of Tesseract
>>> training tools, as well as the underlying library. The tools were created
>>> while training Tesseract to recognise Akkadian language (stay tuned for
>>> more posts!), to solve the problems that emerged in the process.
>>>
>>> You can install it with pip install pytesstrain.
>>>
>>> The PyPI page for the package is https://pypi.org/project/pytesstrain/.
>>> The GitHub project page is https://github.com/wincentbalin/pytesstrain.
>>>
>>> This package contains the tools to create dictionary data (wordlist, bi-
>>> and unigram lists, etc.), rewrap lines in text files to the specified
>>> length, collect most frequent recognition errors and dump them into
>>> unicharambigs file, and to perform recognition metrics (WER and CER). It
>>> also contains the run_test() function, which creates an image file from
>>> the given string and performs OCR on it afterwards, as well as its
>>> parallelised version, run_tests(), which can be used in future tools.
>>>
>>> Feedback, suggestions, etc would be most welcome.
>>>
>>> Yours truly,
>>>
>>> Wincent
>>>
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/3df5801b-7119-4451-9bb5-5fabc3e66bb1%40googlegroups.com.


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduU-Xyj4bU3-aw%3DjVP9%3DTvm5uPjLDuFesC4G%2B6nx6JM4Ug%40mail.gmail.com.