Re: [tesseract-ocr] Are character bboxes trustworthy?

2020-07-25 Thread Zdenko Podobny
As I mentioned, if you need good bounding boxes you have to use the legacy
engine.
There are several issues and comments explaining why accurate bounding boxes
are a problem, e.g.
https://github.com/tesseract-ocr/tesseract/issues/2825#issuecomment-579220987


Zdenko



-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wX558wi_YhC2ysf76OrkkaBPoQwGey05fR7H47NrTh7Q%40mail.gmail.com.


Re: [tesseract-ocr] Are character bboxes trustworthy?

2020-07-24 Thread 'robinw...@googlemail.com' via tesseract-ocr
> Do you use lstm or legacy engine?  

lstm.

I can find a couple of Noah Metzger patches:

https://github.com/tesseract-ocr/tesseract/commit/c350077b96077fa50fefe97fbaed04014407f0f1
 
and 
https://github.com/tesseract-ocr/tesseract/pull/2576

etc., but they've all been merged into master. As far as I can tell from his
GitHub, all his patches have been pulled in.

I'm using master.

Crap bounding boxes really knock the effectiveness of Tesseract as a 
library :(

Thanks.



Re: [tesseract-ocr] Are character bboxes trustworthy?

2020-07-24 Thread Zdenko Podobny
Do you use lstm or legacy engine?

If lstm: search the issue tracker, PRs and (maybe) the forum for bounding box
problems (and Noah Metzger's patches).

There are rumours that if you need really good bounding boxes you have to
use the latest 3.05 release, because changes in the 4.x version (and later)
also affected legacy engine bounding box accuracy (compared to version 3).
But I never saw a comparison test (especially on a high volume of images).
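
For anyone who wants to run such a comparison, here is a minimal sketch of how two sets of boxes could be scored against each other. It is only an illustration: `Box`, `iou` and `mean_iou` are hypothetical names, not part of Tesseract, and pairing boxes by index assumes both runs returned the same symbols in the same order, which would need checking on real output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned box in image pixels, in the order BoundingBox() fills them:
// left, top, right, bottom. (Hypothetical helper type, not a Tesseract type.)
struct Box { int x0, y0, x1, y1; };

// Area of a box; inverted or empty boxes count as zero.
static long area(const Box &b) {
    return (long)std::max(0, b.x1 - b.x0) * std::max(0, b.y1 - b.y0);
}

// Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones.
double iou(const Box &a, const Box &b) {
    Box i = { std::max(a.x0, b.x0), std::max(a.y0, b.y0),
              std::min(a.x1, b.x1), std::min(a.y1, b.y1) };
    long inter = area(i);
    long uni = area(a) + area(b) - inter;
    return uni > 0 ? (double)inter / uni : 0.0;
}

// Mean IoU over boxes paired by index.
// Assumption: both runs produced the same symbols in the same order.
double mean_iou(const std::vector<Box> &a, const std::vector<Box> &b) {
    assert(a.size() == b.size());
    if (a.empty()) return 0.0;
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k)
        sum += iou(a[k], b[k]);
    return sum / a.size();
}
```

Feeding it the symbol boxes from a 3.x run and a 4.x run of the same page (or from either run and hand-made ground truth) gives one number per page, which makes a test over many images easier to summarise.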

Zdenko





[tesseract-ocr] Are character bboxes trustworthy?

2020-07-24 Thread 'Robin Watts' via tesseract-ocr
Hi all,

I'm using tesseract as a library, and broadly it seems to be working well. 
I am having some very strange problems with the character boxes I get back 
from the iterator though.

The attached image is a PNG made from the 8bpp greyscale image that I feed
it, overlaid with boxes to show all the 'b' characters I get back.

Only one of the 4 'b' characters I get appears to have the box in the right 
place.

The code I'm using to extract the data is:

    tesseract::ResultIterator *res_it = api->GetIterator();
    while (!res_it->Empty(tesseract::RIL_BLOCK))
    {
        if (res_it->Empty(tesseract::RIL_WORD))
        {
            res_it->Next(tesseract::RIL_WORD);
            continue;
        }

        res_it->BoundingBox(tesseract::RIL_TEXTLINE,
                            line_bbox, line_bbox+1,
                            line_bbox+2, line_bbox+3);
        res_it->BoundingBox(tesseract::RIL_WORD,
                            word_bbox, word_bbox+1,
                            word_bbox+2, word_bbox+3);
        // The list archive stripped the '&' arguments; the names below are
        // reconstructed from the parameter order of WordFontAttributes().
        font_name = res_it->WordFontAttributes(&bold,
                                               &italic,
                                               &underlined,
                                               &monospace,
                                               &serif,
                                               &smallcaps,
                                               &pointsize,
                                               &font_id);
        do
        {
            const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
            if (graph && graph[0] != 0)
            {
                int unicode;
                res_it->BoundingBox(tesseract::RIL_SYMBOL,
                                    char_bbox, char_bbox+1,
                                    char_bbox+2, char_bbox+3);
                fz_chartorune(&unicode, graph);
                callback(ctx, arg, unicode, font_name, line_bbox, word_bbox, char_bbox,
                         pointsize);
            }
            delete[] graph; // GetUTF8Text() returns a buffer the caller must free
            res_it->Next(tesseract::RIL_SYMBOL);
        }
        while (!res_it->Empty(tesseract::RIL_BLOCK) &&
               !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
    }

The characters are coming back correctly, and *most* are in the correct 
position. Just a few are shifted.

Is this to be expected? Am I doing something stupid?

(Even being told "It's reliably correct for me" would be helpful at this 
point.)
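
One cheap way to find the shifted boxes without eyeballing an overlay is sketched below. It rests on an assumption this thread does not confirm: that a correct symbol box lies inside its word box, give or take a pixel. `Glyph`, `contains` and `suspicious_glyphs` are hypothetical names; the arrays use the same left, top, right, bottom layout that the code above passes to BoundingBox().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One symbol as delivered to the callback above (hypothetical record type).
// Boxes are {left, top, right, bottom} in image pixels.
struct Glyph {
    int unicode;
    int word_bbox[4];
    int char_bbox[4];
};

// True if inner lies inside outer, allowing `slack` pixels of overshoot per side.
bool contains(const int outer[4], const int inner[4], int slack) {
    return inner[0] >= outer[0] - slack && inner[1] >= outer[1] - slack &&
           inner[2] <= outer[2] + slack && inner[3] <= outer[3] + slack;
}

// Indices of glyphs whose box is empty/inverted or escapes its word box.
// Assumption: a correct symbol box stays within its word box.
std::vector<std::size_t> suspicious_glyphs(const std::vector<Glyph> &glyphs,
                                           int slack = 1) {
    assert(slack >= 0);
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        const int *c = glyphs[i].char_bbox;
        bool degenerate = c[2] <= c[0] || c[3] <= c[1];
        if (degenerate || !contains(glyphs[i].word_bbox, c, slack))
            bad.push_back(i);
    }
    return bad;
}
```

A box that is shifted but still inside its word would pass this check, so it can only confirm a problem, not rule one out.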

Thanks,

Robin
