Hmmm. I had a quick look, but the results don't seem to be too helpful. Could you be a little more precise as to what I should be looking for? Thx

On Friday, June 4, 2021 at 4:29:29 PM UTC+1 zdenop wrote:

> search issue tracker and forum for "table"
>
> Zdenko
>
>
> On Fri, 4 Jun 2021 at 17:13, Jeremy Young <[email protected]> wrote:
>
>> It looks like there's a bug of some sort here. Attached is another image.
>> When I OCR it with
>>
>> "tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1"
>>
>> the hocr for "Party A" looks like this:
>>
>> <span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683 384; x_wconf 84'>
>>  <span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376; x_conf 98.908447'>P</span>
>>  <span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376; x_conf 99.026512'>a</span>
>>  <span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376; x_conf 98.80246'>r</span>
>>  <span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384; x_conf 98.968414'>t</span>
>>  <span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384; x_conf 98.820137'>y</span>
>>  <span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376; x_conf 97.777733'>A</span>
>> </span>
>>
>> i.e. the x-coordinate of the "y" overlaps the prior and following
>> characters.
>>
>> On Thursday, June 3, 2021 at 6:45:51 PM UTC+1 Jeremy Young wrote:
>>
>>> Hi
>>>
>>> The attached test image (which could be in a batch of a million, so I
>>> need a generalised fix) is being processed in Tess4J but I also get the
>>> same issue with the Windows build from Mannheim version:
>>>
>>> C:\temp>tesseract --version
>>> tesseract v5.0.0-alpha.20210506
>>>  leptonica-1.78.0
>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>  Found AVX2
>>>  Found AVX
>>>  Found FMA
>>>  Found SSE4.1
>>>  Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
>>>  Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
>>>
>>> When I execute "tesseract test1.png test1" the output contains at line
>>> 21 "PartyA | PartyB | Valuation". "Party A" should be two words as should
>>> "Party B".
>>>
>>> When I output the hocr using Tess4J I can see that the gaps between the
>>> characters are 4,6,2,2,12
>>> ie the gap between the "y" and the "A" is much bigger than the others.
>>>
>>> <span class='ocrx_word' id='word_1_15' title='bbox 1551 349 1681 386; x_wconf 91; x_fsize 9'>
>>>  <span class='ocrx_cinfo' title='x_bboxes 1551 349 1569 378; x_conf 99.031525'>P</span>
>>>  <span class='ocrx_cinfo' title='x_bboxes 1573 356 1590 378; x_conf 98.951897'>a</span>
>>>  <span class='ocrx_cinfo' title='x_bboxes 1596 356 1608 378; x_conf 98.996353'>r</span>
>>>  <span class='ocrx_cinfo' title='x_bboxes 1610 351 1623 378; x_conf 99.038818'>t</span>
>>>  <span class='ocrx_cinfo' title='x_bboxes 1625 357 1644 386; x_conf 98.881676'>y</span>
>>>  <span class='ocrx_cinfo' title='x_bboxes 1656 349 1681 378; x_conf 98.736168'>A</span>
>>> </span>
>>>
>>> Any suggestions what I could do?
>>>
>>> Thx
>>>
>>>
>>>
>>> LIKEZERO Limited is a limited company registered in Scotland with
>>> registered number SC651418.
>>> Our registered office is at Quartermile One, 15
>>> Lauriston Place, Edinburgh, United Kingdom, EH3 9EP
>>>
>>> This email is intended solely for the addressee and may contain
>>> confidential information. If you have received this message in error,
>>> please immediately and permanently delete it. Do not use, copy or disclose
>>> the information contained in this message or in any attachment.
>>>
>>> This email is not in any way intended to create a binding contract.
>>>
>>> We may monitor and record emails for security reasons and for monitoring
>>> compliance with internal policies.
>>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/28ea517b-ff78-483c-98ed-67db49a7d7b5n%40googlegroups.com.
>>
>

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f7c3e0ba-3693-4315-885d-e6bd3a5ae0a4n%40googlegroups.com.
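
For anyone hitting the same merged-word problem, here is a minimal post-processing sketch in Python. It is not a Tesseract feature or a fix for the underlying issue; it only illustrates the gap reasoning from the first message in this thread. It reads the x_bboxes of the ocrx_cinfo spans (copied from the word_1_15 sample above), computes the horizontal gap between neighbouring characters, and inserts a space wherever a gap is much wider than the median. The function names and the factor of 2.0 are assumptions made up for illustration and would need tuning on real data.

```python
import re
from statistics import median

# Character boxes copied from the first hOCR sample in this thread (word_1_15).
HOCR = """
<span class='ocrx_cinfo' title='x_bboxes 1551 349 1569 378; x_conf 99.031525'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 1573 356 1590 378; x_conf 98.951897'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 1596 356 1608 378; x_conf 98.996353'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 1610 351 1623 378; x_conf 99.038818'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 1625 357 1644 386; x_conf 98.881676'>y</span>
<span class='ocrx_cinfo' title='x_bboxes 1656 349 1681 378; x_conf 98.736168'>A</span>
"""

# Matches one character span and captures x0, y0, x1, y1 and the character.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+);[^']*'>(.*?)</span>"
)


def char_boxes(hocr):
    """Return (char, x0, x1) for every ocrx_cinfo span, in document order."""
    return [(m.group(5), int(m.group(1)), int(m.group(3)))
            for m in CINFO.finditer(hocr)]


def gaps(boxes):
    """Horizontal distance from each box's right edge to the next box's left edge."""
    return [nxt[1] - cur[2] for cur, nxt in zip(boxes, boxes[1:])]


def split_on_wide_gaps(boxes, factor=2.0):
    """Hypothetical heuristic: add a space where a gap exceeds factor * median gap."""
    if not boxes:
        return ""
    g = gaps(boxes)
    if not g:
        return boxes[0][0]
    # Clamp the median to at least 1 pixel so overlapping boxes
    # (negative gaps) cannot produce a zero or negative threshold.
    limit = factor * max(median(g), 1)
    out = [boxes[0][0]]
    for (ch, _, _), gap in zip(boxes[1:], g):
        if gap > limit:
            out.append(" ")
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    boxes = char_boxes(HOCR)
    print(gaps(boxes))
    print(split_on_wide_gaps(boxes))
```

On this sample the gaps come out as 4, 6, 2, 2, 12, matching the figures quoted in the first message, and the word is split before the "A". The heuristic relies on the character boxes being trustworthy: for the second sample (word_1_7), where the boxes overlap, the gaps are 4, 5, 2, -8, -4 and a gap-based split of this kind would not be reliable.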

