Re: [tesseract-ocr] Detection on complex images

2017-10-19 Thread Dmitri Silaev
Oh, come on, please no more speechmaking.

Contest hosts know that there is trouble getting words out of the images,
and you know it. And you know that there's not much of a "full solution" left
to build once you've managed to get the OCR to work right. That's the point
of the problem. This way, you just ask people to solve the problem so that
your job gets fully done. You're going to get paid, people won't get anything.

That's. Not. Fair. Period.




On Wed, Oct 18, 2017 at 6:39 PM, Paolo Giannoccaro  wrote:

> Why not fair? Getting technical advice from any kind of forum is just
> ordinary work (think of Stack Overflow: is it unfair to find an idea or a
> piece of code there?). Developing a full solution is a different
> thing, and that is what I will try to do.
>
> thanks for your time.
>
> On Wednesday, October 18, 2017 at 7:38:02 PM UTC+2, Dmitri Silaev wrote:
>>
>> Wow, we are being taken advantage of. Smart move Paolo but not fair.
>> Heck, I almost started writing the answer.
>>
>>
>>
>>
>> On Tue, Oct 17, 2017 at 7:00 PM, Tom Morris  wrote:
>>
>>> I don't suppose this has anything to do with the Top Coder Mud Logger
>>> OCR contest, does it?
>>> https://community.topcoder.com/longcontest/?module=ViewProblemStatement=17004=57867
>>>
>>> How will our team divide its winnings?
>>>
>>> Tom

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAKzLxFPjkQMgk1DC_Zbaw7dsLmfuWTm3_K24J6rofm6VJ6Bp0Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Re: Tesseract ignores tessedit_char_whitelist parameter

2017-10-19 Thread Quan Nguyen
https://github.com/tesseract-ocr/tesseract/issues/751

Use current version 3.05.x, if possible.
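
If staying on 4.00 alpha is unavoidable, one possible stopgap (an assumption of this note, not something proposed in the thread or in the linked issue) is to filter the recognized text after the fact. The sketch below is plain Python and does not call Tesseract; `filter_to_whitelist` and its parameters are hypothetical names. It only discards characters, so unlike a real whitelist it cannot make the engine choose the best allowed candidate.

```python
import string


def filter_to_whitelist(text, whitelist=string.ascii_uppercase, keep=" \n"):
    """Drop every character that is neither in `whitelist` nor in `keep`.

    `keep` holds layout characters (space, newline) that should survive
    even though they are not part of the whitelist itself.
    """
    allowed = set(whitelist) | set(keep)
    return "".join(ch for ch in text if ch in allowed)


if __name__ == "__main__":
    # "R x C Eo e" is the OCR output reported in the thread.
    print(repr(filter_to_whitelist("R x C Eo e")))
```

Applied to the output reported in the thread, this keeps only `R`, `C`, `E` and the spaces between them; passing `keep=""` drops the spaces as well.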


On Thursday, October 19, 2017 at 9:19:08 AM UTC-5, Ľuboš Katrinec wrote:
>
> I used --print-parameters with this version and I could see the parameter 
> included in the list. Do you think it is not used even though it is listed? Is it the 
> same with tessedit_char_blacklist? Is there an alternative?
>
> Thanks and regards,
> Lubos
>
> On Saturday, October 14, 2017 at 5:43:16 PM UTC+2, shree wrote:
>>
>> whitelist parameter does not work with tesseract 4.0x
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Oct 14, 2017 at 8:25 PM, Dan9er  wrote:
>>
>>> -c goes at the very end of the command, and you can combine those two 
>>> arguments. Try this:
>>>
>>> > tesseract threshold_problem1.jpeg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ tessedit_char_blacklist=abcdefghijklmnopqrstuvwxyz
>>>
>>> On Friday, October 13, 2017 at 5:43:46 AM UTC-4, Ľuboš Katrinec wrote:

>>>> Hello,
>>>>
>>>> I'm trying to solve captcha images just for fun (or rather a challenge
>>>> ;-) ). I'm passing tessedit_char_whitelist and tessedit_char_blacklist
>>>> parameters but somehow they seem to be ignored. Perhaps I just miss
>>>> something.
>>>>
>>>> > tesseract -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ -c tessedit_char_blacklist=abcdefghijklmnopqrstuvwxyz threshold_problem1.jpeg stdout
>>>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>>>> R x C Eo e
>>>>
>>>> I'm using a windows version:
>>>>
>>>> > tesseract -v
>>>> tesseract 4.00.00alpha
>>>>  leptonica-1.74.1
>>>>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>>>>
>>>>
>>>> I'm doing it over a JPEG, could that be a problem?
>>>>
>>>> Thanks and regards,
>>>> Lubos



[tesseract-ocr] Re: [Urgent] Find out coordinates and bounding box of a word/phrase/paragraph

2017-10-19 Thread parv gupta
Can anyone tell me the position of this word?
lipsum(9)
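
One way to get word positions, assuming a Tesseract build that ships the `tsv` output config (3.05 or later, as far as I know), is to run something like `tesseract image.png stdout tsv` and read the `left`, `top`, `width` and `height` columns. The sketch below only parses such output; `word_boxes` is a hypothetical helper name and the sample rows are made up for illustration, not real OCR results.

```python
import csv
import io


def word_boxes(tsv_text):
    """Return (word, (x1, y1), (x2, y2)) for every word row in TSV output.

    (x1, y1) is the upper-left corner and (x2, y2) the bottom-right one,
    in image pixels. Rows with level 5 are taken to be word rows.
    """
    # QUOTE_NONE: recognized text may itself contain quote characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        word = (row.get("text") or "").strip()
        if row["level"] != "5" or not word:
            continue  # skip page/block/paragraph/line rows and empty words
        left, top = int(row["left"]), int(row["top"])
        right = left + int(row["width"])
        bottom = top + int(row["height"])
        boxes.append((word, (left, top), (right, bottom)))
    return boxes


# Made-up sample in the column layout assumed above.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tLorem\n"
    "5\t1\t1\t1\t1\t2\t109\t92\t54\t24\t95\tipsum\n"
)

if __name__ == "__main__":
    for word, upper_left, bottom_right in word_boxes(SAMPLE_TSV):
        print(word, upper_left, bottom_right)
```

To find one particular word, filter the returned list on the first tuple element.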


On Wednesday, January 6, 2010 at 11:07:33 PM UTC+5:30, jdevelop wrote:
>
> Hello, all!
>
> Can somebody please advise - is it possible to get the coordinates and
> bounding boxes of words, recognized by tesseract? If so - can somebody
> please point me to where I should learn more about it?
>
> Ideally, the output (or API callback) should contain the word itself,
> the [X,Y] of upper-left corner and [X,Y] of bottom-right one.
>
> Thank you all in advance!
>
> --
> Eugene
>
>



[tesseract-ocr] Re: russian-old?

2017-10-19 Thread Yury
Thanks again, Shree, 
also, for that comment on the PR page.

Now, the PR looks to me like circa-2015 language data with 
modifications.
Wouldn't the OCR quality regress compared with the 4.* data, or did the 
langdata source remain the same?

I think I'll just have to file an issue. The material in the PR looks too 
intimidating for me to try to install by myself. :) Still trying 
to make sense of that 'plusminus' material!



Re: [tesseract-ocr] Re: russian-old?

2017-10-19 Thread ShreeDevi Kumar
Well, if that PR is the right one, you could add a reminder for Ray Smith
(chief developer) to include it.

You can also try the plus-minus type of fine-tuning - please see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Oct 19, 2017 at 7:51 PM, Yury  wrote:

> Shree, thank you, and yes, accented vowels would be fine,
> but right now I was talking of ' іѣѳѵ' set (U+0406,0462,0472,0474
> uppercase and U+0456,0463,0473,0475 lowercase).
>
> The 4.0.0.0 version from git definitely refuses to recognise those, and
> AFAICT there is no mention of the codes in the source files.
>
> I'm a complete noob at git, so how would I know when the PR you mentioned
> becomes available in git as a download?
>
> -Yury


[tesseract-ocr] Re: russian-old?

2017-10-19 Thread Yury
Shree, thank you, and yes, accented vowels would be fine,
but right now I was talking of ' іѣѳѵ' set (U+0406,0462,0472,0474 uppercase 
and U+0456,0463,0473,0475 lowercase).

The 4.0.0.0 version from git definitely refuses to recognise those, and 
AFAICT there is no mention of the codes in the source files.

I'm a complete noob at git, so how would I know when the PR you mentioned 
becomes available in git as a download?

-Yury
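
For reference, the code points listed above can be double-checked with Python's standard `unicodedata` module. This is only a small sketch and has nothing to do with Tesseract itself; `OLD_LOWER` and `describe` are hypothetical names introduced here.

```python
import unicodedata

# The four lowercase letters from the message: і ѣ ѳ ѵ
OLD_LOWER = "\u0456\u0463\u0473\u0475"


def describe(chars):
    """Return (code point, character, Unicode name) for each character."""
    return [(f"U+{ord(c):04X}", c, unicodedata.name(c)) for c in chars]


if __name__ == "__main__":
    # str.upper() maps each letter to its uppercase counterpart.
    for code, char, name in describe(OLD_LOWER.upper() + OLD_LOWER):
        print(code, char, name)
```

The same list can be used to grep a model's character set for these letters once the training files are at hand.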



Re: [tesseract-ocr] Re: Tesseract ignores tessedit_char_whitelist parameter

2017-10-19 Thread Ľuboš Katrinec
I used --print-parameters with this version and I could see the parameter 
included in the list. Do you think it is not used even though it is listed? Is it the 
same with tessedit_char_blacklist? Is there an alternative?

Thanks and regards,
Lubos

On Saturday, October 14, 2017 at 5:43:16 PM UTC+2, shree wrote:
>
> whitelist parameter does not work with tesseract 4.0x
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sat, Oct 14, 2017 at 8:25 PM, Dan9er wrote:
>
>> -c goes at the very end of the command, and you can combine those two 
>> arguments. Try this:
>>
>> > tesseract threshold_problem1.jpeg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ tessedit_char_blacklist=abcdefghijklmnopqrstuvwxyz
>>
>> On Friday, October 13, 2017 at 5:43:46 AM UTC-4, Ľuboš Katrinec wrote:
>>>
>>> Hello,
>>>
>>> I'm trying to solve captcha images just for fun (or rather a challenge 
>>> ;-) ). I'm passing tessedit_char_whitelist and tessedit_char_blacklist 
>>> parameters but somehow they seem to be ignored. Perhaps I just miss 
>>> something.
>>>
>>> > tesseract -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ -c tessedit_char_blacklist=abcdefghijklmnopqrstuvwxyz threshold_problem1.jpeg stdout
>>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>>> R x C Eo e
>>>
>>> I'm using a windows version:
>>>
>>> > tesseract -v
>>> tesseract 4.00.00alpha
>>>  leptonica-1.74.1
>>>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : 
>>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>>>
>>>
>>> I'm doing it over a JPEG, could that be a problem?
>>>
>>> Thanks and regards,
>>> Lubos
>>>


[tesseract-ocr] Re: Tesseract ignores tessedit_char_whitelist parameter

2017-10-19 Thread Ľuboš Katrinec
I already tried this; it didn't help at all.

On Saturday, October 14, 2017 at 4:55:42 PM UTC+2, Dan9er wrote:
>
> -c goes at the very end of the command, and you can combine those two 
> arguments. Try this:
>
> > tesseract threshold_problem1.jpeg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ tessedit_char_blacklist=abcdefghijklmnopqrstuvwxyz
>
> On Friday, October 13, 2017 at 5:43:46 AM UTC-4, Ľuboš Katrinec wrote:
>>
>> Hello,
>>
>> I'm trying to solve captcha images just for fun (or rather a challenge 
>> ;-) ). I'm passing tessedit_char_whitelist and tessedit_char_blacklist 
>> parameters but somehow they seem to be ignored. Perhaps I just miss 
>> something.
>>
>> > tesseract -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ -c tessedit_char_blacklist=abcdefghijklmnopqrstuvwxyz threshold_problem1.jpeg stdout
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> R x C Eo e
>>
>> I'm using a windows version:
>>
>> > tesseract -v
>> tesseract 4.00.00alpha
>>  leptonica-1.74.1
>>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : 
>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>>
>>
>> I'm doing it over a JPEG, could that be a problem?
>>
>> Thanks and regards,
>> Lubos
>>
>



Re: [tesseract-ocr] russian-old?

2017-10-19 Thread ShreeDevi Kumar
There is an existing PR at https://github.com/tesseract-ocr/langdata/pull/12

Does that meet what you are looking for?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Oct 18, 2017 at 10:45 PM, Yury Tarasievich <
yury.tarasiev...@gmail.com> wrote:

> Hi guys,
>
> I may be wrong but the Russian tessdata does not provide for recognising
> old orthography and Church Slavonic glyphs? You know, i with dot, theta,
> yat, etc.
>
> Would it be very hard to add the 'rus_old' variant? Or, is it too
> difficult to roll-your-own the changed rus.tessdata on the local system?
>
> -Yury