Re: [tesseract-ocr] Re: Page layout analysis module

2016-03-08 Thread Age Bosma
Hi Zdenko,

Man, would I have liked getting that hint 5 years ago... :-/

Best regards,

Age Bosma


On Tuesday, 8 March 2016 16:56:36 UTC+1, zdenop wrote:
>
> IMO it is - in hocr (xml) output or tsv (in master branch a.k.a 3.05)
>
> Zdenko
>
> On Tue, Mar 8, 2016 at 3:14 PM, Age Bosma wrote:
>
>> Hi Teng,
>>
>> The options I mention aren't available in tesseract. I listed them as 
>> suggestions for extending tesseract. They haven't been implemented as far 
>> as I know.
>>
>> Best regards,
>>
>> Age
>>
>>
>>
>> On Monday, 7 March 2016 09:56:40 UTC+1, Teng Long wrote:
>>>
>>>
>>> Hi Age, I'm a newbie in OCR.
>>> You mentioned 3 options for using tesseract;
>>> could you please tell me how to use these 3 options?
>>>
>>> Any command is appreciated.
>>> Like:
>>> tesseract sample2.jpg output -l eng -psm 3
>>>
>>> Thank you !
>>>
>>> On Monday, June 20, 2011 at 8:19:03 PM UTC+8, Age Bosma wrote:

>>>> Thank you for your reply.
>>>>
>>>> Nice to learn that it is possible programming-wise. I should, however,
>>>> have been clearer that I was referring to command-line functionality.
>>>>
>>>> Would it be an idea to extend the tesseract command-line tools so that
>>>> the output can contain block dimensions?
>>>>
>>>> So one option to output just the text (current behaviour):
>>>>
>>>> Some text
>>>> And yet again some other text
>>>>
>>>>
>>>> A second option to output the text marked with its block dimensions:
>>>>
>>>> [block:10,20,250,20]
>>>> Some text
>>>> [block:350,400,600,410]
>>>> And yet again some other text
>>>>
>>>>
>>>> A third option to output just all blocks:
>>>>
>>>> [block:10,20,250,20]
>>>> [block:350,400,600,410]
>>>>
>>>>
>>>> Yours,
>>>>
>>>> Age
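
The three output modes proposed above can be sketched in a few lines of Python. This only illustrates the proposal; it is not an existing tesseract feature. The `Block` type, the `render` function, and the mode names `text`, `text+blocks`, and `blocks` are hypothetical, and the four numbers are passed through as opaque integers because the post does not say whether they mean x, y, width, height or two corner points.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """One layout block: four box numbers (meaning unspecified) plus its text."""
    box: Tuple[int, int, int, int]
    text: str


def render(blocks: List[Block], mode: str = "text") -> str:
    """Render blocks in one of the three proposed (hypothetical) modes."""
    if mode not in ("text", "text+blocks", "blocks"):
        raise ValueError("unknown mode: %r" % mode)
    lines = []
    for block in blocks:
        if mode in ("text+blocks", "blocks"):
            # Marker line in the format suggested in the post.
            lines.append("[block:%d,%d,%d,%d]" % block.box)
        if mode in ("text", "text+blocks"):
            lines.append(block.text)
    return "\n".join(lines)


# The sample page from the post.
page = [
    Block((10, 20, 250, 20), "Some text"),
    Block((350, 400, 600, 410), "And yet again some other text"),
]
print(render(page, "text+blocks"))
```

Keeping the box and the text together in one record means all three modes fall out of a single loop, which is roughly what a command-line switch would have to do.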


>>>> On 20-06-11 11:56, patrickq wrote:
>>>> > You can definitely get just layout analysis before text recognition -
>>>> > look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
>>>> > structure. You can then iterate through that structure to look at
>>>> > blocks and rows within these blocks. Keep in mind that a sentence in
>>>> > the image could be broken out into separate boxes altogether if you
>>>> > have anything more complex than a simple page, so you'll have to do
>>>> > the stitching of rows in entirely different boxes yourself, based on
>>>> > their coordinates. There are even cases where you might get
>>>> > "Patrick" returned as one row containing "Ptrik" and one row containing
>>>> > "ic" - rare but happens too, especially when the text line has a slope
>>>> > (even if very moderate).
>>>> >
>>>> > Patrick
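
A minimal sketch of the coordinate-based stitching Patrick describes, written in plain Python rather than against the tesseract API. It assumes the row boxes have already been pulled out of the BLOCK_LIST as (left, top, right, bottom, text) tuples; that tuple layout, the `stitch_rows` name, and the 50% vertical-overlap threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[int, int, int, int, str]  # left, top, right, bottom, text


def stitch_rows(rows: List[Row], min_overlap: float = 0.5) -> List[str]:
    """Merge row fragments that share a text line, then order them left to right."""
    lines: List[List[Row]] = []
    for row in sorted(rows, key=lambda r: (r[1], r[0])):
        _, top, _, bottom, _ = row
        placed = False
        for line in lines:
            # Compare against the vertical extent of the first fragment on the line.
            _, l_top, _, l_bottom, _ = line[0]
            overlap = min(bottom, l_bottom) - max(top, l_top)
            shortest = min(bottom - top, l_bottom - l_top)
            if shortest > 0 and overlap >= min_overlap * shortest:
                line.append(row)
                placed = True
                break
        if not placed:
            lines.append([row])
    # Within each line, read the fragments from left to right.
    return [" ".join(r[4] for r in sorted(line, key=lambda r: r[0])) for line in lines]


fragments = [
    (300, 102, 400, 122, "world"),
    (10, 100, 290, 120, "Hello"),
    (10, 200, 200, 220, "Next line"),
]
print(stitch_rows(fragments))
```

A fixed overlap threshold is the simplest choice and will likely struggle with the sloped lines mentioned above; a real implementation would probably need to track a baseline per line instead.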
>>>> >
>>>> > On Jun 19, 4:07 pm, Prodoc wrote:
>>>> >> Hi,
>>>> >>
>>>> >> In version 3 of tesseract-ocr there's a new page layout analysis
>>>> >> module. I'm interested to learn in what way it is used and how it can
>>>> >> be used.
>>>> >>
>>>> >> Does it provide additional user functionality or is it only used
>>>> >> internally? I.e. can I query it somehow to output all recognized text
>>>> >> areas (position and dimensions) without their actual text content?
>>>> >> Does it have any influence on the mark-up of the text output? E.g.
>>>> >> additional line breaks between text in case of a new paragraph.
>>>> >> I've played with the different pagesegmode values (0-3) but it gives
>>>> >> me the exact same output for each of them. Do these settings have
>>>> >> anything to do with the layout analysis?
>>>> >>
>>>> >> If recognizing text areas is what it does but you can't output just
>>>> >> the position and dimensions of them, it would be great to see this as
>>>> >> a new feature. In a program like gImageReader you have to do this
>>>> >> manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
>>>> >> analysis is more accurate, one could use that as an input for
>>>> >> OCRFeeder again.
>>>> >>
>>>> >> Yours,
>>>> >>
>>>> >> Age Bosma
>>>> >


>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/8c929e9d-c33a-4978-a15a-1dd4f854b50b%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>


[tesseract-ocr] Re: Support for Tesseract OCR

2016-03-08 Thread Tom Morris
On Monday, March 7, 2016 at 5:55:48 AM UTC-5, Pushparag Vaidya wrote:
>
>
> We are planning to integrate the Tesseract OCR engine into one of the
> applications we are building for our internal DMS.
>
> Can someone let me know of the licensing requirements? 
>

The license is Apache v2 as specified here: 
https://github.com/tesseract-ocr/tesseract/blob/master/COPYING
It's a pretty liberal license, but you should have your lawyer review it 
for you with an eye towards whatever your specific needs are.

> Also, is there any Support Agreement that I can get into with either
> Apache (or Google)?
>

Neither Google nor the Apache Software Foundation offers commercial support,
but there are a number of developers who do commercial work with Tesseract
and may be willing to provide you with support, depending on what type of
support you're looking for.

Tom



Re: [tesseract-ocr] Re: limiting tesseract to one language

2016-03-08 Thread Tom Morris
On Tue, Mar 8, 2016 at 6:03 AM, Bojan Djuric wrote:

>
> Sorry, I tried the -c tessedit_load_sublangs="" option, which did not
> work.
>

Yes, I said that didn't work. I'd suggest trying the workaround that I said
would work, namely, unpacking the config file from srp_latn.traineddata,
editing it to remove the offending line, and repacking it.  The necessary
commands are in my original message below.

Tom

> On Monday, March 7, 2016 at 5:18:36 PM UTC+1, Tom Morris wrote:
>
>> On Mon, Mar 7, 2016 at 3:39 AM, Bojan Djuric wrote:
>>
>>> Tried that, did not work for me either :)
>>>
>>
>> I mentioned two things. Which one(s) did you try? If you tried
>> editing/replacing the config file in srp_latn.traineddata and it didn't
>> work, can you provide more details on your exact steps and the results?
>>
>>
>>> Workaround could be to copy srp (cyrillic), and osd files to another
>>> folder, and use --tessdata-dir parameter.
>>> But that would complicate things.
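
The copy-to-another-folder workaround could be scripted roughly as follows. This is a sketch under assumptions: the `isolate_languages` and `tesseract_command` helpers and the example paths are made up for illustration, the code only copies `<lang>.traineddata` files and builds a command line without running tesseract, and whether the target directory must itself be named `tessdata` may depend on your tesseract version.

```python
import shutil
from pathlib import Path
from typing import List


def isolate_languages(src_dir: str, dst_dir: str, langs: List[str]) -> Path:
    """Copy only the named <lang>.traineddata files from src_dir into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for lang in langs:
        name = lang + ".traineddata"
        shutil.copy2(Path(src_dir) / name, dst / name)
    return dst


def tesseract_command(image: str, out_base: str, tessdata_dir: Path, lang: str) -> List[str]:
    """Build (not run) a tesseract call that reads language data from tessdata_dir."""
    return ["tesseract", image, out_base, "--tessdata-dir", str(tessdata_dir), "-l", lang]


# Hypothetical usage; adjust the paths to your installation:
#   data = isolate_languages("/usr/share/tessdata", "/tmp/srp_only/tessdata", ["srp", "osd"])
#   print(" ".join(tesseract_command("page.png", "page", data, "srp")))
```

Because the isolated folder contains only the files you copied, no other language data can be picked up from it, at the cost of maintaining a second data directory.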
>>>
>>> On Sunday, March 6, 2016 at 8:27:27 PM UTC+1, Tom Morris wrote:


>>>> I was hoping you'd be able to override that on the command line, using -c
>>>> tessedit_load_sublangs="", but that doesn't seem to work with the
>>>> current order of evaluation, at least with my limited testing.
>>>>
>>>> If you have the training tools installed, you can patch your copy of
>>>> the language file by doing the following:
>>>>
>>>> $ combine_tessdata -e srp_latn.traineddata srp_latn.config
>>>> $ cp /dev/null srp_latn.config
>>>> $ combine_tessdata -o srp_latn.traineddata srp_latn.config
>>>>
>>>> That will remove the problematic line from your config (you might want
>>>> to copy srp_latn to srp_latn_only or some other name if you'd like both
>>>> behaviors available to you).
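
Note that `cp /dev/null` empties the whole extracted config, not just one line. If the config holds other settings you want to keep, a gentler variant is to drop only the `tessedit_load_sublangs` entry between the extract and repack steps. A small Python sketch, assuming the config is plain text with one whitespace-separated `name value` pair per line; the `drop_variable` helper and the sample config content are hypothetical:

```python
def drop_variable(config_text: str, name: str = "tessedit_load_sublangs") -> str:
    """Return config_text without any line that sets the variable `name`."""
    kept = []
    for line in config_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of a config line is the variable name.
        if fields and fields[0] == name:
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Invented sample content; the real srp_latn.config may differ.
sample = "tessedit_load_sublangs srp\nsome_other_variable 1\n"
print(drop_variable(sample), end="")
```

To use it, read `srp_latn.config` after the `combine_tessdata -e` step, write the filtered text back to the same file, and then run the `combine_tessdata -o` step as before.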

>>>



Re: [tesseract-ocr] Re: Page layout analysis module

2016-03-08 Thread zdenko podobny
IMO it is - in hocr (xml) output or tsv (in master branch a.k.a 3.05)

Zdenko
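
For anyone finding this later, a rough sketch of pulling block boxes out of the tsv output with Python. It assumes the tsv file has the tab-separated columns `level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text`, with `level` 2 marking a block row, and that the file was produced by adding the `tsv` config name to a command such as `tesseract sample2.jpg output -l eng -psm 3 tsv`. The sample data is invented and the `block_boxes` helper is illustrative, not part of tesseract.

```python
import csv
import io
from typing import List, Tuple


def block_boxes(tsv_text: str) -> List[Tuple[int, ...]]:
    """Return (left, top, width, height) for every block-level row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    boxes = []
    for row in reader:
        # Assumed level codes: 1=page, 2=block, 3=paragraph, 4=line, 5=word.
        if row["level"] == "2":
            boxes.append(tuple(int(row[k]) for k in ("left", "top", "width", "height")))
    return boxes


# Invented sample in the assumed column layout.
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "2\t1\t1\t0\t0\t0\t10\t20\t250\t20\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t60\t20\t95\tSome\n"
    "2\t1\t2\t0\t0\t0\t350\t400\t600\t410\t-1\t\n"
)
for box in block_boxes(SAMPLE):
    print("[block:%d,%d,%d,%d]" % box)
```

Filtering on the level column gives the "just all blocks" listing asked for earlier in the thread without touching the recognized text.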


[tesseract-ocr] Re: Page layout analysis module

2016-03-08 Thread Age Bosma
Hi Teng,

The options I mention aren't available in tesseract. I listed them as 
suggestions for extending tesseract. They haven't been implemented as far 
as I know.

Best regards,

Age




Re: [tesseract-ocr] Re: limiting tesseract to one language

2016-03-08 Thread Bojan Djuric

Sorry, I tried the -c tessedit_load_sublangs="" option, which did not work.

