[tesseract-ocr] Break down pedigree

2018-05-27 Thread Daron Goode
Hello,

I am new to Tesseract and could use some guidance on how an experienced person 
would tackle this issue.  I have a PHP website where I can get the data out 
of a PDF without any issues, but the order of the data that I am pulling is 
a mess.  The issue is that the return is only one long string without any 
return characters or other way to break it down into parts.  I was going to 
slice the PDF into several chunks and run each one through OCR at a time, but 
I find that Tesseract has the power to do what I need it to do.  Also, with 
the 1000s of times users will be uploading a new PDF, it might not line 
up exactly the way I need it to. 

My end goal is to be able to update all these values to my database in the 
order they are related.  For the 4th generation that would be 31 different 
areas to scoop up the data I need.  If these are in order with an X 
coordinate I can always use that and work my Y values down.  

Even if all I had to work with were a \n character for each line, I might be 
able to make that work.  
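
A minimal sketch of the coordinate idea above, assuming the page image is run 
through Tesseract with the tsv output config (for example 
"tesseract page.png out tsv"), which writes one row per word with left/top 
pixel positions.  The function name lines_from_tsv and the column_width 
parameter are hypothetical, and the bucketing by column_width is only a rough 
stand-in for however the pedigree columns are really laid out:

```python
import csv
import io
from collections import defaultdict


def lines_from_tsv(tsv_text, column_width=200):
    """Group the word rows of Tesseract TSV output into text lines.

    Returns a list of dicts with 'left', 'top' and 'text', ordered by
    column (left // column_width) and then from top to bottom.
    column_width is an assumed pixel width of one pedigree column.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    grouped = defaultdict(list)
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; other levels describe
        # pages, blocks, paragraphs and lines and carry no text.
        if row["level"] != "5" or not text:
            continue
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        grouped[key].append((int(row["left"]), int(row["top"]), text))

    lines = []
    for words in grouped.values():
        words.sort()  # left to right within the line
        lines.append(
            {
                "left": words[0][0],
                "top": min(word[1] for word in words),
                "text": " ".join(word[2] for word in words),
            }
        )
    # Walk the columns left to right, and each column top to bottom.
    lines.sort(key=lambda line: (line["left"] // column_width, line["top"]))
    return lines
```

Each returned entry keeps its own X and Y, so the lines can be matched to 
pedigree slots without cropping the page first.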

On the 4-generation pedigree I tried cropping out the entire last (4th) 
generation.  If I go that route, that would only be 6 crops I need to 
make on this (1 for the dog, two for each of those parents, and then each 
generation).  My users will have 3 or 4 generation pedigrees.  

Any advice would be greatly appreciated. 
Thanks
Daron





-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/eb8d5420-67be-44b0-aec7-c6de7b78f758%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: German "Straße" is often "StraBe" (tesseract 4.0)

2018-05-27 Thread Quan Nguyen
Latin.traineddata can be found under the script folder.

https://github.com/tesseract-ocr/tessdata_fast
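
A rough sketch of how the script model might be selected from Python, assuming 
the tesseract binary is on PATH and Latin.traineddata has been copied into a 
script subfolder of the tessdata directory, so that it can be named as 
script/Latin with -l.  The helper names build_tesseract_command and run_ocr 
are hypothetical:

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="script/Latin"):
    """Return the argument list for one tesseract run.

    lang="script/Latin" assumes tessdata/script/Latin.traineddata exists.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="script/Latin"):
    """Run tesseract and return the recognized text as a string."""
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    # tesseract appends ".txt" to the output base for plain-text output.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Running the same image once with lang="deu" and once with lang="script/Latin" 
makes it easy to compare how often "Straße" comes out correctly.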

On Friday, May 25, 2018 at 5:02:29 AM UTC-5, Thomas Güttler wrote:
>
> Hi Shree,
>
> what do you mean by "script/Latin traineddata"? I am new to tesseract 
> and use version 4.0 via docker.
> Most internet pages are about tesseract 3.0.x. 
>
> I am unsure where to start.
>
> Maybe it is better to use 3.0.x?
>
> Regards,
>   Thomas
>
> Am Donnerstag, 24. Mai 2018 13:41:30 UTC+2 schrieb shree:
>>
>> Please try with script/Latin traineddata to see if you get better results.
>>
>> I have added your comment to issue at 
>> https://github.com/tesseract-ocr/langdata/pull/54
>>
>>
>>
>> On Thursday, May 24, 2018 at 5:05:55 PM UTC+5:30, Thomas Güttler wrote:
>>>
>>> I use tesseract 4.0 via docker (tesseractshadow/tesseract4re)
>>>
>>> Very often tesseract detects "StraBe" instead of "Straße".
>>>
>>> Yes, I use -l=deu
>>>
>>> The word "Straße" is very common in German. It means "street".
>>>
>>> Since "StraBe" makes no sense I would like to improve this.
>>>
>>> What do you suggest?
>>>
>>>
>>>



[tesseract-ocr] Re: Training error "Couldn't find a matching blob"

2018-05-27 Thread Quan Nguyen
You need a much larger set of samples, in the hundreds or at least several 
dozen per symbol, so that even if some instances of a symbol fail with 
"Couldn't find a matching blob" errors, other instances still get picked up.
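
As a sketch of how to check this, the box files can be scanned to count how 
many samples each symbol has.  This assumes the usual box file layout of one 
"symbol left bottom right top page" entry per line; the function names and 
the minimum threshold are hypothetical and should be tuned to your data:

```python
from collections import Counter


def count_box_symbols(box_text):
    """Count how often each symbol appears in box file text."""
    counts = Counter()
    for line in box_text.splitlines():
        # Split from the right so the symbol stays in one piece;
        # the last five fields are left, bottom, right, top and page.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        counts[parts[0]] += 1
    return counts


def underrepresented(counts, minimum=36):
    """Return the symbols that have fewer than `minimum` samples."""
    return sorted(symbol for symbol, n in counts.items() if n < minimum)
```

Concatenating the text of all .box files and passing it to count_box_symbols 
shows which symbols, such as the circled M, still need more samples.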

On Saturday, May 26, 2018 at 1:52:39 AM UTC-5, Paul Kitchen wrote:
>
> I am creating training data for GD symbols using Tesseract 3.05.01. One 
> of my TIFF files I use for training is in the attached 
> gdt.symbols.exp10.tif. When I attempt to use this TIFF with the 
> corresponding gdt.symbols.exp10.box, I get this output:
>
> Tesseract Open Source OCR Engine v3.05.01 with Leptonica
> Page 1
> FAIL!
> APPLY_BOXES: boxfile line 7/Ⓜ ((1153,69),(1431,346)): FAILURE! Couldn't 
> find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 10/Ⓜ ((1993,69),(2268,346)): FAILURE! Couldn't 
> find a matching blob
> APPLY_BOXES:
>Boxes read from boxfile:  10
>Boxes failed resegmentation:   2
>Found 8 good blobs.
> Generated training data for 5 words
>
>
> Basically, both circled M symbols are failing.
>
> I've attached ImagesWithBoxes.PNG which is a screen capture from 
> jTessBoxEditor showing the TIFF image with boxes. As you can see, the boxes 
> appear to be correct.
>
> Why isn't tesseract able to use the circled M symbols for training? Can I 
> change the image of the symbols somehow to help tesseract... maybe connect 
> the circle and M parts with a line?
>
> Thanks in advance.
>
