Partly answering my own last question: I've found a way of getting the
alternatives for each character, but it doesn't seem to work in 3.01,
according to my own tests and this issue:
http://code.google.com/p/tesseract-ocr/issues/detail?id=714

My snippet:
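#include <api/resultiterator.h>
...
tess_api.SetVariable("save_blob_choices", "T");
...
tesseract::ResultIterator* it = tess_api.GetIterator();
if (it != NULL) {  // no iterator means there are no results to walk
  do {
    char* uval = it->GetUTF8Text(tesseract::RIL_SYMBOL);  // best choice
    cout << (uval == NULL ? "#" : uval) << "("
         << it->Confidence(tesseract::RIL_SYMBOL) << "){";
    tesseract::ChoiceIterator ci(*it);  // alternatives for this symbol
    do {
      const char* val = ci.GetUTF8Text();
      cout << " " << (val == NULL ? "#" : val) << " " << ci.Confidence();
    } while (ci.Next());
    cout << "}";
    delete[] uval;  // GetUTF8Text() returns a string the caller must free
  } while (it->Next(tesseract::RIL_SYMBOL));
  delete it;  // the iterator is owned by the caller too
}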
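
In case it is useful, here is a minimal sketch of how such alternatives could
then be matched against a fixed format, along the lines of what Dmitri
describes in his message quoted below. It does not call Tesseract at all:
Choice, FitsClass and BestMatch are hypothetical names of mine, the pattern
alphabet ('L' = uppercase letter, 'D' = digit) is an assumption and is not the
user patterns syntax from dict/trie.h, and I assume that a higher confidence
value means a better choice.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One recognition alternative for a symbol (hypothetical type, filled by the
// caller from whatever the OCR engine reports).
struct Choice {
  char symbol;
  float confidence;  // assumption: higher means more likely
};

// Returns true if c belongs to the pattern class cls.
// Assumed classes: 'L' = uppercase letter, 'D' = digit.
static bool FitsClass(char c, char cls) {
  const unsigned char uc = static_cast<unsigned char>(c);
  if (cls == 'L') return std::isupper(uc) != 0;
  if (cls == 'D') return std::isdigit(uc) != 0;
  return false;
}

// For each position, picks the highest-confidence alternative that the
// pattern allows. Returns an empty string if the lengths differ or if some
// position has no alternative of the required class.
std::string BestMatch(const std::vector<std::vector<Choice> >& symbols,
                      const std::string& pattern) {
  if (symbols.size() != pattern.size()) return "";
  std::string out;
  for (std::size_t i = 0; i < symbols.size(); ++i) {
    const Choice* best = NULL;
    for (std::size_t j = 0; j < symbols[i].size(); ++j) {
      const Choice& c = symbols[i][j];
      if (FitsClass(c.symbol, pattern[i]) &&
          (best == NULL || c.confidence > best->confidence)) {
        best = &c;
      }
    }
    if (best == NULL) return "";
    out += best->symbol;
  }
  return out;
}
```

For a result like SM529B, a pattern such as "LLLDDDD" would pick an "8" over
the "B" in the last position, provided the "8" shows up among the
alternatives. It cannot repair the "AA" to "M" segmentation error, because
the number of symbols no longer matches the pattern.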
On Monday, May 6, 2013 01:50:42 UTC-3, Andres wrote:
>
> Hi Dmitri,
>
> Many thanks for your hints, as always.
>
> Sorry about the links in my previous message. I'm reposting the entire
> message below with the links fixed.
>
> I like the method you say you use in CustomOCR. Is there a way of getting
> the character variants without resorting to a hack? As far as I can see,
> the API only exposes the confidence level for each character. Am I right
> about this?
>
> Regarding the psm mode, I'm setting it from inside my code to 7, which is
> "Treat the image as a single text line". Is that the parameter you are
> suggesting?
>
> Anyway, I think I may have made some big newbie errors in my training, so
> I would be grateful if you could take a look at my training image and my
> problematic image, in case you spot an obvious error at first sight.
>
> My training image:
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>
> Problematic image (a "6" recognized as a "5"):
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>
> Another problematic image ("A A" recognized as "M")
> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit
>
> The following is my original message with the links fixed:
>
> Dear people,
>
> I trained Tesseract for my font (FE-Schrift:
> http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
> results. I am using Tesseract 3.01 under Windows.
>
> In this image:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>
> The text is SAA5298 but I’m getting SM529B. This is being done from inside
> a program, and I know that the “M” in the result comes from the “AA” in
> the source. So Tesseract is segmenting these two characters very badly,
> even though they are very well separated, as you can see. Do you have any
> idea why this is happening? On the other hand, is there a way to give
> Tesseract a hint for this (e.g., telling it the character width)?
>
> The other problem is with this one:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>
> The text is LDA6244, but Tesseract recognizes a “5” instead of the “6”,
> even though the image is very good.
>
> Here is my fonts training file:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>
> Here is my box file:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>
> Here is my .traineddata file:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>
> And here is a .cmd file for testing these 2 images:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>
>
>
> Thanks,
>
> Andres
>
> On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
>>
>> Andres,
>>
>> First of all, your first link seems to point to a "traineddata" file
>> instead of an image. Second, without diving deep into your problem, I
>> can suggest specifying the single-line psm mode on the command line.
>> And finally, you can use the user patterns feature to restrict the
>> possible output of Tesseract (for the format, see the comments in
>> dict/trie.h on read_pattern_list()). Another way of achieving the
>> latter, which is what we do in CustomOCR and which seems to be more
>> reliable, is to use the API to get a number of character variants for
>> each blob along with their confidence levels and match them against a
>> set of possible patterns. You can find out how to do this by searching
>> this forum.
>>
>> HTH and good luck with Tesseract!
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>> On Fri, May 3, 2013 at 8:24 PM, Andres <[email protected]> wrote:
>> > Dear people,
>> >
>> > I trained Tesseract for my font (FE-Schrift:
>> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
>> > results. I am using Tesseract 3.01 under Windows.
>> >
>> > In this image:
>> >
>> >
>> https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing
>> >
>> > The text is SAA5298 but I’m getting SM529B. This is being done from
>> > inside a program, and I know that the “M” in the result comes from the
>> > “AA” in the source. So Tesseract is segmenting these two characters
>> > very badly, even though they are very well separated, as you can see.
>> > Do you have any idea why this is happening? On the other hand, is
>> > there a way to give Tesseract a hint for this (e.g., telling it the
>> > character width)?
>> >
>> > The other problem is with this one:
>> >
>> >
>> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>> >
>> > The text is LDA6244, but Tesseract recognizes a “5” instead of the
>> > “6”, even though the image is very good.
>> >
>> >
>> >
>> > Here is my fonts training file:
>> >
>> >
>> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>> >
>> > Here is my box file:
>> >
>> >
>> https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>> >
>> > Here is my .traineddata file:
>> >
>> >
>> https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>> >
>> > And here is a .cmd file for testing these 2 images:
>> >
>> >
>> https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Andres
>> >
>> > --
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To post to this group, send email to [email protected]
>> > To unsubscribe from this group, send email to
>> > [email protected]
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en
>> >
>> > ---
>> > You received this message because you are subscribed to the Google
>> Groups
>> > "tesseract-ocr" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> an
>> > email to [email protected].
>> > For more options, visit https://groups.google.com/groups/opt_out.
>> >
>> >
>>
>