Hi Misti,
Thanks for the info.
Will have a look at that.
Yes, getting a good picture as a blind person isn't all that easy.
Which output format would best preserve the formatting, headings
and other structure? hOCR?
Greetings,
Simon
From: tesseract-ocr@googlegroups.com On behalf
of the box with tesseract?
Can tesseract also recognize tables and headings?
A few years ago one had to preprocess the images first.
Is that still the case?
Greetings,
Simon
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To u
Hello everybody,
I just finished fine tuning according to Ray's tutorial.
I did the following steps:
1. I used tesstrain.sh to create training data and the starter
traineddata. The training data consists of eng.training_text with the
± character added multiple times.
m 14:19:33 UTC+1:
> You need to look at it in the unicode list.
>
> On Sat, Jan 20, 2024, 3:50 PM Simon wrote:
>
>> Hey thanks for the response!
>>
>> How exactly do I add characters to the unicharset?
>>
>> Typically the unicharset has to follow a spe
wrote on Friday, 19 January 2024 at 16:22:24 UTC+1:
> Yes, you need to add them before you create the starter model. You can
> edit the Latin.unicharset before you run the combine command.
>
> On Fri, Jan 19, 2024, 5:27 PM Simon wrote:
>
>> Ok somehow I had "no
sed:
When I try to train some new characters, do I have to add them to the
Latin.unicharset before I create the starter traineddata, or do I just add
these characters to the generated unicharset after I have created the starter
traineddata?
Simon wrote on Friday, 19 January 2024 at 10:38:24 UTC+
ages or
something that could give me more insights on why it didn't work?
Simon wrote on Thursday, 18 January 2024 at 11:11:52 UTC+1:
> Hello everybody,
>
> I have a question regarding "Fine Tuning +- a few characters".
>
> In general the instructions on
>
Hello everybody,
I have a question regarding "Fine Tuning +- a few characters".
In general the instructions on
https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters
say that you have to make a starter traineddata from the unicharset, but
Hello everybody,
currently I am trying to train just a few layers of the
eng_best.traineddata file. I already created 30,000 box gt.txt and .tif
files for training specifically for my problem.
As I tried to follow the instructions for training tesseract 4
I just noticed that the second picture I attached should have been the following one. In that
one you can see the .box file information.
[image: GoogleGroupsQuestion2.png]
Simon wrote on Sunday, 3 December 2023 at 10:38:51 UTC+1:
> Hello everybody,
>
> is anyone familiar with jTessBoxEditor?
> I
Hello everybody,
is anyone familiar with jTessBoxEditor?
I am currently generating synthetic training data. The
synthetically created tif files contain the numbers to be trained. The same
program also creates the .box files automatically. But somehow the box
coordinates jTessBoxEditor shows
:
>
> Hi Simon, yes, I think the instructions you can give to the segmentation
> step are quite limited, mostly the PSM parameter and I suppose a few minor
> ones. There is something about tables but I've never used it and yours
> might be too small for this to work. Yes, you should be a
?
Lorenzo Blz wrote on Friday, 24 November 2023 at 10:45:14 UTC+1:
> Hi Simon,
> if I understand correctly how tesseract works, it follows these steps:
>
> - it segments the image into lines of text
> - it then takes each individual line and slides a small window, 1px wide I
Thanks a lot!
This is not possible with the tesstrain repository, right?
desal...@gmail.com wrote on Thursday, 23 November 2023 at 10:28:26
UTC+1:
> If the original model lacks the ∠ symbol, fine tuning is not going to add
> it for you. We have all gone through that process. To introduce a
Alright,
this might be a little bit of a dumb question, but where exactly can I see
the CER?
2 Percent improvement time=56, best error was 12.49 @ 8294
At iteration 8350/1/1, Mean rms=2.701%, delta=2.491%, char
train=10.385%, word train=24.4%, skip ratio=0%, New best char error =
As I learned, the list.train and list.eval files must contain the paths to
the lstmf files. Also, when you are using tesseract on Linux, make sure the
line endings in these files are LF and NOT the Windows standard CRLF.
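
The line-ending point can be checked with a small Python sketch. It is only an illustration: the function name `normalize_line_endings` and the `.lstmf` paths in the usage part are made up, and it assumes the list file is small enough to read into memory.

```python
import tempfile
from pathlib import Path


def normalize_line_endings(path):
    """Rewrite the file with LF line endings; return True if it was changed."""
    raw = path.read_bytes()
    fixed = raw.replace(b"\r\n", b"\n")
    if fixed != raw:
        path.write_bytes(fixed)
        return True
    return False


# Usage on a throwaway file (hypothetical .lstmf paths, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "list.train"
    list_file.write_bytes(b"data/a.lstmf\r\ndata/b.lstmf\r\n")
    changed = normalize_line_endings(list_file)
    result = list_file.read_bytes()
```

Working on bytes rather than text avoids any implicit newline translation by Python itself.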
Maybe this will help you:
If I need to train new characters that are not recognized by a default
model, is fine tuning the right approach in this case?
One of these characters is the symbol for angularity: ∠
These symbols appear in technical drawings and should be recognised in
those. E.g. for the scenario in the
While training my model I came across the following metrics:
E.g.:
At iteration 6345/6500/6500, Mean rms=6.246%, delta=7.139%, char
train=68.07%, word train=92.2%, skip ratio=0%, New best char error = 68.07
wrote checkpoint.
Unfortunately I can't find any proper and detailed
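
For reading these numbers programmatically, here is a hedged Python sketch. It assumes, as far as I understand the lstmtraining output, that "char train" is the character error rate (CER) on the training data and "word train" the word error rate. The names `LOG_PATTERN` and `parse_training_line` are made up for this example, and the regular expression only covers the log format quoted above.

```python
import re

# Matches the progress line quoted in the message above.
LOG_PATTERN = re.compile(
    r"At iteration (?P<iteration>\d+)/\d+/\d+, "
    r"Mean rms=(?P<rms>[\d.]+)%, delta=(?P<delta>[\d.]+)%, "
    r"char train=(?P<char_error>[\d.]+)%, word train=(?P<word_error>[\d.]+)%"
)


def parse_training_line(line):
    """Return the metrics of one progress line as floats, or None if no match."""
    # Collapse whitespace first, because mail clients wrap the log lines.
    match = LOG_PATTERN.search(" ".join(line.split()))
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


sample = (
    "At iteration 6345/6500/6500, Mean rms=6.246%, delta=7.139%, char\n"
    "train=68.07%, word train=92.2%, skip ratio=0%, New best char error = 68.07"
)
metrics = parse_training_line(sample)
print(metrics["char_error"], metrics["word_error"])
```

Under that assumption, `metrics["char_error"]` is the value to watch when asking where the CER can be seen.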
As it is not really possible to combine my from-scratch traineddata with
an existing one, I have decided to also train my model on numbers.
For this I wrote a script which synthetically generates ground-truth data
with text2image.
This script uses dozens of different fonts and
ur installed 'eng'
> database is doing what it's supposed to, on its own, first.
>
> The next sane thing to try is flipping them around, ie "eng+gdt" instead
> of "gdt+eng", to see if results change and /how/, as that might give us all
> a hint about what's going on
Hello everybody,
right now I am working with tesseract to train it on new symbols. For this I
used tif pictures containing only the desired symbol. I trained with
the tesstrain repository and about 4000 training images. At the end of the
procedure I got the traineddata file for my model Common_gdt.
I am using "tesseract file1.png stdout -l osd --psm 0".
With some images that are correctly oriented it reports 180 degrees. It
gets it right if I rotate the images by 90, 180 or 270 degrees.
The images are lists with numbers and names in English.
Is there any way to improve performance on this?
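
One possible way to handle such cases is to ignore orientation results with a low confidence value. This is only a sketch and not a confirmed fix for the problem described. It assumes the OSD report consists of "key: value" lines such as "Rotate" and "Orientation confidence"; the threshold of 5.0, the function names and the two sample reports are invented for illustration and do not come from a real run.

```python
def parse_osd(report):
    """Turn the 'key: value' lines of an OSD report into a dictionary."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def trusted_rotation(report, min_confidence=5.0):
    """Return the suggested rotation, or 0 if the confidence is too low."""
    fields = parse_osd(report)
    confidence = float(fields.get("Orientation confidence", 0))
    if confidence < min_confidence:
        return 0  # treat the page as already upright
    return int(fields.get("Rotate", 0))


# Illustrative reports with invented values.
low_confidence_report = (
    "Orientation in degrees: 180\nRotate: 180\n"
    "Orientation confidence: 0.42\nScript: Latin\nScript confidence: 1.10\n"
)
high_confidence_report = (
    "Orientation in degrees: 180\nRotate: 180\n"
    "Orientation confidence: 12.30\nScript: Latin\nScript confidence: 4.50\n"
)
print(trusted_rotation(low_confidence_report))
print(trusted_rotation(high_confidence_report))
```

A suitable threshold would have to be found by trying it on a representative set of pages.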
/tessdata/issues?utf8=%E2%9C%93=is%3Aissue+is%3Aopen+Fraktur
>
> Zdenko
>
>
> On Wed, 2 Oct 2019 at 11:58, Akos Simon wrote:
>
>> training tesseract
>>
>> Tesseract is OCR text recognition software that can be trained.
>> I have gott
confused here,
hopefully, this will change with your help ? .. ;)
Thanks, Zdenko !!
On Wednesday, October 2, 2019 at 7:38:08 AM UTC+2, zdenop wrote:
>
> Why do you think training will help you? What other options have you tried?
>
> Zdenko
>
>
> On Wed, 2 Oct 2019 at 7:26, Ak
I am looking for OCR recognition of Fraktur fonts with Tesseract OCR.
I installed VietOCR v5.5.2 and Tesseract 4.1.0 on my Mac, and now
I am trying to find help on how to train it better, as there are too many
OCR errors...
How would I go about training the software? Can anyone
Thanks for the fix.
Greetings,
Simon
Kind regards
Simon Eigeldinger
IT
Nebengebäude 1, OG1
Stadt Hohenems
Kaiser-Franz-Josef-Straße 4
6845 Hohenems
T: +43 5576 7101-1143 | E: simon.eigeldin...@hohenems.at | www.hohenems.at
This message and any
I found the answer: set -l osd.
I am using tesseract4 and everything works fine with English. However, tesseract4
cannot detect page orientation, so I want to use tesseract3 for this.
I thought I just had to run tesseract --oem 0, but now it says
"couldn't load any languages"
Is there a way to use tesseract3 whilst tesseract4
Hi,
as a blind person i wonder how i might use some of those tools,
because you can't see if the pic is good or bad.
we might actually need something that does that automatically.
any ideas on that?
Greetings and thanks,
Simon
On 06.03.2018 at 02:42, Michael Smith wrote:
I just do some
Hm.
I guess i just ship all 3 of them. *lol*
and add the text of the wiki to the readme.
Greetings,
Simon
On 04.03.2018 at 18:43, ShreeDevi Kumar wrote:
The traineddata files in tessdata_best are larger in size and OCR takes
more time. They are supposedly slightly more accurate
Hi ShreeDevi,
I have scrapped the cygwin builds.
i am now using the builds i get from appveyor, which just
need me to repackage the resulting files.
so tessdata_best isn't for better accuracy, as the wiki says?
greetings,
Simon
On 03.03.2018 at 05:12, ShreeDevi Kumar wrote:
Hi
.
is that 3rd set still usable, or should that one not be used anymore?
on the wiki
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
it's still listed as usable.
Any suggestions?
Greetings and thanks,
Simon
---
This email was checked for viruses by Avast antivirus software.
https
I guess i have to correct myself.
German Fraktur is in the tessdata repo.
On 18.10.2017 at 21:34, 'Simon Eigeldinger' via tesseract-ocr wrote:
Hi Yury,
Maybe the same happened to it as to the German Fraktur data.
they seem not to have been updated for a long time and they have been
removed
Hi Yury,
Maybe the same happened to it as to the German Fraktur data.
they seem not to have been updated for a long time and they have been
removed from the main repos.
Greetings,
Simon
On 18.10.2017 at 19:15, Yury Tarasievich wrote:
Hi guys,
I may be wrong but the Russian tessdata does
Hi,
I have packaged a new version of tesseract with various tessdata files.
i am still looking for the best place to upload it, though.
Greetings,
Simon
On 11.10.2017 at 21:02, Dung Tran wrote:
Hello,
I am a newbie and I tried to install Tesseract on my Windows machine. I
Hi,
Thanks for the info.
Greetings,
Simon
On 17.09.2017 at 19:16, ShreeDevi Kumar wrote:
Simon,
There is a significant difference in speed. Depending on the language, the
difference in accuracy may be minimal or more.
You should compare both for a representative sample to see which
Hi ShreeDevi,
Thanks for the info.
So it seems that blind people who need the best accuracy should use
tessdata_best.
Greetings,
Simon
On 17.09.2017 at 16:52, ShreeDevi Kumar wrote:
Please see
https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-329667239
ShreeDevi
.
and there is the tessdata repo.
what will happen to that one in the future?
Greetings and thanks for helping,
Simon
I have some text "Morgan, L 220". There is a horizontal
line crossing it out.
Due to the crossing out it is difficult for OCR to identify word
boundaries. Therefore I am combining the content of a whole row and then
parsing it myself e.g. treating a big space as a separator.
I have columns like this
356 Smith 23123 Jones12
123 Jacks 19124 Barnes 10
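
The idea of treating a big space as a separator could look roughly like the following Python sketch. The function name `split_row` and the `min_gap` parameter are made up, and the spacing in the sample rows is assumed, since the original column spacing is not preserved in this archive.

```python
import re


def split_row(row, min_gap=2):
    """Split a text row into cells wherever at least `min_gap` spaces occur."""
    pattern = r"\s{%d,}" % min_gap
    return [cell for cell in re.split(pattern, row.strip()) if cell]


# Sample rows with assumed spacing between the columns.
rows = [
    "356  Smith   23123   Jones12",
    "123  Jacks   19124   Barnes 10",
]
parsed = [split_row(row) for row in rows]
for cells in parsed:
    print(cells)
# ['356', 'Smith', '23123', 'Jones12']
# ['123', 'Jacks', '19124', 'Barnes 10']
```

A single space, as in "Barnes 10", does not split a cell, so merged or closely spaced values still need separate handling.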
Wordboxes are correctly identified for all names/numbers for some pages.
However on other pages there are numerous missing boxes for columns of
numbers especially the last
when we do that every commit that
happens to the tesseract repo, especially when more per day happen.
i wonder if it would make more sense to build it once a day. would that be a
good idea?
greetings and thanks for adding the builds,
simon
On 07.02.2017 at 15:49, Egor Pugin wrote:
@egorpugin
together and
sometimes you kind of miss the last line, which seems not to be included
in the pdf as text but is in the txt file if you create it.
is it possible to compile the training tools as well?
greetings and thanks,
Simon
the artifacts for win32 and win64.
can someone post some info about how to use them?
Greetings and thanks a lot,
Simon
, Simon Eigeldinger
wrote:
Hi all,
can i use tesseract to open multipage pdfs directly?
we have multi function printers which produce pdfs with images which can
be run through ocr.
how can i accomplish that with tesseract?
do i need a second program for that?
greetings,
simon
for tesseract so i guess i will build my own builds then
which i will share with people.
i guess i would build 32 and 64 bit versions if i can with one install.
greetings,
simon
On 15.05.2016 at 13:10, Marco Atzeri wrote:
On 15/05/2016 12:33, Simon Eigeldinger wrote:
Hi all,
i am thinking
.
do i get a performance boost when i compile tesseract as 64 bit?
i also don't know if i can install cygwin 32 and 64 bit on the same
system or if i just need cygwin 64 bit to also compile 32 bit programs.
greetings,
simon
--
Simon Eigeldinger
Follow me on Twitter: http://www.twitter.com
Hi,
I have tesseract running on my linux box.
i want to compile a windows version under linux.
greetings,
simon
On 14.05.2016 at 18:09, ShreeDevi Kumar wrote:
There is an archlinux distribution for tesseract - see
https://www.archlinux.org/packages/community/i686/tesseract/
ShreeDevi
have?
Which dependencies do i need?
I have never done that and would be grateful for some hand holding.
I just compiled stuff on cygwin.
Thanks and greetings,
Simon
and greetings,
Simon
Thanks for the info.
might have a look at this.
greetings,
simon
On 05.03.2016 at 05:09, ShreeDevi Kumar wrote:
https://github.com/Alexpux/MINGW-packages/blob/master/mingw-w64-tesseract-ocr/PKGBUILD
Modify the pkgbuild to use the latest source.
ShreeDevi
builds daily builds
from the source.
greetings,
simon
--
Simon Eigeldinger
Follow me on Twitter: http://www.twitter.com/domasofan/
E-Mail: simon.eigeldin...@vol.at
MSN: simon_eigeldin...@hotmail.com
ICQ: 121823966
Jabber: domaso...@andrelouis.com
Hi all,
I am new to tesseract and I get an error when I want to compile a program
that uses the tesseract API and MySQL in C++. I have this error:
In file included from /usr/include/mysql/mysql.h:75:0, from ocr.cpp:4:
/usr/include/mysql/my_list.h:26:3: error: conflicting
hi,
i did test it 2 days ago and it seems to work.
at least over here and on a windows 7 machine in the office.
but i could recheck again.
greetings,
simon
On 24.07.2015 at 08:50, zdenko podobny wrote:
it is not about input, but output.
pdf output is a key feature of the leptonica 1.71 release
which seem to contain everything but shows the warning
message.
i recompiled a new version on my fake website so people can play with
the training tools as well.
so and now i am off for 2 weeks.
have a nice time while i am not around.
greetings,
simon
On 24.07.2015 at 08:50, zdenko podobny wrote
hi,
and i just opened a ticket:
https://github.com/tesseract-ocr/tesseract/issues/61
greetings,
simon
On 23.07.2015 at 23:23, Jim O'Regan wrote:
On 23 July 2015 at 19:02, Simon Eigeldinger simon.eigeldin...@vol.at wrote:
Hi all,
pango_font_info.cpp:223:46: error: 'strcasestr
hi,
thanks for the info.
so i guess then i might recompile my windows builds in debug mode then?
greetings,
simon
On 23.07.2015 at 21:11, zdenko podobny wrote:
1. Well, if someone compiles code from git, (s)he should know which revision
(s)he is using ;-) And of course git code (unreleased
.exe
around 13.4 mb.
german and english data files.
tesseract-all-langs-win-git-20150721.exe
around 351.7 mb.
all the data files for tesseract which it can use at the moment.
Let's see if it works.
i have had no time to test it yet but will do so in the office tomorrow.
greetings,
simon
On
://bhajans.ramparivar.com
On Fri, Nov 14, 2014 at 7:12 PM, Simon Støvring simonst...@gmail.com
wrote:
Hello,
I am trying to recognize single characters written with the Gotham Bold
font. I have trained Tesseract by following Michael Jay Lissner's guide
Adding New Fonts to Tesseract 3 OCR
The letters will always be uppercase, so capitalization is not really an
issue.
I can try to lay out the letters in a straight line and use the line mode.
However, I need to know the location of each character, that is, which row
and column it is placed on. If Tesseract fails recognizing a single
://bhajans.ramparivar.com
On Sat, Nov 15, 2014 at 3:39 PM, Simon Støvring simonst...@gmail.com
wrote:
I have tried with the English traineddata and got similar results.
However, I had not tried recognizing the entire 'prepared-image' with psm 6
and I see that gives pretty good
to match correctly but generally it's just not
good enough and I'd like to know if there's any way I can improve it.
Should I train differently? Should I pass other configurations or should I
process the images before trying to recognize the characters?
Best regards,
Simon B. Støvring
.
---
This email is free of viruses and malware because avast! Antivirus protection
is active.
http
hi,
is there a guideline on what to do with poor-quality pics?
i am blind so i have no clue what sighted people do with those. *smile*
and it seems tesseract can't do much about pic quality.
maybe imagemagick might be a good choice for fixing things?
greetings,
simon
On 24.10.2014 at 23:02
?
thanks.
greetings,
simon
btw forgot to say thanks to robert melton for telling me about this
script or at least for googling for that.
wonder if we can get something awesome out of things.
greetings,
simon
.
greetings,
simon
On 15.10.2014 at 18:06, Chris Cameron wrote:
All the files I mention can be found here:
https://www.dropbox.com/sh/v5w4zl0c2z1wra1/AACxjmomYL4o-iQEhBrLvNgHa
Incidentally, I now see that Chrome's PDF viewer is also unable to search
the PDF.
Thanks,
Chris
format is 4; unreadable
Error during processing.
tested with the eurotext.tif file from the testing directory on a
windows system.
compiled with cygwin.
https://dl.dropboxusercontent.com/u/1598766/tesseract-error.7z
greetings,
simon
as well.
greetings,
simon
# build and install tesseract
make install
---script end---
back in 2012 i used to run
make -j 4
before running
make install
after make install i ran make training, but the training tools
seem to have some compilation issues.
greetings,
simon
On 01.10.2014 at 16:32, Wes
hi,
maybe it gets hardcoded when you use --prefix with the configure script.
but i guess that's the case with every program you use it with.
greetings,
simon
On 19.02.2012 at 14:22, zdenko podobny wrote:
Hi,
I am not aware of any hardcoded path in tesseract excluding one
variable: configure set
hi,
well, the last time i was able to do that successfully was
31 december 2011, with r644 i guess.
i might try again with r678.
greetings,
simon
On 19.02.2012 at 18:09, Sriranga(78yrsold) wrote:
Zdenko,
Just now I svn updated up to r676 in Linux (since I don't know
Hi Zdenko,
the last time i tried to compile it was at the end of december 2011, and it
worked that way then. maybe the code can't be compiled right now. at that
time i compiled SVN r644, and the last time i tried r675.
But thanks. maybe i just need to wait a little bit.
Thanks,
Greetings,
Simon
On
i specify. can i also
specify something so that it's not so hard-wired to paths? because when i give
the binaries to another person they have to install them to the same place.
greetings,
simon
On 18.02.2012 at 08:44, zdenko podobny wrote:
I am not a cygwin user so just some ideas:
- cygwin
Hello,
I want to compile Tesseract from SVN under cygwin.
Can someone tell me how to do that?
Greetings,
Simon