[tesseract-ocr] Newbie question: Bad results on a Korean case

2023-12-04 Thread 'Nick S.' via tesseract-ocr
[image: KoreanOCRExample.PNG] Hi all, as a Tesseract/OCR newbie, I am currently working on deepening my understanding of the Tesseract foundations and OCR basics. This is why I came across the following strange results: When scanning some Korean Wikipedia pages (related to mathematics),

Re: [tesseract-ocr] What is the state of the C and Python APIs?

2018-08-07 Thread Nick White
pect it's still quite straightforward. Best, Nick -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To

Re: [tesseract-ocr] error in lstm training

2018-06-03 Thread nick
Hi shree, thanks for your reply. I will check it as soon as possible.

On Saturday, June 2, 2018 at 3:56:39 PM UTC+4:30, shree wrote:
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>
> You can only continue_from models in tessdata_best repo which are float models. The

[tesseract-ocr] error in lstm training

2018-06-02 Thread nick
Hi, I tried to fine-tune eng.traineddata. LSTM training raised this error:

lstmtraining --continue_from ./tesseract-4.0.0-beta.1.20180414/tessdata/eng.lstm --traineddata ./tesseract-4.0.0-beta.1.20180414/tessdata/eng.traineddata --max_iterations 400 --debug_interval 0

[tesseract-ocr] how make a .traineddata of combination of two language (arabic+english)

2018-05-30 Thread nick
Hi, is it possible to make a new .traineddata that supports two languages (Arabic + English)? How? Thanks

Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-05-28 Thread nick
AD_LIMIT (through: pytesseract.pytesseract.OMP_NUM_THREADS = 3 or pytesseract.pytesseract.OMP_THREAD_LIMIT = 3) no multi threading is happening.

On Mon, 28 May 2018 at 12:11, nick wrote:
> how and where we could change this variable ?

Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-05-28 Thread nick
How and where can we change this variable?

[tesseract-ocr] Re: use multi threads in tesseract

2018-05-28 Thread nick
I found OMP_THREAD_LIMIT, but I don't know how to change it to 20.
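
For readers wondering where such a variable is set: OMP_THREAD_LIMIT is an environment variable read by the OpenMP runtime, so it belongs in the environment of the tesseract process rather than in a Python attribute. The following is only a minimal sketch under that assumption; `tesseract_env`, `tesseract_cmd`, `run_page` and `ocr_pages` are hypothetical helper names, not part of Tesseract or pytesseract, and the sketch assumes a `tesseract` binary on the PATH. It does not make one page use 20 cores; instead it limits each process to one thread and runs many pages side by side.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tesseract_env(thread_limit, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def tesseract_cmd(image, outbase, lang="eng"):
    """Build the command line: tesseract IMAGE OUTBASE -l LANG."""
    return ["tesseract", image, outbase, "-l", lang]


def run_page(image, outbase, thread_limit=1):
    """OCR one page in its own process with a capped OpenMP thread count."""
    return subprocess.run(
        tesseract_cmd(image, outbase),
        env=tesseract_env(thread_limit),
        check=True,
    )


def ocr_pages(images, workers=20):
    """Run one single-threaded tesseract process per page, `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread only waits on a child process.
        list(pool.map(lambda img: run_page(img, os.path.splitext(img)[0]), images))
```

Calling `ocr_pages(["page1.png", "page2.png"])` (placeholder file names) would write `page1.txt` and `page2.txt`; whether this is faster than the default depends on the build and the machine.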

[tesseract-ocr] use multi threads in tesseract

2018-05-28 Thread nick
Hi, I want to run tesseract on a CPU with 20 cores, but tesseract uses only a few cores when it OCRs a page. How can I change the settings to force tesseract to use all 20 cores? Thanks

[tesseract-ocr] Re: OpenCL GPU offloading significantly slower (Titan XP)

2018-05-24 Thread nick
Hi Janpieter, could we speed up the testing process of tesseract with a GPU? Is it possible?

On Friday, May 11, 2018 at 10:59:39 AM UTC+4:30, Janpieter Sollie wrote:
> Hi George,
>
> The OpenCL engine of tesseract is currently being renewed for improved accuracy. The part that you are

[tesseract-ocr] detect tables and images in the main image with tesseract

2018-05-22 Thread nick
Hi, how can I detect tables and images in the main image with tesseract? Is this possible?

Re: [tesseract-ocr] Re: Training Tesseract4.0 (LSTM) on word level bounding boxes

2018-05-22 Thread nick
Hi, how can we train the tesseract 4 beta with our own dataset of lines?

Re: [tesseract-ocr] a way to extract the location of each components in image

2018-05-21 Thread nick
Hi Zdenko, I ran the basic example code below:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {

[tesseract-ocr] Re: error in running tesseract with API example

2018-05-21 Thread nick
I solved that error, but now a new error is raised:

> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.

How can I solve it? Thanks

[tesseract-ocr] Re: error in running tesseract with API example PYTHON

2018-05-21 Thread nick
The error was:

> Error in findFileFormatStream: failed to read first 12 bytes of file
> terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
> what(): basic_filebuf::underflow error reading the file: iostream error

[tesseract-ocr] error in running tesseract with API example PYTHON

2018-05-21 Thread nick
Hi, I want to run the tesseract API from Python with the code below, but it raises an error:

import os
import ctypes

lang = "eng"
filename = "/usr/src/tesseract-ocr/phototest.tif"
libname = "/usr/local/lib64/libtesseract.so.3"
TESSDATA_PREFIX = os.environ.get('TESSDATA_PREFIX')
if not TESSDATA_PREFIX:

[tesseract-ocr] error in running tesseract with API example

2018-05-21 Thread nick
Hi, I want to run tesseract with this code:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {

Re: [tesseract-ocr] a way to extract the location of each components in image

2018-05-20 Thread nick
On Sun, 20 May 2018 at 8:00, nick <wcd...@gmail.com> wrote:
> hi
>
> is there a way to extract the location of each components (lines) in the image ?
>
> for example : in the attached images specified the location of each c

[tesseract-ocr] Re: What is the exact role of the '(lang).wordlist'

2018-05-20 Thread nick
Is that true? Is the only role of eng.wordlist post-processing?

On Wednesday, March 28, 2018 at 11:29:11 AM UTC+4:30, notorio...@gmail.com wrote:
> What is the exact role of the '(lang).wordlist'
>
> I have no idea (lang).wordlist??
>
> It can help tesseract 4.00 to process postprocessing

[tesseract-ocr] Re: Figure, Graph, Image detection/classification using Tesseract OCR

2018-05-20 Thread nick
This is my question too. Does anyone have an idea for this?

On Friday, April 6, 2018 at 3:34:22 PM UTC+4:30, Mohit Jain wrote:
> I'd like to know if it's possible to use Tesseract OCR for automatically detecting figures, graphs or images which occur in the image? From reviewing the

Re: [tesseract-ocr] Modyfying existing traineddata

2016-02-23 Thread Nick White
essedit_char_whitelist=ABC...123...' on the command line. Nick

Re: [tesseract-ocr] Run Tesseract on linux without shared libraries

2016-01-21 Thread Nick White
orbidden? My first thought in such a case would be a one-liner shell script that executes tesseract for you. Any reason that wouldn't work either? Nick

Re: [tesseract-ocr] Using plain makefiles for fun and profit (was: Run Tesseract on linux without shared libraries)

2016-01-21 Thread Nick White
simple, fast, and customisable way of building, and definitely works correctly with -j12 or whatever I have thrown at it. Any feedback would be very welcome indeed. Nick

Re: [tesseract-ocr] Tesseract for Tibetan

2015-11-25 Thread Nick White
languages in an image, using a plus in the language argument (e.g. '-l eng+spa'). Hope that's helpful.

Nick

0. https://github.com/tesseract-ocr/langdata
1. https://github.com/tesseract-ocr/tessdata
2. https://en.wikipedia.org/wiki/Central_Tibetan_language

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-10 Thread Nick White
On Tue, Nov 10, 2015 at 08:59:19AM -0800, Ryan Baumann wrote:
> Thanks for this, Nick. I'm just getting around to looking into moving my Latin training into the tesstrain.sh system and this is very helpful.

Great, I was planning to do that myself with your Latin training - let me know

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-09 Thread Nick White
OK, I just made a first attempt at some documentation for tesstrain.sh:

https://github.com/nickjwhite/tesseract/wiki/tesstrain.sh

Any and all comments would be very welcome indeed.

Nick

On Sun, Nov 08, 2015 at 07:08:37PM +0530, Sriranga(83yrsold) wrote:
> tmp-kan.

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-02 Thread Nick White
bit until I have the time to explain everything more completely (should be able to do it sometime this week).

Nick

0. clone it with the command 'git clone http://ancientgreekocr.org/grctraining.git'

[tesseract-ocr] Ground truth files

2015-10-29 Thread Nick White
interested? Nick

Re: [tesseract-ocr] Ground truth files

2015-10-29 Thread Nick White
Just a note, all the .git URLs listed below are git repositories, and there isn't a web interface to them on my server, so just clone them directly like this:

git clone http://ancientgreekocr.org/mignetools.git

Nick

On Thu, Oct 29, 2015 at 06:23:21PM +, Nick White wrote:
> Hi

Re: [tesseract-ocr] Tesseract option 4 bug

2015-10-23 Thread Nick White
latest version and see if it works for you? Nick

Re: [tesseract-ocr] Tesseract option 4 bug

2015-10-22 Thread Nick White
Can you give an example of something that isn't working as you expect? Nick

Re: [tesseract-ocr] How to install on Shared Hosting

2015-10-22 Thread Nick White
ere--without-root Note that if you don't have leptonica installed on the machine you'll have to do that first. Let us know if you have any troubles. Nick

Re: [tesseract-ocr] Tesseract 3.04 error.

2015-09-17 Thread Nick White
stderr as failing that should be something you can easily fix by changing the behaviour of your Java code. Certainly compiling an older version of Tesseract (which, as Zdenko says, has significantly worse OS X support) is not the correct way to go. Nick

Re: [tesseract-ocr] Re: Why would this be? -> When I reinitialize tesseract for every call in a loop it consistently runs faster by a something like .1 second per loop iteration

2015-09-17 Thread Nick White
On Fri, Sep 11, 2015 at 12:13:02AM -0700, fsbo.cons...@gmail.com wrote:
> To anyone else who may run across this, it is because of the way C++ uses scope to optimize the code when it compiles. Things that are within the scope of the for loop will run faster than things that have larger

Re: [tesseract-ocr] Re: Easiest way to run Tesseract from a Mac

2015-08-21 Thread Nick White
, they should give us details of why not, and we can fix it. This is free software, we can do better than providing 2 options that work some of the time - we can just fix bugs ourselves. Nick

[tesseract-ocr] Small update on the tools I wrote

2015-04-30 Thread Nick White
) are contained in the repositories listed at http://ancientgreekocr.org/ Hope you're all well out there :) Nick

Re: [tesseract-ocr] Re: Tesseract for recognition the international phonetic transcription

2015-04-30 Thread Nick White
all non-ASCII characters) in the digital 2nd edition OED, which is probably a very good starting point for generating training texts.

Nick

On Thu, Apr 23, 2015 at 01:32:12PM -0700, Tom Morris wrote:
> Has anyone tackled training for the IPA since this initial query? I'm considering using

Re: [tesseract-ocr] Losing accuracy when training tessearct on fonts it already is trained on

2014-08-25 Thread Nick White
On Fri, Aug 22, 2014 at 12:42:21PM -0700, Thomas Bruno wrote:
> Is this common when training from text2image output?
> APPLY_BOXES: boxfile line 5364/748 ((1488,893),(1532,6)): FAILURE! Couldn't find a matching blob FAIL!

Yes, there will be some of these. Check the proportion of failing to

Re: [tesseract-ocr] Can I remove some fonts from an existing traineddata?

2014-08-21 Thread Nick White
On Wed, Aug 20, 2014 at 07:39:50PM -0700, SHEN Fei wrote:
> hi Nick, I'm trying to use tesseract in my mobile phone so the tessdata size is critical. Since I only care about very few fonts, it would be convenient if I could add/remove a special font. Maybe removing some dictionary files

Re: [tesseract-ocr] Makefile:372: recipe for target 'all' failed - using current version with leptonica 1.71 on cygwin

2014-08-21 Thread Nick White
like 'sh autogen.sh', as you did), so you don't cd to a directory containing malicious or weird programs and inadvertently run them when you're trying to run standard system programs. Nick

Re: [tesseract-ocr] tesseract trained successfully but gives:Tesseract Open Source OCR Engine v3.03 with Leptonica Segmentation fault (core dumped)

2014-08-21 Thread Nick White
Hi Dovhani,

Does this happen with all images when using your training, or just one?

Nick

On Thu, Aug 21, 2014 at 03:03:47AM -0700, Dovhani Foneworx wrote:
> Hi guys, I have a problem, I have successfully trained tesseract 3.03 in Ubuntu 14.04 but when I run tesseract it is giving errors

Re: [tesseract-ocr] tesseract trained successfully but gives:Tesseract Open Source OCR Engine v3.03 with Leptonica Segmentation fault (core dumped)

2014-08-21 Thread Nick White
In that case it must be a problem with your training data. Can you let us know the exact commands you used to create it? Alternatively, you could post a gdb backtrace, if you know how to do that.

Nick

On Thu, Aug 21, 2014 at 04:19:40PM +0200, Dovhani Foneworx wrote:
> Hi Nick, this happens

Re: [tesseract-ocr] Re: Tesseract compilation on code blocks (gcc + mingw)

2014-08-21 Thread Nick White
On Thu, Aug 21, 2014 at 11:29:09AM -0700, shree wrote:
> zdenko, the current problem also seems related to strtok_r please see http://stackoverflow.com/questions/12973750/fatal-error-strtok-r-h-no-such-file-or-directory-while-compiling-tesseract-oc

Re: [tesseract-ocr] Can I remove some fonts from an existing traineddata?

2014-08-20 Thread Nick White
not, not at the moment, at least. If you wanted to look into the code, I think enough information is preserved that it would be possible, but it would be a lot of work. Why do you want to? Extra fonts don't degrade recognition to a noticeable extent, in my usage. Nick

Re: [tesseract-ocr] Losing accuracy when training tessearct on fonts it already is trained on

2014-08-20 Thread Nick White
new training alongside the official eng.traineddata, and call it something else, so you call tesseract like this:

tesseract -l eng+mycustomeng image.png outbase

Nick

0. See the training/text2image tool in the main code repository
1. https://groups.google.com/forum/#!topic/tesseract-dev
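
The combined-language call above can also be assembled from a script. This is a small sketch only: `lang_argument` and `build_command` are hypothetical helper names, and it assumes that a `mycustomeng.traineddata` file sits next to `eng.traineddata` in the tessdata directory that tesseract searches.

```python
def lang_argument(languages):
    """Join language names with '+', as in 'eng+mycustomeng'."""
    names = [name.strip() for name in languages if name.strip()]
    if not names:
        raise ValueError("at least one language name is required")
    return "+".join(names)


def build_command(image, outbase, languages):
    """Build the argument list for: tesseract -l LANGS IMAGE OUTBASE."""
    return ["tesseract", "-l", lang_argument(languages), image, outbase]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it.
    print(" ".join(build_command("image.png", "outbase", ["eng", "mycustomeng"])))
```

Keeping the command as a list rather than one string avoids shell quoting problems when image paths contain spaces.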

Re: [tesseract-ocr] Best image pre-processing software

2014-08-20 Thread Nick White
to the above technologies, and I highly value strong documentation support. Sounds good to me. unpaper and scantailor are nice, but everything they do should be well covered by your choice of tools. Nick

Re: [tesseract-ocr] Re: How to disable image pre-processing?

2014-08-13 Thread Nick White
it to Tesseract, and it shouldn't mess with it. Nick

Re: [tesseract-ocr] Error when running make - scanutils.cpp:38:14: error: typedef redefinition with different types ('long' vs '__darwin_off_t' (aka 'long long'))

2014-08-12 Thread Nick White
On Tue, Aug 12, 2014 at 12:58:23PM +0530, Shree Devi Kumar wrote:
> On Tue, Aug 12, 2014 at 4:31 AM, testing1234 cory.hix...@gmail.com wrote:
>> Note.. Step 5 above the last command should be sudo make install-langs
>
> Nick, it maybe helpful to add/update instructions in wiki.

Cory

Re: [tesseract-ocr] Outreach from the Wikisource community

2014-08-12 Thread Nick White
reusable ground truth data to test Tesseract with. Are there programmatic ways of getting at the data, for example downloading all page images and corresponding text that is marked as green, for a specific language / script? Thanks for getting in touch! Nick

Re: [tesseract-ocr] Re: Trying to understand custom dictionaries

2014-08-12 Thread Nick White
in the dictionary. Yes, that is exactly correct. Traun Christopher, if you want to only have certain recognised words printed, the only way to do it is to recognise everything, then run a regex or some other script over the output afterwards. Tesseract doesn't do that itself. Nick
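
The "recognise everything, then filter" approach described above could look roughly like this. It is a sketch under assumptions, not something from the thread: `keep_known_words` is a hypothetical helper, the word pattern only covers unaccented Latin letters and apostrophes, and the OCR text is assumed to have been read already from tesseract's output file.

```python
import re

# Assumption: a "word" is a run of ASCII letters or apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def keep_known_words(ocr_text, allowed_words):
    """Return the words from ocr_text that appear in allowed_words.

    Matching is case-insensitive; the original spelling is preserved.
    """
    allowed = {word.lower() for word in allowed_words}
    return [word for word in WORD_RE.findall(ocr_text) if word.lower() in allowed]


if __name__ == "__main__":
    text = "Invoice t0tal: 42 EUR, Total due"
    print(keep_known_words(text, {"invoice", "total", "due"}))
```

Note that a misread such as "t0tal" is simply dropped rather than corrected, which matches the point made in the post: the filtering happens after recognition and does not change what Tesseract recognises.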

Re: [tesseract-ocr] Passing RegEx to Zone Scans

2014-08-12 Thread Nick White
, but more suggestive. If you were using the API you could totally set only the pattern you wanted, and only recognise the region you with the zone, and that should work quite well. Give it a try if you have time, and let us know how it works.

Nick

On Tue, Jul 29, 2014 at 01:27:10PM -0700, David

Re: [tesseract-ocr] I compiled and installed tesseract from the source on CentOS. I kept both 3.01 and 3.02 versions. I use environment path stored in bash file to point to the version in use.

2014-08-06 Thread Nick White
LD_LIBRARY_PATH=$HOME/svnocr/local/lib
export TESSDATA_PREFIX=$HOME/svnocr/local/share/tesseract-ocr

Nick

P.S. Hi again all, I'm back, after the recent silence.

Re: [tesseract-ocr] Not getting accuracy with Arabic font

2014-08-06 Thread Nick White
are you wanting, and what are you getting? Nick

Re: [tesseract-ocr] OCR using C

2014-08-06 Thread Nick White
://code.google.com/p/tesseract-ocr/wiki/APIExample#Example_using_the_C-API_in_a_C_program Compiling and running it is quite a standard affair across platforms, nothing special is required to link to Tesseract. Nick

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-08-06 Thread Nick White
perfect in every way, you wonderful man. ;) Nick

Re: [tesseract-ocr] Failed to get the text

2014-08-06 Thread Nick White
Hi Fajar, Looks like you should try binarising the image yourself prior to handing it over to Tesseract. Nick
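
The post does not say how to binarise. As one common choice (an assumption here, not the method recommended in the thread), Otsu's global threshold can be sketched in plain Python on 8-bit grayscale values; `otsu_threshold` and `binarise` are hypothetical names, and a real pipeline would normally use an image library rather than nested lists.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) maximising between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue  # no pixels at or below t yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_var:
            best_t, best_var = t, variance
    return best_t


def binarise(rows, threshold):
    """Map each pixel to 0 (dark, <= threshold) or 255 (light)."""
    return [[0 if value <= threshold else 255 for value in row] for row in rows]


if __name__ == "__main__":
    image = [[10, 200], [205, 11]]
    t = otsu_threshold(value for row in image for value in row)
    print(t, binarise(image, t))
```

A single global threshold works best on evenly lit scans; images with shadows or gradients usually need a local (adaptive) method instead.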

Re: [tesseract-ocr] Re: Get Tesseract ocr to ignore or replace images with whitespace

2014-08-06 Thread Nick White
/22425545/stroke-width-transform-opencv-using-python http://stackoverflow.com/questions/6199/stroke-width-transform-swt-implementation-python Nick

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-08-06 Thread Nick White
, and what it contains Does that sound good to people? I'll take silence from the list to mean that sounds perfect in every way, you wonderful man. ;) Thanks, Nick. That's great. You should probably have separate sections for training 3, 3.02, 3.03, 3.03.03 ...etc

Re: [tesseract-ocr] what does width= right -left = no silly +1/-1 mean in this tutorial?

2014-07-17 Thread Nick White
On Wed, Jul 16, 2014 at 11:17:00PM -0700, Jing JC wrote:
> I am going through Ray Smith's tutorial, and don't get it?

He means that as the co-ordinate system uses bottom left as the origin, you will never get a negative co-ordinate (as you could if the origin was elsewhere).
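
To make the bottom-left origin concrete, here is a small sketch that reads one line in the Tesseract 3 box-file layout (`symbol left bottom right top page`) and computes the box size as plain differences, then converts it to the top-left origin that most image tools use. The helper names and the sample numbers are made up for illustration, and the sketch assumes the symbol contains no spaces.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse 'symbol left bottom right top page' (origin at bottom left)."""
    symbol, left, bottom, right, top, page = line.split()
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def size(box):
    """Width and height as plain differences, with no +1/-1 adjustment."""
    return box.right - box.left, box.top - box.bottom


def to_top_left_origin(box, image_height):
    """Return (x, y, width, height) with y measured down from the top edge."""
    width, height = size(box)
    return box.left, image_height - box.top, width, height


if __name__ == "__main__":
    box = parse_box_line("s 734 494 751 519 0")
    print(size(box), to_top_left_origin(box, image_height=600))
```

Because every co-ordinate is measured up and to the right of the bottom-left corner, all four values stay non-negative for any box inside the image.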

Re: [tesseract-ocr] JTessbox Modifying the boxes

2014-07-17 Thread Nick White
On Thu, Jul 17, 2014 at 12:14:43AM -0700, Jing JC wrote:
> The Ray's tutorial said the bounding box overlaps. so when I modify the box inside JTessbox, do I keep the overlapping boxes, or make the boxes non touching.

That's interesting, actually; I didn't realise Tesseract did outlining

Re: [tesseract-ocr] questions when reading unicharset manual: https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html

2014-07-15 Thread Nick White
, so the most significant bit is 1, all the others are zero: 10000. Convert that to hexadecimal and you get 10. b has isalpha and islower set, so it is 00011. Does that make sense to you?

Nick

On Mon, Jul 14, 2014 at 09:54:40PM -0700, Jing JC wrote:
> The example given are: ; 10 Common 46 b
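
The bit arithmetic in this reply can be checked with a few lines of Python. The sketch assumes the bit order given in the unicharset(5) manual that the thread title links to (from least to most significant bit: isalpha, islower, isupper, isdigit, ispunctuation); the function names are hypothetical.

```python
# Bit 0 is the least significant bit of the properties mask.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(hex_mask):
    """Turn the hexadecimal properties field into a list of property names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


def encode_properties(names):
    """Turn property names back into the hexadecimal field."""
    mask = 0
    for name in names:
        mask |= 1 << PROPERTY_BITS.index(name)
    return format(mask, "x")


if __name__ == "__main__":
    print(decode_properties("10"))  # the ';' entry from the example
    print(decode_properties("3"))   # the 'b' entry: binary 00011
```

So ';' has only the top bit set (binary 10000, hexadecimal 10), while 'b' has the two lowest bits set (binary 00011, hexadecimal 3).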

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White
get better. I'll reply to your other email soon. Nick

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White
understanding of the codebase, but ultimately I have limited time and haven't got around to it yet. Are there particular things you'd like documented, that I could start on? Nick

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White
Sorry for the noise. I've looked into this more, and discovered more :)

On Tue, Jul 15, 2014 at 10:54:06AM -0400, Nick White wrote:
> On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote:
>> When I download the traineddata files and extract the unicharset file from them I notice

Re: [tesseract-ocr] How to find the font properties

2014-07-15 Thread Nick White
, if you like: https://code.google.com/p/tesseract-ocr/issues/detail?id=1219 Nick

Re: [tesseract-ocr] questions when reading unicharset manual: https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html

2014-07-15 Thread Nick White
Hi,

On Tue, Jul 15, 2014 at 10:04:24AM -0700, Jing JC wrote:
> yep yep. Thanks a lot Nick. I tried to cancel my post last night, but it seems I cannot get access to it after it is posted but before it is approved. I tried to match the V2's example to V3's format. I figured it out later

Re: [tesseract-ocr] Re: Regarding Tesseract OCR engine for recognizing Tamil Fonts

2014-07-15 Thread Nick White
. Oh, by the way, the Things I would NOT recommend working on is a very old page (from 2010); I wouldn't take it too seriously... Nick

Re: [tesseract-ocr] Re: is tesseract 3.03's source tar available? need to compile on CentOS 5.6

2014-07-14 Thread Nick White
that Ubuntu would take it for their LTS release, and it could then be updated later in Ubuntu (when 3.03 is actually released). It's confusing, but such is life ;) Nick

Re: [tesseract-ocr] builing the svn source code in windows is too difficult.

2014-07-13 Thread Nick White
> I build the tesseract svn source code in win8, I used the VS2013/Cygwin/MinGW to build this, all failed.

Hi, you need to give us more clues as to why it failed. What error messages did you get?

> what version of leptonica the newest svn use? 1.70 or 1.71?

Tesseract should work fine with

Re: [tesseract-ocr] is tesseract 3.03's source tar available? need to compile on CentOS 5.6

2014-07-13 Thread Nick White
On Sun, Jul 13, 2014 at 06:38:11PM +0430, universal reseller wrote:
> is google drive use tesseract 3.03 ?

It's -rc1, meaning release candidate 1. So it isn't an official release, but rather a testing preview release, which should be close to what the final 3.03 will be.

> i checked one english pdf

Re: [tesseract-ocr] Re: need help removing garbage characters from my OCR

2014-07-12 Thread Nick White
general purpose. So basically I just turned every pixel white that wasn't part of a letter, and when I send that to tesseract I get flawless output with the language data I trained. Thanks so much for the replies Paul and Nick, I learned a lot and it put me in the right direction

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White
, but not if you don't explicitly state the -psm.

Nick

On Wed, Jul 09, 2014 at 04:17:47PM -0700, Alex Ryan wrote:
> Paul, I haven't gotten a chance to play around with that yet, but thanks for linking that, I might very well have to go that route. I am having a very confusing issue though that I'm hoping

Re: [tesseract-ocr] Any way to prevent contextual digits-letters flipping ?

2014-07-10 Thread Nick White
Hi,

I haven't tried it, but quickly grepping around the source code suggests setting the config variable crunch_include_numerals to true might do the job. Please let us know if that works.

Nick

On Wed, Jul 09, 2014 at 11:15:10PM -0700, Damien D wrote:
> Hi everyone, tesseract seems

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
of the training tools like this were originally written for internal use by Google and do funky things like depend on map-reduce, so have to be rewritten for us plebs ;)) Nick

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
On Sat, Jul 05, 2014 at 03:34:05PM -0700, Albrecht Hilker wrote:
> Hello zdenop
> It is clear that you are not the right person to answer this question. If YOU would ever have looked into the source code you have seen that these values ARE in use (in version 3.03).

You're being pretty unfair on

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
here. It says g hardly ever rises above 43, whereas 9 can quite happily rise up to 66 (which looks like it roughly corresponds to the baseline, given how many other characters are about there). From that we can guess that 128 is the x-height, and 64 is roughly the baseline. More anon. Nick

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White
this crash? If so, can you open a bug in the issue tracker, attaching the training data and image file that crash it? Thanks, Nick

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
V £ 4 9 Q A P ¢ ] 3 2 © 8 / X é j ; 7 € O ¥ U x } E § = ! ’ G ) Z q { “ — Y K * W \ ° fi ‘ _ fl /* * Copyright 2014 Nick White nick.wh...@durham.ac.uk * * Licensed under the Apache License, Version 2.0 (the License); * you may not use this file except in compliance with the License. * http

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-09 Thread Nick White
On Wed, Jul 09, 2014 at 03:16:08AM -0700, Paul wrote:
> How about using ImageJ (can be automated with macros) to create a better binary result of the image.

Thanks for mentioning this; I hadn't heard of it and it sounds very useful. I added a link to the ImproveQuality wiki page.

Nick

Re: [tesseract-ocr] Is there any influence of the input format of the image PNG vs TIFF

2014-07-09 Thread Nick White
On Wed, Jul 09, 2014 at 09:48:20AM -0700, Rani Yaroshinski wrote:
> From the point of view of the performance measures of the OCR ?

I don't think anybody has figures on this. You could do some tests yourself, and let us know the results. I would guess that file size would be a bigger slowdown

Re: [tesseract-ocr] Is it wise to interfere with the pre-processing pipeline of Leptonica

2014-07-09 Thread Nick White
On Wed, Jul 09, 2014 at 09:50:01AM -0700, Rani Yaroshinski wrote:
> In order to improve the accuracy of the OCR results ?

Yes, it is, if you know more details about the images you'll be using, so can do better than Tesseract's guesses. See

Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-09 Thread Nick White
out a more definitive answer. If you get there first, let me know and I can update the TrainingTesseract3 page as appropriate.

Nick

0. git clone http://ancientgreekocr.org/grc.git

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-09 Thread Nick White
to figure it out in short order. Thanks a lot for bringing this up; as I said, it has been bothering me, but I hadn't found the time to do anything much about it. More soon! Nick 0. git clone http://ancientgreekocr.org/grc.git

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-08 Thread Nick White
#Image_processing Nick

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-04 Thread Nick White
On Fri, Jul 04, 2014 at 02:08:46AM -0700, Meenal Goyal wrote: If you're sure that all the words you will encounter will be in the dictionary this should help somewhat: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?

Re: [tesseract-ocr] New language traineddata based on the existing one.

2014-07-04 Thread Nick White
On Fri, Jul 04, 2014 at 02:15:52AM -0700, Iskander Sharipov wrote: I need to create new tessdata language, which is very similar to russian in charset. Every time I try to do so by training tesseract on a box containing needed letters I get new traineddata, which actually can recognize new

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-03 Thread Nick White
/strength_of_the_dictionary? Nick

Re: [tesseract-ocr] How to download the Tesseract trained data for Digital display numbers ( Seven Segments Data trained data )

2014-07-03 Thread Nick White
? Nick

Re: [tesseract-ocr] Terrible results from Tesseract API

2014-07-03 Thread Nick White
Hi Elena, Just a guess, but maybe this line: api->SetSourceResolution(600); is the source of your troubles? Tesseract from the command line would have just been guessing it, and perhaps its guess, coupled with its ideas about different sizes of fonts, were better than yours? Nick

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-02 Thread Nick White
That's a tough thing to preprocess. Take a look at this recent thread on this list: question about training tesseract. Nick On Tue, Jul 01, 2014 at 11:48:07PM -0700, Meenal Goyal wrote: Hi Nick, I have read that post earlier and also tried to preprocess the image. This is the input image

Re: [tesseract-ocr] Tesseract-OCR

2014-07-01 Thread Nick White
could build the development code for Windows if you're so inclined. If you don't want to do that, you'll have to wait for the release. if not than can anyone say when Tesseract - ocr v 3.03 going to release ? No, we don't have a release date. Sorry. Nick

Re: [tesseract-ocr] How to use the API in linux system

2014-07-01 Thread Nick White
myprogram.cpp -g -Wall -I/home/nick/local/include/tesseract -L/home/nick/local/lib -llept -ltesseract Let us know if you need anything more. I'll probably add these examples to the wiki soon, thanks for the prompt :) Nick

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-01 Thread Nick White
in the first place. Check out this wiki page: https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality If you want to send a specific example image to the mailing list, we can try to offer more specific advice. Nick

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-06-30 Thread Nick White
of such words are: fiJfifilnlflfiflhu-«fifllfllfilfi , neefls» , oscxmwxufis etc. Do you mean you want Tesseract to only match dictionary words, or recognise, but not print, words that aren't in the dictionary? Nick
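The second reading of the question (recognise everything, but keep the non-dictionary words out of the output) can be approximated as a post-processing step on the OCR text. The sketch below is not a Tesseract feature. The function name, the punctuation set and the word list are all assumptions made for illustration.

```python
def split_by_dictionary(ocr_text, dictionary_words):
    """Return (known, unknown) word lists from OCR output, preserving order."""
    vocabulary = {word.lower() for word in dictionary_words}
    known, unknown = [], []
    for token in ocr_text.split():
        # Strip surrounding punctuation so "fox," still matches "fox".
        word = token.strip(".,;:!?\"'()[]«»")
        if not word:
            continue
        if word.lower() in vocabulary:
            known.append(word)
        else:
            unknown.append(word)
    return known, unknown
```

The `unknown` list then holds the candidates for garbage strings like the ones quoted above, and `known` can be joined back into the cleaned text. A real word list would need to cover inflected forms and proper nouns, otherwise valid words are discarded along with the noise.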

Re: [tesseract-ocr] Advice needed on effective hexadecimal recognition

2014-06-30 Thread Nick White
not sure. Nick

Re: [tesseract-ocr] read multi-language ( arabic and english) image

2014-06-27 Thread Nick White
extra information you have to those bugs, to help the issues be resolved sooner: https://code.google.com/p/tesseract-ocr/issues/detail?id=899 https://code.google.com/p/tesseract-ocr/issues/detail?id=1220 Nick

Re: [tesseract-ocr] question on training tesseract for arbitrary big images

2014-06-27 Thread Nick White
as possible, and only give Tesseract the image of the text. There are other people who have done similar things on this list, I recommend you look through the archives to find more information on good ways of doing this. Nick

Re: [tesseract-ocr] Support for Sinhala

2014-06-27 Thread Nick White
we can potentially improve things in the future. If they don't respond, instructions on training Tesseract are on the wiki: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 Nick

Re: [tesseract-ocr] Can tesseract read cursive handwriting?

2014-06-27 Thread Nick White
for an algorithm. Search the list archives, this has been asked (and answered) several times. Nick

Re: [tesseract-ocr] 'BLOCK_LINE_IT' was not declared in this scope

2014-06-27 Thread Nick White
the C-API, or the C++ API? Can you share an example of the code you're using? Thanks, Nick
