Re: [users] OCR and more, a mini lesson (Re: Saving scanned image as ODT?)

Bruce Martin Thu, 08 Jul 2010 11:51:59 -0700

Dear Sigrid and Others:

When you scan a page of text, what you get, as already explained, is an image.


Going a bit deeper, this is a matrix of dots, not computer text.

Computer text is essentially made up of a string of numbers, conventionally written in hexadecimal. Each character in a code page is associated with a specific hex number.

This dictates the character, but not the font, size, colour, style or associated language. All of these attributes are stored by other non-printable codes.
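
To make the idea of "a character is a number" concrete, here is a minimal Python sketch. The sample strings are my own, chosen only for illustration; the code pages shown (Latin-1 and UTF-8) are just two common examples.

```python
# Each character maps to a number; hex is simply the usual way to write it down.
text = "Scan"
codes = [format(ord(ch), "02X") for ch in text]
print(codes)  # ['53', '63', '61', '6E']

# The same character can be stored as different bytes under different encodings.
accented = "é"
latin1_hex = accented.encode("latin-1").hex()
utf8_hex = accented.encode("utf-8").hex()
print(latin1_hex)  # e9
print(utf8_hex)    # c3a9

# Nothing in these numbers records font, size, colour or style.
```

Note that the codes say which character is meant and nothing about how it should look on the page.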


As can be seen, this is radically different from a raster image.
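
For comparison, a rough sketch of what a scanner delivers. The 5x5 grid below is a made-up, one-bit "scan" of the letter T; a real scan has thousands of dots per row and usually many shades per dot.

```python
# A 5x5 one-bit "scan" of the letter T: 1 = dark dot, 0 = light dot.
scanned_t = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

# The same letter as computer text is a single code number.
text_t = "T"

for row in scanned_t:
    print("".join("#" if dot else "." for dot in row))

dark_dots = sum(sum(row) for row in scanned_t)
print(dark_dots)         # 9 dark dots, and nothing that says "this is a T"
print(hex(ord(text_t)))  # 0x54
```

The grid only records where the dots are. Working out that those dots form a "T" is the job of OCR.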

OCR:

However, if that raster image (assuming it is of adequate graphic quality, size and resolution) is then passed through OCR (Optical Character Recognition) software, the result is one or another of several text file formats, all of which can normally be opened in Writer, regardless of the operating platform. OCR software is a separate application and can be rather processor-demanding at times. The resulting text should then be correctly language coded in Writer, spell checked and proofread.

(Often, a device sold to support Windows includes separate OCR software that is really not part of the Windows driver, as it may appear. It is simply linked to the driver for user convenience.

In Linux, the software may be available, but is likely not linked in the same way, depending on how the device back end has been written.)

Linux is more complicated than Windows, but also far more powerful and versatile.


Versatility and complexity generally go hand in hand everywhere.

In short, OCR is like a fast, automated typist - fast, but at times stupid. Accuracy can be anywhere from 0% to 100%, depending on image quality, content and many other factors.
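
To illustrate why accuracy depends so heavily on image quality, here is a toy sketch of the recognition step. It is not how any real OCR product works internally: the three 3x5 templates, the dot-flipping "noise" and all function names are my own assumptions, and real engines are far more sophisticated.

```python
import random

# Hypothetical 3x5 dot templates for three characters; real fonts are far richer.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def flatten(rows):
    """Turn a list of dot rows into one flat list of 0/1 values."""
    return [int(dot) for row in rows for dot in row]


def recognise(bitmap):
    """Return the template character that differs from the bitmap in the fewest dots."""
    def distance(char):
        return sum(a != b for a, b in zip(flatten(TEMPLATES[char]), bitmap))
    return min(TEMPLATES, key=distance)


def add_noise(bitmap, flip_probability, rng):
    """Simulate a poor scan by flipping each dot with the given probability."""
    return [1 - dot if rng.random() < flip_probability else dot for dot in bitmap]


def accuracy(text, flip_probability, seed=0):
    """Fraction of characters recognised correctly after simulated scanning."""
    rng = random.Random(seed)
    hits = 0
    for char in text:
        scanned = add_noise(flatten(TEMPLATES[char]), flip_probability, rng)
        hits += recognise(scanned) == char
    return hits / len(text)


for p in (0.0, 0.2, 0.5):
    print(p, accuracy("TILT" * 25, p))
```

With a clean image (flip probability 0.0) every character matches its own template exactly, so the result is 1.0. As more dots are flipped the "typist" starts guessing wrong, which is why the output always needs proofreading.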


My experience tells me that this is an area where you get what you pay for.

Although there are a couple of Open Source OCR applications in Linux (at least for Fedora 12+), I have not yet seen a GUI front end for them.

Generally they want to be run from a terminal command line, which can make their use very unwieldy. For that reason, I generally do OCR with commercial paid software on a Windows platform.
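
For those who do want to try the command-line route, a small wrapper script can take some of the sting out of it. The following is only a sketch under stated assumptions: it assumes the gocr program (mentioned further down this thread) is installed, that it accepts "-i" followed by an input image and prints the recognised text to standard output, and that the image is in a format gocr can read, such as PNM. The function names are my own.

```python
import shutil
import subprocess


def ocr_command(image_path):
    """Build the command line for gocr (assumed usage: gocr -i <image>)."""
    return ["gocr", "-i", str(image_path)]


def run_ocr(image_path):
    """Return the recognised text, or None if gocr is not installed."""
    if shutil.which("gocr") is None:
        return None
    result = subprocess.run(
        ocr_command(image_path),
        capture_output=True,  # collect the text gocr prints
        text=True,
        check=True,           # raise an error if gocr reports a failure
    )
    return result.stdout
```

Calling run_ocr("page.pnm") would then return the text as a string, ready to be saved as a .txt file and opened in Writer. Check the gocr manual page on your own system before relying on the option shown here.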

My main machine can boot either Fedora 13 x86-64 or Windows XP, BIOS (firmware) controlled.

SANE (Scanner Access Now Easy) is a common front end for scanners in Linux, but each model of scanner or multifunction device must have its own SANE back end. Since there are so many models around, there has not been the programmers' time available to create many of the back ends. Those models which are supported (fully or partially) are listed online in a table on the SANE website.
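
A quick way to find out whether a back end recognises your own device is to ask SANE for its device list. The sketch below assumes the scanimage command-line tool that ships with SANE is installed and uses its "-L" (list devices) option; the function names are my own, and the exact wording of the output will vary from system to system.

```python
import shutil
import subprocess


def list_devices_command():
    """Build the scanimage command that lists the devices SANE can see."""
    return ["scanimage", "-L"]


def list_scanners():
    """Return scanimage's output lines, or None if scanimage is not installed."""
    if shutil.which("scanimage") is None:
        return None
    result = subprocess.run(
        list_devices_command(),
        capture_output=True,
        text=True,
        check=False,  # finding no scanner is reported in the output, not treated as an error here
    )
    return result.stdout.splitlines()
```

If your scanner does not show up in the list, the likely cause is a missing or incomplete back end for that model.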

If one installs a back end for a composite device, such as a combined scanner, printer and/or fax, that back end may support all or any part of the existing features of the hardware and firmware in the machine.

Each type of feature will normally work with the front end appropriate to the type of device and the desktop it is intended for. Hence the printer in such a device is normally handled by CUPS (Common Unix Printing System), while the scanner (and possibly the fax) is handled by the front end for its own type.

Generally, the more expensive the hardware is/was, the more likely it is to be supported.


This is for several plausible reasons.

If a user has an expensive piece of hardware, the high price he paid for it will likely make him (or his company) do what is necessary to ensure that it is supported, whether by choosing a more expensive model to begin with or by having his own programmer do the programming work, which then normally gets shared all over.

On the other hand, a cheap device that has been produced by the zillions often uses proprietary protocols to control it.

Due to the low cost, there is less general interest in struggling with the proprietary protocol, and it may be a copyright violation to reverse engineer it. The amount of time and specialized work required to reverse engineer the device for the information needed to write a back end is also much greater than in the case of an expensive model, which typically uses a standard protocol that has likely already been reverse engineered and needs only minimal patching in the programming.

Also, since the owners paid little for the device, they are less likely to be willing and able to contribute to the support of that device.

Most of this information is public and on various websites, but it is up to each user to do his own research.

Added to this, some of the additional software may be open source, but hosted on paid repositories. This helps cover the high cost of hosting, aside from the open source factor.

Use and cost of these repositories will be described on their web pages. Some of them may also offer for download compiled RPMs or DEBs to configure your system to use their facilities.

Others may simply provide a tarball for you to compile according to your own individual needs.


Finally, I hope this can help keep SANE from appearing /in/sane!

On 08/07/2010 12:30 PM, Sigrid Carrera wrote:

Hi,

2010/7/8 Gordon<[email protected]>:

On 08/07/2010 17:08, Marcello Romani wrote:


When the scan is complete, you'll have an ordinary odt document with an
image in it.
Regardless of whether the image has been scanned or inserted by hand, it
will be editable with the usual limitations of OOo (brightness,
contrast, tone, etc.), which have nothing to do with scanning of course.


I think we're drifting off here!
What I am really after is this.
If I connect to the Officejet in Windows and put a document on the scanner,
there is an option to scan the document in as a Word document, not as an
image file, and the resultant file can be opened and edited just like a
normal Word document.
Now unfortunately, the Linux driver for this all-in-one machine doesn't
allow that function so I have had to install a third-party scanning utility
such as Xsane. I haven't found yet a scanning utility that will allow the
scanned image to be "saved as" an editable word-processing document and not
as an uneditable image file.
I'm wondering if any such thing actually exists in Linux...


Yes, it does exist. Search for OCR functionality, for example gocr.
I don't have any experience with this, so I don't know how good it is.
But at least, I know, that it exists. ;)

Sigrid

